  1. Nov 2025
    1. Author response:

      The following is the authors’ response to the original reviews.

      Public Reviews:

      Reviewer #1 (Public Review):

      Summary:

      This paper investigates the neural mechanisms underlying the change in perception when viewing ambiguous figures. Each possible percept is related to an attractor-like brain state and a perceptual switch corresponds to a transition between these states. The hypothesis is that these switches are promoted by bursts of noradrenaline that change the gain of neural circuits. The authors present several lines of evidence consistent with this view: pupil diameter changes during the time point of the perceptual change; a gain change in neural network models promotes a state transition; and large-scale fMRI dynamics in a different experiment suggests a lower barrier between brain states at the change point. However, some assumptions of the computational model seem not well justified and the theoretical analysis is incomplete. The paper would also benefit from a more in-depth analysis of the experimental data.

      Strengths:

      The main strength of the paper is that it attempts to combine experimental measurements - from psychophysics, pupil measurements, and fMRI dynamics - and computational modeling to provide an emerging picture of how a perceptual switch emerges. This integrative approach is highly useful because the model has the potential to make the underlying mechanisms explicit and to make concrete predictions.

      Weaknesses:

      A general weakness is that the link between the three parts of the paper is not very strong. Pupil and fMRI measurements come from different experiments and additional analysis showing that the two experiments are comparable should be included. Crucially, the assumptions underlying the RNN modeling are unclear and the conclusions drawn from the simulation may depend on those assumptions.

      With this comment in mind, we have made a substantial effort to better integrate the three aspects of our paper. On the pupillometry side, we now show that the dynamic uncertainty associated with perceptual categorisation shares a similar waveform with the observed fluctuations in pupil diameter around the switch point (Fig 2B). To better link the modelling to the behaviour, we have also made the gain of the activation function of each sigmoidal unit change dynamically as a function of the uncertainty (i.e. the entropy) of the network’s classification. This generates phasic changes in gain that mimic the observed phasic changes in pupil dilation, explicitly linking the dynamics of gain in the RNN to the observed dynamics of pupil diameter (our non-invasive proxy for neuromodulatory tone). We also note that the predictions of the RNN (a flattened egocentric landscape and peaks in low-dimensional brain state velocity at the time point of the perceptual switch) were tested directly in the whole-brain BOLD data, which links the modelling and the BOLD analysis. Finally, whilst we agree that an experiment in which pupillometry and BOLD data were collected simultaneously would be ideal, these data were not available to us at the time of this study.

      Main points:

      Perceptual tasks in pupil and fMRI experiments: how comparable are these two tasks? It seems that the timing is very different, with long stimulus presentations and breaks in the fMRI task and a rapid sequence in the pupil task. Detailed information about the task timing in the pupil task is missing. What evidence is there that the same mechanisms underlie perceptual switches at these different timescales? Quantification of the distributions of switching times/switching points in both tasks is missing. Do the subjects in the fMRI task show the same overall behavior as in the pupil task? More information is needed to clarify these points.

      We recognize the need for a more detailed and comparative analysis of the perceptual tasks used in our pupil and fMRI experiments, particularly regarding differences in timing, task structure, and instructions. The fMRI task incorporates jittered inter-trial intervals (ITIs) of 2, 4, 6, and 8 seconds, designed to enable effective deconvolution of the BOLD response (Stottinger et al., 2018). In contrast, the pupil task presents a more rapid sequence of stimuli without ITIs. These timing differences are reflected in the mean perceptual switch points: the 8th image in the fMRI task and the 9th image in the pupil task. This small yet consistent difference suggests subtle influences of task design on behavior.

      Despite these structural and instructional differences, our analyses indicate that overall behavioral patterns remain consistent across the two modalities. The distributions of switching times align closely, and no significant behavioral deviations were observed that might suggest a fundamental difference in the underlying mechanisms driving perceptual switches. These findings suggest that the additional time and structural differences in the fMRI task do not significantly alter the behavioral outcomes compared to the pupil task.

      To address these issues, we have added paragraphs in the Results, Methods, and Limitations sections of the manuscript. In the Results section, we provide a detailed comparison of switching point distributions across the two tasks, emphasizing behavioral consistencies and any observed variations. In the Methods section, we include an expanded description of task timing, instructions, and the presence or absence of catch trials to ensure clarity regarding the experimental setups. Finally, in the Limitations section, we acknowledge the structural differences between the tasks, particularly the lack of catch trials and rapid stimulus presentation in the pupil task, and discuss how these differences may influence perceptual dynamics.

      These additions aim to clarify how task-specific factors, such as timing, instructions, and catch trials, influence perceptual dynamics while highlighting the consistency in behavioral outcomes across both experimental setups. We believe these revisions address the concerns raised and enhance the manuscript’s transparency and rigor.

      Computational model:

      (1) Modeling noradrenaline effects in the RNN: The pupil data suggests phasic bursts of NA would promote perceptual switches. But as I understand, in the RNN neuromodulation is modeled as different levels of gain throughout the trial. Making the neural gain time-dependent would allow investigation of whether a phasic gain change can explain the experimentally observed distribution of switching times.

      We thank the reviewer for this very helpful suggestion. We updated the RNN so that, post-training, gain changes dynamically as a function of the network's classification uncertainty (i.e. the entropy of the network's output). Specifically, the gain dynamics of each unit in the neural network are governed by a linear ODE with a forcing function given by the entropy of the network’s classification. This explicitly tests the hypothesis that uncertainty-driven increases in gain near the perceptual switch (when the input is maximally ambiguous) speed perceptual switches, and allows us to distinguish between tonic and phasic increases in gain (in the absence of uncertainty forcing, gain decays exponentially to a tonic value of 1). Importantly, in line with our hypothesis, we found that switch times decreased as we increased the impact of uncertainty on gain (i.e. as the magnitude of uncertainty forcing increased). Finally, we wish to note that although making gain dynamical is relatively simple conceptually, implementing it and then analysing the dynamics turned out to be highly non-trivial. To our knowledge, our model is the first RNN of reasonable size to implement dynamical gain, which required us to push the RNN modelling beyond the current state of the art (see Fig 2 - 4).
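
      As a minimal sketch of the kind of gain dynamics described in this response (not the implementation used in the manuscript), the following integrates a linear ODE in which gain relaxes to a tonic value of 1 and is driven by the entropy of a two-class output. The time constant `tau`, the forcing strength `k`, the step size `dt`, and the linearly morphing class probabilities are all illustrative assumptions.

```python
import math

def entropy(p):
    """Shannon entropy (in bits) of a class-probability vector."""
    return -sum(q * math.log2(q) for q in p if q > 0.0)

def simulate_gain(probs, k=1.0, tau=0.1, dt=0.01, g_tonic=1.0):
    """Forward-Euler integration of tau * dg/dt = -(g - g_tonic) + k * H(p(t)).

    probs : sequence of (p1, p2) class probabilities, one per time step
    k     : assumed strength of the uncertainty forcing
    tau   : assumed time constant of the gain dynamics (seconds)
    """
    g = g_tonic
    trace = []
    for p in probs:
        dg = (-(g - g_tonic) + k * entropy(p)) / tau
        g += dt * dg
        trace.append(g)
    return trace

# Toy trial: the class probability morphs linearly from category 1 to category 2,
# so uncertainty (entropy) peaks in the middle of the trial.
n_steps = 100
probs = [(1.0 - t / (n_steps - 1), t / (n_steps - 1)) for t in range(n_steps)]

gain_forced = simulate_gain(probs, k=0.5)   # phasic, uncertainty-driven gain
gain_tonic = simulate_gain(probs, k=0.0)    # no forcing: gain stays at its tonic value
peak_step = max(range(n_steps), key=lambda i: gain_forced[i])
```

      With `k = 0` the gain stays at its tonic value, whereas with `k > 0` it rises transiently around the point of maximal ambiguity and lags it slightly because of the finite time constant.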

      (2) Modeling perceptual switches: in the results, it is described that the networks were trained to output a categorical response, but the firing rates in Fig 2B do not seem categorical but rather seem to follow the input stimulus. The output signals of the network are not shown. If I understand correctly, a trivial network that would just represent the two input signals without any internal computation and relay them to the output would do the task correctly (because "the network's choice at each time point was the maximum of the two-dimensional output", p. 22). This seems like cheating: the very operation that the model should perform is to signal the change, in a categorical manner, not to represent the gradually changing input signals.

      The output of the network was indeed trained to be categorical via a cross-entropy loss function, with the output defined by the max of the projection of the excitatory hidden units onto the output weights, which is standard RNN modelling practice. As requested, we now show the output in Fig 2B. On the broader question of whether a trivially small network could solve the task, we are in total agreement that, with the right set of hand-crafted weights, a two-neuron sigmoidal network with winner-take-all readout could solve the task. We disagree, however, that using an RNN is cheating in any way. Many tasks in neuroscience can be trivially solved with a very small number of recurrent units (e.g. basically all 2AF tasks). The question we were interested in is how the brain might solve the task and, more specifically, how neuromodulatory control of gain changes the dynamics of our admittedly very simple task. We could have done this by hand-crafting a small network to solve the task, but we wanted to use the RNN modelling as a means of both hypothesis testing and hypothesis generation. We now expand on and justify this modelling choice in the second paragraph of the discussion:

      “We chose to use an RNN, instead of a simpler (more transparent) model, as we wanted to use the RNN as a means of both hypothesis generation and hypothesis testing. Specifically, unlike more standard neuronal models which are handcrafted to reproduce a specific effect, when building an RNN the modeller only specifies the network inputs, labels, and the parameter constraints (e.g. Dale’s law) in advance. The dynamics of the RNN are entirely determined by optimisation. Post-training manipulations of the RNN are not built in, or in any way guaranteed to work, making them more analogous to experimental manipulations of an approximately task-optimal brain-like system. Confirmatory results are arguably, therefore, a first step towards an in vitro experimental test.”

      (3) The mechanism of how increased gain leads to faster switches remains unclear to me. My first intuition was that increasing the gain of excitatory populations (the situation shown in Fig. 2E) in discrete attractor models would lead to deeper attractor wells and this would make it more difficult to switch. That is, a higher gain should lead to slower decisions in this case. However, here the switching time remains constant for a gain between 1 and 1.5. Lowering the gain, on the other hand, leads to slower switching. It is, of course, possible that the RNN behaves differently than classical point attractor models or that my intuition is incorrect (though I believe it is consistent with previous literature, e.g. Niyogi & Wong-Lin 2013 (doi:10.1371/journal.pcbi.1003099) who show higher firing rates - more stable attractors - for increased excitatory gain).

      We thank the reviewer for the astute observation, which we entirely agree with. The energy landscape analysis is a method still under active development within our group, and we are still learning how best to explain it and its relationship to more traditional ways of quantifying potential-like energy functions of dynamical systems, which we think the reviewer has in mind. We have now included a second type of energy landscape analysis, which gives a complementary perspective on the RNN dynamics and is more straightforwardly comparable to typical potential functions. We describe the new analysis in the section “Large-scale neural predictions of recurrent neural network model” as follows:

      “Crucially, there are two complementary viewpoints from which we can construct an energy landscape: the first, allocentric (i.e., third-person view) perspective quantifies the energy associated with each position in state space, whereas the second, egocentric (i.e., first-person view) perspective quantifies the energy associated with relative changes, independent of the direction of movement or the location in state space. The allocentric perspective is straightforwardly comparable to the potential function of a dynamical system but can only be applied to low-dimensional data in settings where a position-like quantity is meaningfully defined. The egocentric perspective is analogous to taking the point of view of a single particle in a physical setting and quantifying the energy associated with movement relative to the particle’s initial location. An egocentric framework is thus more applicable when signal magnitude is relative rather than absolute. See Materials and Methods, and Fig S4 for an intuitive explanation of the allocentric and egocentric energy landscape analysis on a toy dynamical system.”

      From the allocentric perspective it is entirely true that increasing gain increases the depth of the landscape, equivalent to increasing the depth of the attractor. However, because the input to the network changes dynamically, the location of the approximate fixed-point attractor changes and the network state “chases” this attractor over the course of the trial. Importantly, the location of the energy minima changes more rapidly as gain increases, effectively forcing the network to rapidly change course at the point of the perceptual switch (see Fig 4). To quantify this effect we constructed a new measure, neural work, which describes the amount of “force” exerted on the low-dimensional neural trajectory by the vector field quantified by the allocentric landscape. Specifically, we treat the allocentric landscape as analogous to a potential function and then leverage the fact that force is equal to the negative gradient of potential energy to calculate the work (force × displacement) done on the low-dimensional trajectory at each time point. This showed that, as gain increases, the amount of work done on the neuronal trajectory at turning points increases, analogous to the application of an external force transiently increasing the kinetic energy of an object. From the perspective of the egocentric landscape, this results in a flattening of the landscape, as a lower energy (i.e. higher probability) is assigned to large deviations in the neuronal trajectory around the perceptual switch.

      Because of the novelty of the analyses, we went to great lengths to carefully explain the methods in the updated manuscript. In addition, we wrote a short tutorial-style MATLAB script implementing both the allocentric and egocentric landscape analysis on a toy dynamical system with a known potential function (a supercritical pitchfork bifurcation).
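
      The work calculation described above (force as the negative gradient of a potential, multiplied by displacement at each time point) can be illustrated on the same kind of toy system. The following is a minimal Python sketch, not the MATLAB tutorial referred to above: it uses the analytic potential of the supercritical pitchfork normal form in place of an empirically estimated allocentric landscape, and the parameter `r`, the step size, and the initial condition are illustrative assumptions.

```python
def potential(x, r=1.0):
    """Potential of the supercritical pitchfork normal form dx/dt = r*x - x**3."""
    return -0.5 * r * x**2 + 0.25 * x**4

def force(x, r=1.0, h=1e-5):
    """Force as the negative gradient of the potential (central difference)."""
    return -(potential(x + h, r) - potential(x - h, r)) / (2.0 * h)

def work_along(trajectory, r=1.0):
    """Work done on a 1-D trajectory at each step: force x displacement,
    with the force evaluated at the midpoint of the step."""
    work = []
    for x0, x1 in zip(trajectory, trajectory[1:]):
        midpoint = 0.5 * (x0 + x1)
        work.append(force(midpoint, r) * (x1 - x0))
    return work

# Noise-free relaxation from near the unstable point x = 0 to the well at x = 1.
dt, x = 0.01, 0.05
trajectory = [x]
for _ in range(2000):
    x += dt * force(x)
    trajectory.append(x)

step_work = work_along(trajectory)
total_work = sum(step_work)
potential_drop = potential(trajectory[0]) - potential(trajectory[-1])
```

      For a conservative force the summed work should match the drop in potential along the trajectory, which gives a simple sanity check on the per-step values.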

      (4) From the RNN model it is not clear how changes in excitatory and inhibitory gain lead to slower/faster switching. In order to better understand the role of inhibitory and excitatory gain on switching, I would suggest studying a simple discrete attractor model (a rate model, for example as in Wong and Wang 2006 or Roxin and Ledberg, Plos Comp. Bio 2008) which will allow to study these effects in terms of a very few model parameters. The Roxin paper also shows how to map rate models onto simplified one-dimensional systems such as the one in Fig S3. Setting up the model using this framework would allow for making much stronger, principled statements about how gain changes affect the energy landscape, and under which conditions increased inhibitory gain leads to faster switching.

      One possibility is that increasing the excitatory gain in the RNN leads to saturated firing rates. If this is the reason for the different effects of excitatory and inhibitory gain changes, it should be properly explained. Moreover, the biological relevance of this effect should be discussed (assuming that saturation is indeed the explanation).

      We thank the reviewer for this excellent suggestion. After some consideration, we decided that studying a reduced model would likely not do justice to the dynamical mechanisms of the RNN, especially after making gain dynamical rather than stationary. Still, we very much share the reviewer’s concern that we need a stronger link between the (now dynamical) gain alterations and energy landscape dynamics. To this end, we now describe and interrogate the dynamics of the RNN at a circuit level through selectivity and lesion-based analyses, at a population level through analysis of the dynamical regime traversed by the network, and finally through an extended energy landscape framework, which has far stronger links to traditional potential-based descriptions of low-dimensional dynamical systems (see also our response to comment 3 above).

      At a circuit level, the speeding of perceptual switches is mediated by inhibition of the initially dominant population, which we describe in paragraphs 7 and 8 of the section “Computational evidence for neuromodulatory-mediated perceptual switches in a recurrent neural network” as follows:

      “Having confirmed our hypothesis that increasing gain as a function of the network uncertainty increased the speed of perceptual switches, we next sought to understand the mechanisms governing this effect, starting with the circuit level and working our way up to the population level (c.f. Sherringtonian and Hopfieldian modes of analysis(66)). Because of the constraint that the input and output weights are strictly positive, we could use their (normalised) value as a measure of stimulus selectivity. Inspection of the firing rates sorted by input weights revealed that the networks had learned to complete the task by segregating both excitatory and inhibitory units into two stimulus-selective clusters (Fig 2C). As the inhibitory units could not contribute to the network’s readout, we hypothesised that they likely played an indirect role in perceptual switching by inhibiting the population of excitatory neurons selective for the currently dominant stimulus, allowing the competing population to take over and a perceptual switch to occur.

      To test this hypothesis, we sorted the inhibitory units by the selectivity of the excitatory units they inhibit (i.e. by the normalised value of the readout weights). Inspecting the histogram of this selectivity metric revealed a bimodal distribution with peaks at each extreme, each strongly inhibiting one stimulus-selective excitatory population to the exclusion of the other (Fig S2). Based on the fact that, leading up to the perceptual switch point, both the input and the firing rate of the dominant population are higher than those of the competing population, we hypothesised that gain likely speeds perceptual switches by actively inhibiting the currently dominant population rather than exciting/disinhibiting the competing population. We predicted, therefore, that lesioning the inhibitory units selective for the stimulus that is initially dominant would dramatically slow perceptual switches, whilst lesioning the inhibitory units selective for the stimulus the input is morphing into would have a comparatively minor slowing effect on switch times, since the population is not receiving sufficient input to take over until approximately half way through the trial irrespective of the inhibition it receives. As selectivity is not entirely one-to-one, we expected both lesions to slow perceptual switches but to differ in magnitude. In line with our prediction, lesioning the inhibitory units strongly selective for the initially dominant population greatly slowed perceptual switches (Fig 3F upper), whereas lesioning the population selective for the stimulus the input morphs into removed the speeding effect of gain but had a comparatively small slowing effect on perceptual switches (Fig 3F lower).”

      At the population level, we characterised the dynamics of the 2D parameter space (defined by gain and the difference between the input dimensions) traversed by the network over the course of a trial as input and gain dynamically change. We describe this in paragraphs 9-14 of the section “Computational evidence for neuromodulatory-mediated perceptual switches in a recurrent neural network”, which we reprint below for the reviewer’s convenience:

      “Based on the selectivity of the network firing rates, we hypothesised that the dynamics were shaped by a fixed-point attractor whose location and existence were determined by gain and the input difference, and thus changed dynamically over the course of a single trial(67-70). Because of the large size of the network, we could not solve for the fixed points or study their stability analytically. Instead, we opted for a numerical approach and characterised the dynamical regime (i.e. the location and existence of approximate fixed-point attractors) across all combinations of gain and input difference visited by the network. Specifically, for each combination of elements in the parameter space we ran 100 simulations with initial conditions (firing rates) drawn from a uniform distribution on [0,1], and let the dynamics run for 10 seconds of simulation time (10 times the length of the task; longer simulation times did not qualitatively change the results) without noise. As we were interested in the existence of fixed-point attractors rather than their precise location, at each time point we computed the difference in firing rate between successive time points across the network. For each simulation we computed both the proportion of trials that converged to a value below 10^-2, giving us a proxy for the presence of fixed points, and the time to convergence, giving us a measure of the “strength” of the attractor.

      Across gain values, when the input had unambiguous values, the network rapidly converged across all initialisations (Fig 3A & 3C-H). When the input became ambiguous, however, the dynamics acquired a decaying oscillation and did not converge within the time frame of the simulation. As gain increased, the range of input-difference values characterised by oscillatory dynamics broadened. Crucially, for sufficiently high values of gain, ambiguous input values transitioned the network into a regime characterised by high-amplitude inhibition-driven oscillations (Fig 3D & 3G). Each trial can, therefore, be characterised by a trajectory through this 2-dimensional parameter space, with dynamics shaped by the dynamical regimes of each location visited (Fig 3A-B).

      When uncertainty has a small impact on gain the network has a trajectory through an initial regime characterised by the rapid convergence to a fixed point where the population representing the initial stimulus dominated whilst the other was silent (Fig 3C), an uncertain regime characterised by oscillations with all neurons partially activated (Fig 3D), and after passing through the oscillatory regime, the network once again enters a new fixed-point regime where the population representing the initial stimulus is now silent and the other is dominant (Fig 3E).

      For high-gain trials, the network again started and finished in states characterised by a rapid convergence to a fixed point representing the dominant input dimension (Fig 3F-H), but differed in how it transitioned between these states. Uncertain inputs now generated high-amplitude oscillations, with the network flip-flopping between active and silent states (Fig 3G). We hypothesised that, within the task, this has the effect of silencing the initially dominant population and boosting the competing population. To test this, we initialised each network with parameter values well inside the oscillatory regime (u = [0.5, 0.5], gain = 1.5) with initial conditions determined by the selectivity of each unit. Excitatory units selective for input dimension 1, as well as the associated inhibitory units projecting to this population, were fully activated, whilst the excitatory units selective for input dimension 2 and the associated inhibitory units were silenced. As we predicted, when initialised in this state the network dynamics displayed an out-of-phase oscillation where the initially dominant population was rapidly silenced and the competing population was boosted after a brief delay (219 ± 114 ms; Fig S3).”

      From this we concluded that, at a population level, heightened gain leading up to the perceptual switch speeds the switch by transiently pushing the dynamics into an unstable dynamical regime. This replaces the fixed-point attractor representing the input with an oscillatory regime that actively inhibits the currently dominant population and boosts the competing population, before the network transitions back into a regime with a stable (approximate) fixed-point attractor representing the new stimulus (Fig 3F-H & Fig S3).
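
      The numerical convergence test described in the reprinted passage can be sketched on a toy system. The code below is an illustration only: the two-unit mutual-inhibition weights, the time constant, and the reading of the criterion as "largest step-to-step change in firing rate below a tolerance" are assumptions, and the toy network is not the trained RNN (in particular, it does not obey Dale's law and does not show the oscillatory regime).

```python
import math
import random

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def step(r, u, gain, W, dt=0.01, tau=0.1):
    """One Euler step of tau * dr/dt = -r + sigmoid(gain * (W r + u))."""
    new_r = []
    for i, ri in enumerate(r):
        drive = sum(W[i][j] * r[j] for j in range(len(r))) + u[i]
        new_r.append(ri + dt / tau * (-ri + sigmoid(gain * drive)))
    return new_r

def convergence_stats(u, gain, W, n_sims=100, t_max=10.0, dt=0.01, tol=1e-2, seed=0):
    """Fraction of noise-free runs whose largest step-to-step change in firing
    rate falls below tol, and the mean time at which that first happens."""
    rng = random.Random(seed)
    n_steps = round(t_max / dt)
    times = []
    for _ in range(n_sims):
        r = [rng.random() for _ in W]          # initial rates uniform on [0, 1]
        for k in range(n_steps):
            r_next = step(r, u, gain, W, dt)
            change = max(abs(a - b) for a, b in zip(r_next, r))
            r = r_next
            if change < tol:
                times.append((k + 1) * dt)
                break
    fraction = len(times) / n_sims
    mean_time = sum(times) / len(times) if times else float("nan")
    return fraction, mean_time

# Hypothetical toy weights: two units that inhibit each other.
W_toy = [[0.0, -2.0], [-2.0, 0.0]]
fraction, mean_time = convergence_stats(u=[1.0, 0.0], gain=1.0, W=W_toy)
```

      Sweeping `gain` and the input vector `u` over a grid and recording `fraction` and `mean_time` at each point would give maps analogous in spirit to the presence and "strength" of approximate fixed points described above.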

      As we describe in our response to comment 3 above, our extended energy-landscape analysis framework now includes an explicit link between the potential of the dynamical system and the allocentric landscape, whilst also explaining how a transient deepening of the allocentric landscape (which can essentially be thought of as analogous to a traditional potential function) relates to the flattening of the egocentric landscape.

      Finally, whilst we appreciate the interest in further characterising the effect of inhibitory gain compared with excitatory gain, the topic is largely orthogonal to the aims of our paper, so we have removed the discussion of inhibitory vs excitatory gain. Still, we understand that we need to do our due diligence and check that our results do not break down when we manipulate either inhibitory or excitatory gain in isolation. To this end, we checked that dynamical gain still speeded perceptual switches when the effect was restricted to either inhibitory or excitatory cells. We show the behavioural plots below for the reviewer’s interest.

      Author response image 1.

      Switch time as a function of uncertainty forcing

      Alternative mechanisms:

      It is mentioned in the introduction that changes in attention could drive perceptual switches. A priori, attention signals originating in the frontal cortex may be plausible mechanisms for perceptual switches, as an alternative to LC-controlled gain modulation. Does the observed fMRI dynamics allow us to distinguish these two hypotheses? In any case, I would suggest including alternative scenarios that may be compatible with the observed findings in the discussion.

      We agree with the reviewer, in that attention is itself a confound and a process that is challenging to disentangle from the perceptual switching process in the current task. Importantly, we were not arguing for exclusivity in our manuscript, but merely testing the veracity of the hypothesis that the ascending arousal system may play a causal role in mediating and/or speeding perceptual switches. Future work with experiments that more specifically aim to dissociate these different features will be required to tease apart these different possibilities.

      Reviewer #2 (Public Review):

      Strengths

      - the study combines different methods (pupillometry, RNNs, fMRI).

      - the study combines different viewpoints and fields of the scientific literature, including neuroscience, psychology, physics, dynamical systems.

      - This combination of methods and viewpoints is rarely done, it is thus very useful.

      - Overall well-written.

      Weaknesses

      - The study relies on a report paradigm: participants report when they identify a switch in the item category. The sequence corresponds to the drawing of an object being gradually morphed into another object. Perceptual switches are therefore behaviorally relevant, and it is not clear whether the effect reported corresponds to the perceptual switch per se, or the detection of an event that should change behavior (participants press a button indicating the perceived category, and thus switch buttons when they identify a perceptual change). The text mentions that motor actions are controlled for, but this fact only indicates that a motor action is performed on each trial (not only on the switch trial); there is still a motor change confounded with the switch. As a result, it is not clear whether the effect reported in pupil size, brain dynamics, and brain states is related to a perceptual change, or a decision process (to report this change).

      We agree with the reviewer that the motor change is confounded with the perceptual switch to some degree, but since motor preparation occurs on every trial, we suspect that it is more accurate to describe the switch as confounded with task relevance rather than with motor preparation per se. While it is possible that pupil diameter, network topology and energy landscape features are all related to the motor change rather than the perceptual switch, we note that the weight of evidence is against this interpretation, given the simple mechanistic explanation created by the coupling of perceptual uncertainty to network gain.

      - The study presents events that co-occur (perceptual switch, change in pupil size, energy landscape of brain dynamics) but we cannot identify the causes and consequences. Yet, the paper makes several claims about causality (e.g. in the abstract "neuromodulatory tone ... causally mediates perceptual switches", in the results "the system flattening the energy landscape ... facilitated an updating of the content of perception").

      We have made an effort to soften the causal language where appropriate. In addition, we note that we have changed the title to “Gain neuromodulation mediates task-relevant perceptual switches: evidence from pupillometry, fMRI, and RNN Modelling” to reflect the fact that our claims do not extend to cases of perceptual switches where the stimulus is only passively observed.

      - Some effects may reflect the expectation of a perceptual switch, rather than the perceptual switch per se. Given the structure of the task, participants know that there will be a perceptual switch occurring once during a sequence of morphed drawings. This change is expected to occur roughly in the middle of the sequence, making early switches more surprising, and later switches less surprising. Differences in pupil response to early, medium, and late switches could reflect this expectation. The authors interpret this effect very differently ("the speed of a perceptual switch should be dependent on LC activity").

      The task includes catch trials designed to reduce the expectation of a perceptual switch. In these trials, a perceptual switch occurs either earlier or later than usual. While these trials are valuable for mitigating predictability, we did not focus extensively on them, as they were thoroughly discussed in the original paper. Additionally, due to the limited number of catch trials, it is difficult, if not impossible, to calculate a reliable mean surprise per image set.

      It is also worth noting that the pupil study does not include catch trials, which could contribute to differences in how perceptual switches are processed and interpreted between the fMRI and pupil experiments.

      - The RNN is far more complex than needed for the task. It has two input units that indicate the level of evidence for the two categories being morphed, and it is trained to output the dominant category. A (non-recurrent) network with only these two units and an output unit whose activity is a sigmoid transform of the difference in the inputs can solve the task perfectly. The RNN activity is almost 1-dimensional probably for this reason. In addition, the difficult part of the computation done by the human brain in this task is already solved in the input that is provided to the network (the brain is not provided with the evidence level for each category, and in fact, it does not know in advance what the second category will be).

      We agree that a simpler model could perform the task. We opted to use an RNN rather than hand craft a simpler model as we wanted to use the model as both a method of hypothesis testing and hypothesis generation. We now expand on and justify this modelling choice in the second paragraph of the discussion (also see our response to Reviewer 1 comment 4):

      “We chose to use an RNN, instead of a simpler (more transparent) model as we wanted to use the RNN as a means of both hypothesis generation and hypothesis testing. Specifically, unlike more standard neuronal models which are handcrafted to reproduce a specific effect, when building an RNN the modeller only specifies the network inputs, labels, and the parameter constraints (e.g. Dale’s law) in advance. The dynamics of the RNN are entirely determined by optimisation. Post-training manipulations of the RNN are not built in, or in any way guaranteed to work, making them more analogous to experimental manipulations of an approximately task-optimal brain-like system. Confirmatory results are arguably, therefore, a first step towards an in vitro experimental test.”

      In other words, a simpler model would not have been appropriate to the aims. In addition we note that low dimensional dynamics are extremely common in the RNN literature and are in no way unique to our model. 
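
      The reviewer's feedforward alternative can be written down in a few lines. The sketch below is only an illustration of that point, not the authors' model: the function names and the gain value of 10 are assumptions chosen for the example. It takes the two evidence inputs u1 and u2 = 1 - u1 and passes their difference through a sigmoid.

```python
import math

def sigmoid(x, gain=1.0):
    """Logistic function; `gain` sets the slope around x = 0."""
    return 1.0 / (1.0 + math.exp(-gain * x))

def feedforward_choice(u1, u2, gain=10.0):
    """Probability that category 1 dominates, computed from u1 - u2 only.

    Hypothetical stand-in for the reviewer's non-recurrent network;
    the gain of 10 is an arbitrary illustrative value.
    """
    return sigmoid(u1 - u2, gain)

# Morph sequence as described by the reviewer: u1 falls from 1 to 0, u2 = 1 - u1.
steps = 11
for k in range(steps):
    u1 = 1.0 - k / (steps - 1)
    u2 = 1.0 - u1
    p1 = feedforward_choice(u1, u2)
    label = "category 1" if p1 > 0.5 else "category 2"
    print(f"u1={u1:.1f} u2={u2:.1f} p(category 1)={p1:.3f} -> {label}")
```

      A model of this kind labels the sequence correctly but has no internal dynamics to perturb after training, which is the property the RNN was chosen for.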

      - Basic fMRI results are missing and would be useful, before using elaborate analyses. For instance, what are the regions that are more active when a switch is detected?

      We explicitly chose to not run a standard voxelwise statistical parametric approach on these data, as the results were reported extensively in the original study (Stottinger et al., 2018).

      - The use of methods from physics may obscure some simple facts and simpler explanations. For instance, does the flatter energy landscape in the higher gain condition reflect a smaller number of states visited in the state space of the RNN because the activity of each unit gets in the saturation range? If correct, then it may be a more straightforward way of explaining the results.

      We appreciate the reviewer's concern, as this would indeed be a problem. However, this is not the case for our network. At the time point of the perceptual switch, where the egocentric landscape dynamics are at their flattest, the RNN firing rates are approximately 50% activated, nowhere near the saturation point. In addition, a flatter landscape in the egocentric and allocentric landscape analyses only occurs, mathematically speaking, when more states are visited, not fewer.

      In addition, we note that we are very sympathetic to the complexity of our physics based analyses and have gone to great lengths to describe them in an accessible manner in both the main text and methods. We have also included tutorial style code demonstrating how the analysis can be used on a toy dynamical system in the supplementary material.

      - Some results are not as expected as the authors claim, at least in the current form of the paper. For instance, they show that, when trained to identify which of two inputs u1 and u2 is the larger (with u2=1-u1, starting with u1=1 and gradually decreasing u1), a higher gain results in the RNN reporting a switch in dominance before the true switch (e.g. when u1=0.6 and u2=0.4), and vice versa with a lower gain. In other words, it seems to correspond to a change in criterion or bias in the RNN's decision. The authors should discuss more specifically how this result is related to previous studies and models on gain modulation. An alternative finding could have been that the network output is a more (or less) deterministic function of its inputs, but this aspect is not reported.

      We appreciate this comment but it is simply not applicable to our network. There is no criterion in the RNN. We could certainly add one but this would be a significant departure from how decisions are typically modelled in RNNs. The (deterministic) readout is the max of the projection of the (instantaneous) excitatory firing rate onto the readout weights. A shift in criterion would imply that the dynamics are unaffected and the effect can be explained by a shift in the readout weights; this cannot be the case because the readout weights are stationary; the change occurs at the level of the activation function.

      We are aware that there is a large literature in decision making and psychophysics that uses the term gain in a slightly different way. Here we are strictly referring to the gain of the activation function. Although we agree that it would be interesting and important to discuss the differing uses of the term gain, this is beyond the scope of the present paper.
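
      To make the distinction between activation-function gain and readout weights concrete, the following is a minimal sketch under stated assumptions. The weights, inputs, and function names are hypothetical and are not taken from the trained network, and the sketch does not attempt to reproduce the recurrent dynamics. It shows that gain scales the slope of the sigmoid, that the slope is largest at 50% activation, and that the readout is a fixed projection followed by a max.

```python
import math

def activation(x, gain=1.0):
    """Sigmoid activation; `gain` scales the slope, not the readout."""
    return 1.0 / (1.0 + math.exp(-gain * x))

def activation_slope(x, gain=1.0):
    """Derivative of the sigmoid with respect to its input."""
    r = activation(x, gain)
    return gain * r * (1.0 - r)

def readout(rates, readout_weights):
    """Index of the output whose projection of `rates` is largest."""
    projections = [sum(w * r for w, r in zip(row, rates))
                   for row in readout_weights]
    return projections.index(max(projections))

# Hypothetical fixed readout weights: two outputs, three units.
W_OUT = [[1.0, 0.5, 0.0],
         [0.0, 0.5, 1.0]]

drive = [0.4, 0.0, -0.4]  # made-up net input to each unit
for gain in (0.5, 1.0, 2.0):
    rates = [activation(x, gain) for x in drive]
    print(f"gain={gain}: rates={[round(r, 3) for r in rates]}, "
          f"choice={readout(rates, W_OUT)}, "
          f"slope at 50% activation={activation_slope(0.0, gain):.3f}")
```

      In this toy setting W_OUT never changes across the loop; only the unit responses do, which is the sense in which the manipulation acts on the activation function rather than on a decision criterion.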

    1. Author Response

      The following is the authors’ response to the original reviews.

      We would like to thank the reviewers for their thoughtful comments and constructive suggestions. Point-by-point responses to comments are given below:

      Reviewer #1 (Recommendations For The Authors):

      This manuscript provides an important case study for in-depth research on the adaptability of vertebrates in deep-sea environments. Through analysis of the genomic data of the hadal snailfish, the authors found that this species may have entered and fully adapted to extreme environments only in the last few million years. Additionally, the study revealed the adaptive features of hadal snailfish in terms of perceptions, circadian rhythms and metabolisms, and the role of ferritin in high-hydrostatic pressure adaptation. Besides, the reads mapping method used to identify events such as gene loss and duplication avoids false positives caused by genome assembly and annotation. This ensures the reliability of the results presented in this manuscript. Overall, these findings provide important clues for a better understanding of deep-sea ecosystems and vertebrate evolution.

      Reply: Thank you very much for your positive comments and encouragement.

      However, there are some issues that need to be further addressed.

      1. L119: Please indicate the source of any data used.

      Reply: Thank you very much for the suggestion. All data sources used are indicated in Supplementary file 1.

      1. L138: The demographic history of hadal snailfish suggests a significant expansion in population size over the last 60,000 years, but the results only show some species, do the results for all individuals support this conclusion?

      Reply: Thank you for this suggestion. The estimated demographic history of the hadal snailfish reveals a significant population increase over the past 60,000 years for all individuals. The corresponding results have been incorporated into Figure 1-figure supplements 8B.

      Author response image 1.

      (B) Demographic history for 5 hadal snailfish individuals and 2 Tanaka’s snailfish individuals inferred by PSMC. The generation time was set to one year for Tanaka’s snailfish and three years for hadal snailfish.

      1. Figure 1-figure supplements 8: Is there a clear source of evidence for the generation time of 1 year chosen for the PSMC analysis?

      Reply: We apologize for the inclusion of an incorrect generation time in Figure 1-figure supplements 8. It is important to note that different generation times do not change the shape of the PSMC curve; they only shift the curve along the time axis. Due to the absence of definitive evidence regarding the generation time of the hadal snailfish, we have referred to Wang et al., 2019, assuming a generation time of one year for Tanaka’s snailfish and three years for hadal snailfish. The generation time has been incorporated into the main text (lines 516-517): “The generation time was set to one year for Tanaka’s snailfish and three years for hadal snailfish.”
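
      The statement that generation time only shifts the curve can be checked with a short sketch. It assumes the usual PSMC scaling convention (time in units of 2N0 generations, theta0 = 4 * N0 * mu * bin size) and a per-generation mutation rate that is held fixed; the function name and all numerical values are illustrative placeholders, not values from the study.

```python
def scale_psmc(t_scaled, lambda_scaled, theta0, mu_per_gen,
               gen_time_years, bin_size=100):
    """Convert scaled PSMC output to years and effective population size.

    t_scaled      : time points in units of 2*N0 generations
    lambda_scaled : relative population sizes (Ne / N0)
    theta0        : scaled mutation rate per bin, 4 * N0 * mu * bin_size
    mu_per_gen    : mutation rate per site per generation (assumed fixed)
    """
    n0 = theta0 / (4.0 * mu_per_gen * bin_size)
    years = [2.0 * n0 * t * gen_time_years for t in t_scaled]
    ne = [n0 * lam for lam in lambda_scaled]
    return years, ne

# Placeholder inputs, chosen only to illustrate the scaling.
t_scaled = [0.01, 0.1, 1.0]
lambda_scaled = [2.0, 1.0, 3.0]
theta0 = 0.001
mu = 2.5e-8

years_g1, ne_g1 = scale_psmc(t_scaled, lambda_scaled, theta0, mu, 1.0)
years_g3, ne_g3 = scale_psmc(t_scaled, lambda_scaled, theta0, mu, 3.0)

for y1, y3, n1, n3 in zip(years_g1, years_g3, ne_g1, ne_g3):
    print(f"g=1: {y1:.1f} years, Ne={n1:.1f} | g=3: {y3:.1f} years, Ne={n3:.1f}")
```

      Under these assumptions the Ne values are identical for both generation times and every time point is multiplied by the same factor, which appears as a horizontal shift on a logarithmic time axis.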

      1. L237: Transcriptomic data suggest that the greatest changes occur in the brain of hadal snailfish compared to Tanaka's snailfish. What functions are these changes specifically associated with, and how do these functions relate to deep-sea adaptation?

      Reply: Thank you for this suggestion. Through comparative transcriptome analysis, we identified 3,587 up-regulated genes and 3,433 down-regulated genes in the brains of hadal snailfish compared to Tanaka's snailfish. Subsequently, we conducted Gene Ontology (GO) functional enrichment analysis on the differentially expressed genes, revealing that the up-regulated genes were primarily associated with cilium, DNA repair, protein binding, ATP binding, and microtubule-based movement. Conversely, the down-regulated genes were associated with membranes, GTP-binding, proton transmembrane transport, and synaptic vesicles, as shown in the following table (Supplementary file 15). Previous studies have shown that high hydrostatic pressure induces DNA strand breaks and damage, and that DNA repair-related genes upregulated in the brain may help hadal snailfish overcome these challenges.

      Author response table 1.

      GO enrichment of expression up-regulated and down-regulated genes in hadal snailfish brain.

      We have added new results (Supplementary file 15) and descriptions to show the changes in the brains of hadal snailfish (lines 250-255): “Specifically, there are 3,587 up-regulated genes and 3,433 down-regulated genes in the brain of hadal snailfish compared to Tanaka snailfish, and Gene Ontology (GO) functional enrichment analyses revealed that up-regulated genes in the hadal snailfish are associated with cilium, DNA repair, and microtubule-based movement, while down-regulated genes are enriched in membranes, GTP-binding, proton transmembrane transport, and synaptic vesicles (Supplementary file 15).”

      1. L276: What is the relationship between low bone mineralization and deep-sea adaptation, and can low mineralization help deep-sea fish better adapt to the deep sea?

      Reply: Thank you for this suggestion. The hadal snailfish exhibits lower bone mineralization compared to Tanaka's snailfish, which may have facilitated its adaptation to the deep sea. On one hand, this reduced bone mineralization could have contributed to the hadal snailfish's ability to maintain neutral buoyancy without excessive energy expenditure. On the other hand, the lower bone mineralization may have also rendered their skeleton more flexible and malleable, enhancing their resilience to high hydrostatic pressure. Accordingly, we added the following new descriptions (lines 295-300): “Nonetheless, micro-CT scans have revealed shorter bones and reduced bone density in hadal snailfish, from which it has been inferred that this species has reduced bone mineralization (M. E. Gerringer et al., 2021); this may be a result of lowering density by reducing bone mineralization, allowing the fish to maintain neutral buoyancy without expending too much energy, or it may be a result of making its skeleton more flexible and malleable, so that it can better withstand the effects of HHP.”

      1. L293: The abbreviation HHP was mentioned earlier in the article and does not need to be abbreviated here.

      Reply: Thank you for the correction. We have corrected the word. Line 315.

      1. L345: It should be "In addition, the phylogenetic relationships between different individuals clearly indicate that they have successfully spread to different trenches about 1.0 Mya".

      Reply: Thank you for the correction. We have corrected the word. Line 374.

      1. It is curious what functions are associated with the up-regulated and down-regulated genes in all tissues of hadal snailfish compared to Tanaka's snailfish, and what functions have hadal snailfish lost in order to adapt to the deep sea?

      Reply: Thank you for this suggestion. We added a description of this finding in the results section (lines 337-343): “Next, we identified 34 genes that are significantly more highly expressed in all organs of hadal snailfish in comparison to Tanaka’s snailfish and zebrafish, while only seven genes were found to be significantly more highly expressed in Tanaka’s snailfish using the same criterion (Figure 5-figure supplements 1). The 34 genes are enriched in only one GO category, GO:0000077: DNA damage checkpoint (Adjusted P-value: 0.0177). Moreover, five of the 34 genes are associated with DNA repair.” This suggests that up-regulated genes in all tissues in hadal snailfish are associated with DNA repair in response to DNA damage caused by high hydrostatic pressure, whereas down-regulated genes do not show enrichment for a particular function.

      Overall, the functions lost in hadal snailfish adapted to the deep sea are mainly related to the effects of the dark environment, which can be summarized as follows (lines 375-383): “The comparative genomic analysis revealed that the complete absence of light had a profound effect on the hadal snailfish. In addition to the substantial loss of visual genes and loss of pigmentation, many rhythm-related genes were also absent, although some rhythm genes were still present. The gene loss may not only reflect relaxation of natural selection but may also contribute to better adaptation. For example, the grpr gene copies are absent or down-regulated in hadal snailfish, which could in turn increase their activity in the dark, allowing them to survive better in the dark environment (Wada et al., 1997). The loss of gpr27 may also increase the ability of lipid metabolism, which is essential for coping with short-term food deficiencies (Nath et al., 2020).”

      Reviewer #2 (Recommendations For The Authors):

      I have pointed out some of the examples that struck me as worthy of additional thought/writing/comments from the authors. Any changes/comments are relatively minor.

      Reply: Thank you very much for your positive comments on this work.

      For comparative transcriptome analyses, reads were mapped back to reference genomes and TPM values were obtained for gene-level count analyses. 1:1 orthologs were used for differential expression analyses. This is indeed the only way to normalize counts across species, by comparing the same gene set in each species. Differential expression statistics were run in DEseq2. This is a robust way to compare gene expression across species and where fold-change values are reported (e.g. Fig 3, creatively by coloring the gene name) the values are best-practice.

      In other places, TPM values are reported (e.g. Fig 2D, Fig 4C, Fig 5A, Fig 4-Fig supp 4) to illustrate expression differences within a tissue across species. The comparisons look robust, although it is not made clear how the values were obtained in all cases. For example, in Fig 2D the TPM values appear to be from eyes of individual fish, but in Fig 4C and 5A they must be some kind of average? I think that information should be added to the figure legends.

      Of note: TPM values are sensitive to the shape of the RNA abundance distribution from a given sample: A small number of very highly expressed genes might bias TPM values downward for other genes. From one individual to another or from one species to another, it is not obvious to me that we should expect the same TPM distribution from the same tissues, making it a challenging metric for comparison across samples, and especially across species. An alternative measure of RNA abundance is normalized counts that can be output from DEseq2. See:

      Zhao, Y., Li, M.C., Konaté, M.M., Chen, L., Das, B., Karlovich, C., Williams, P.M., Evrard, Y.A., Doroshow, J.H. and McShane, L.M., 2021. TPM, FPKM, or normalized counts? A comparative study of quantification measures for the analysis of RNA-seq data from the NCI patient-derived models repository. Journal of translational medicine, 19(1), pp.1-15.

      If the authors would like to keep the TPM values, I think it would be useful for them to visualize the TPM value distribution that the numbers were derived from. One way to do this would be to make a violin plot for species/tissue and plot the TPM values of interest on that. That would give a visualization of the ranked value of the gene within the context of all other TPM values. A more highly expressed gene would presumably have a higher rank in context of the specific tissue/species and be more towards the upper tail of the distribution. An example violin plot can be found in Fig 6 of:

      Burns, J.A., Gruber, D.F., Gaffney, J.P., Sparks, J.S. and Brugler, M.R., 2022. Transcriptomics of a Greenlandic Snailfish Reveals Exceptionally High Expression of Antifreeze Protein Transcripts. Evolutionary Bioinformatics, 18, p.11769343221118347.

      Alternatively, a comparison of TPM and normalized count data (heatmaps?) would be of use for at least some of the reported TPM values to show whether the different normalization methods give comparable outputs in terms of differential expression. One reason for these questions is that DEseq2 uses normalized counts for statistical analyses, but values are expressed as TPM in the noted figures (yes, TPM accounts for transcript length, but can still be subject to distribution biases).

      Reply: Thank you for your suggestions. Following them, we modified Fig 2D, Fig 4C, Fig 4-Fig supp 4, and Fig 5-Fig supp 1. In the differential expression analyses, DESeq2 normalized counts are available only for one-to-one orthologues of hadal snailfish and Tanaka's snailfish, so we now show DESeq2 normalized counts in Fig 2D, Fig 4C, Fig 4-Fig supp 4, and Fig 5-Fig supp 1. For Fig 5A, because the copy number of fthl27 genes has undergone a specific expansion in hadal snailfish, we visualized the ranking of all fthl27 genes across tissues with violin plots in Fig 5-Fig supp 2.
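
      The reviewer's concern about TPM can be illustrated with a toy calculation. The sketch below is a simplified re-implementation of the two ideas (TPM, and median-of-ratios size factors of the kind DESeq2 uses), not a call into DESeq2 itself. The gene counts, gene lengths, and function names are made up for illustration.

```python
import math
import statistics

def tpm(counts, lengths_bp):
    """Transcripts per million: length-normalize, then scale to sum to 1e6."""
    rates = [c / l for c, l in zip(counts, lengths_bp)]
    total = sum(rates)
    return [r / total * 1e6 for r in rates]

def size_factors(samples):
    """Median-of-ratios size factors for a list of per-sample count lists.

    Simplified: assumes the same gene order in every sample and counts > 0.
    """
    n_genes = len(samples[0])
    geo_means = [
        math.exp(sum(math.log(s[g]) for s in samples) / len(samples))
        for g in range(n_genes)
    ]
    return [
        statistics.median(s[g] / geo_means[g] for g in range(n_genes))
        for s in samples
    ]

# Made-up counts: genes 1-3 are unchanged, gene 4 is strongly up in tissue_b.
lengths = [1000, 1000, 1000, 1000]
tissue_a = [100, 200, 300, 400]
tissue_b = [100, 200, 300, 4400]

tpm_a, tpm_b = tpm(tissue_a, lengths), tpm(tissue_b, lengths)
sf_a, sf_b = size_factors([tissue_a, tissue_b])

print("TPM of gene 1:", round(tpm_a[0]), "vs", round(tpm_b[0]))
print("normalized count of gene 1:",
      round(tissue_a[0] / sf_a, 2), "vs", round(tissue_b[0] / sf_b, 2))
```

      In this toy case the TPM of the unchanged gene 1 drops fivefold in tissue_b purely because gene 4 takes a larger share of the total, whereas the median-of-ratios normalized count of gene 1 stays the same.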

      Author response image 2.

      (D) Log10-transformed DESeq2 normalized counts (COUNTDESEQ2) of vision-related genes in the eyes of hadal snailfish and Tanaka's snailfish. * represents genes significantly downregulated in hadal snailfish (corrected P < 0.05).

      Author response image 3.

      (C) Deletion of one copy of grpr and down-regulated expression of the other copy in hadal snailfish. The relative positions of genes on chromosomes are indicated by arrows, with arrows to the right representing the forward strand and arrows to the left representing the reverse strand. The heatmap presented is the average of the DESeq2 normalized counts (COUNTDESEQ2) in all replicate samples from each tissue. * represents tissue in which grpr-1 was significantly down-regulated in hadal snailfish (corrected P < 0.05).

      Author response image 4.

      Expression of the vitamin D related genes in various tissues of hadal snailfish and Tanaka's snailfish. The heatmap presented is the average of the normalized counts for DESeq2 (COUNTDESEQ2) in all replicate samples from each tissue.

      Author response image 5.

      (B) Expression of the ROS-related genes in different tissues of hadal snailfish and Tanaka's snailfish. The heatmap presented is the average of the normalized counts for DESeq2 (COUNTDESEQ2) in all replicate samples from each tissue.

      Author response image 6.

      Ranking of the expression of individual copies of fthl27 gene in hadal snailfish and Tanaka's snailfish in various tissues showed that all copies of fthl27 in hadal snailfish have high expression. The gene expression presented is the average of TPM in all replicate samples from each tissue.

      Line 96: Which BUSCOs? In the methods it is noted that the actinopterygii_odb10 BUSCO set was used. I think it should also be noted here so that it is clear which BUSCO set was used for completeness analysis. It could even be informally the ray-finned fish BUSCOs or Actinopterygii BUSCOs.

      Reply: Thank you for this suggestion. We used the Actinopterygii_odb10 database and added the BUSCO set to the main text as follows (lines 92-95): “The new assembly filled 1.26 Mb of gaps that were present in our previous assembly and has a much higher level of genome continuity and completeness (with complete BUSCOs of 96.0 % [Actinopterygii_odb10 database]) than the two previous assemblies.”

      Lines 102-105: The medaka genome paper proposes the notion that the ancestral chromosome number between medaka, tetraodon, and zebrafish is 24. There may be other evidence of that too. Some of that evidence should be cited here to support the notion that sticklebacks had chromosome fusions to get to 21 chromosomes rather than scorpionfish having chromosome fissions to get to 24. Here's the medaka genome paper:

      Kasahara, M., Naruse, K., Sasaki, S., Nakatani, Y., Qu, W., Ahsan, B., Yamada, T., Nagayasu, Y., Doi, K., Kasai, Y. and Jindo, T., 2007. The medaka draft genome and insights into vertebrate genome evolution. Nature, 447(7145), pp.714-719.

      Reply: Thank you for your great suggestion. Accordingly, we modified the sentence and added the citation as follows (lines 100-105): “We noticed that there is no major chromosomal rearrangement between hadal snailfish and Tanaka’s snailfish, and chromosome numbers are consistent with the previously reported MTZ-ancestor (the last common ancestor of medaka, Tetraodon, and zebrafish) (Kasahara et al., 2007), while the stickleback had undergone several independent chromosomal fusion events (Figure 1-figure supplements 4).”

      Line 161-173: "Along with the expression data, we noticed that these genes exhibit a different level of relaxation of natural selection in hadal snailfish (Figure 2B; Figure 2-figure supplements 1)." With the above statement and evidence, the authors are presumably referring to gene losses and differences in expression levels. I think that since gene expression was not measured in a controlled way it may not be a good measure of selection throughout. The reported genes could be highly expressed under some other condition, selection intact. I find Fig2-Fig supp 1 difficult to interpret. I assume I am looking for regions where Tanaka’s snailfish reads map and Hadal snailfish reads do not, but it is not abundantly clear. Also, other measures of selection might be good to investigate: accumulation of mutations in the region could be evidence of relaxed selection, for example, where essential genes will accumulate fewer mutations than conditional genes or (presumably) genes that are not needed at all. The authors could complete a mutational/SNP analysis using their genome data on the discussed genes if they want to strengthen their case for relaxed selection. Here is a reference (from Arabidopsis) showing these kinds of effects:

      Monroe, J.G., Srikant, T., Carbonell-Bejerano, P., Becker, C., Lensink, M., Exposito-Alonso, M., Klein, M., Hildebrandt, J., Neumann, M., Kliebenstein, D. and Weng, M.L., 2022. Mutation bias reflects natural selection in Arabidopsis thaliana. Nature, 602(7895), pp.101-105.

      Reply: Thank you for pointing out this important issue. Following your suggestion, we have removed the mention of the down-regulation of some visual genes in the eyes of hadal snailfish and the results of the original Fig2-Fig supp 1 that were based on reads mapping to confirm whether the genes were lost or not. To investigate the potential relaxation of natural selection in the opn1sw2 gene in hadal snailfish, we conducted precise gene structure annotation. Our findings revealed that the opn1sw2 gene is pseudogenized in hadal snailfish, indicating a relaxation of natural selection. We have included this result in Figure 2-figure supplements 1.

      Author response image 7.

      Pseudogenization of opn1sw2 in hadal snailfish. The deletion changed the protein’s sequence, causing its premature termination.

      Accordingly, we have toned down the related conclusions in the main text as follows (lines 164-173): “We noticed that the lws gene (long wavelength) has been completely lost in both hadal snailfish and Tanaka’s snailfish; rh2 (central wavelength) has been specifically lost in hadal snailfish (Figure 2B and 2C); sws2 (short wavelength) has undergone pseudogenization in hadal snailfish (Figure 2-figure supplements 1); while rh1 and gnat1 (perception of very dim light) are both still present and expressed in the eyes of hadal snailfish (Figure 2D). A previous study has also proven the existence of rhodopsin protein in the eyes of hadal snailfish using proteome data (Yan, Lian, Lan, Qian, & He, 2021). The preservation and expression of genes for the perception of very dim light suggests that they are still subject to natural selection, at least in the recent past.”

      Line 161-170: What tissue were the transcripts derived from for looking at expression level of opsins? Eyes?

      Reply: Thank you for your suggestions. The transcripts used to observe the expression levels of opsins were obtained from the eye.

      Line 191: What does tmc1 do specifically?

      Reply: Thank you for this suggestion. The tmc1 gene encodes transmembrane channel-like protein 1, involved in the mechanotransduction process in sensory hair cells of the inner ear that facilitates the conversion of mechanical stimuli into electrical signals used for hearing and homeostasis. We added functional annotations for the tmc1 in the main text (lines 190-196): “Of these, the most significant upregulated gene is tmc1, which encodes transmembrane channel-like protein 1, involved in the mechanotransduction process in sensory hair cells of the inner ear that facilitates the conversion of mechanical stimuli into electrical signals used for hearing and homeostasis (Maeda et al., 2014), and some mutations in this gene have been found to be associated with hearing loss (Kitajiri, Makishima, Friedman, & Griffith, 2007; Riahi et al., 2014).”

      Line 208: "it is likely" is a bit proscriptive

      Reply: Thank you for this suggestion. We rephrased the sentence as follows (lines 213-215): “Expansion of cldnj was observed in all resequenced individuals of the hadal snailfish (Supplementary file 10), which provides an explanation for how the hadal snailfish breaks the depth limitation on calcium carbonate deposition and has become one of the few teleost species in the hadal zone.”

      Line 199: maybe give a little more info on exactly what cldnj does? e.g. "cldnj encodes a claudin protein that has a role in tight junctions through calcium independent cell-adhesion activity" or something like that.

      Reply: Thank you for this suggestion. We have added functional annotations for cldnj to the main text (lines 200-204): “Moreover, the gene involved in lifelong otolith mineralization, cldnj, has three copies in hadal snailfish but only one copy in other teleost species; it encodes a claudin protein that has a role in tight junctions through calcium independent cell-adhesion activity (Figure 3B, Figure 3C) (Hardison, Lichten, Banerjee-Basu, Becker, & Burgess, 2005).”

      Lines 199-210: Paragraph on cldnj: there are extra cldnj genes in the hadal snailfish, but no apparent extra expression. Could the authors mention that in their analysis/discussion of the data?

      Reply: Thank you for your suggestions. Although we did not observe significant changes in cldnj expression in the brain tissue of hadal snailfish compared to Tanaka's snailfish, it is important to consider that the brain may not be the primary site of cldnj expression. Previous studies in zebrafish have consistently shown expression of cldnj in the otocyst during the critical early growth phase of the otolith, with a lower level of expression observed in the zebrafish brain. However, due to the unavailability of otocyst samples from hadal snailfish in our current study, our findings do not provide confirmation of any additional expression changes resulting from cldnj amplification. Consequently, it is crucial to conduct future comprehensive investigations to explore the expression patterns of cldnj specifically in the otocyst of hadal snailfish. Accordingly, we added a discussion of this result in the main text (lines 209-214): “In our investigation, we found that the expression of cldnj was not significantly higher in the brain of the hadal snailfish than in Tanaka’s snailfish, which may be related to the fact that cldnj is mainly expressed in the otocyst, while the expression in the brain is lower. However, due to the immense challenge in obtaining samples of hadal snailfish, the expression of cldnj in the otocyst deserves more in-depth study in the future.”

      Lines 225-231: I wonder whether low expression of a circadian gene might be a time of day effect rather than an evolutionary trait. Could the authors comment?

      Reply: Thank you for your suggestions. Previous studies have shown that the grpr gene is expressed relatively consistently in mouse suprachiasmatic nucleus (SCN) throughout the day (Figure 4-figure supplements 1) and we hypothesize that the low expression of the grpr-1 gene in hadal snailfish is an evolutionary trait. We have modified this result in the main text (lines 232-242): “In addition, in the teleosts closely related to hadal snailfish, there are usually two copies of grpr encoding the gastrin-releasing peptide receptor; we noticed that in hadal snailfish one of them is absent and the other is barely expressed in brain (Figure 4C), whereas a previous study found that the grpr gene in the mouse suprachiasmatic nucleus (SCN) did not fluctuate significantly during a 24-hour light/dark cycle and had a relatively stable expression (Pembroke, Babbs, Davies, Ponting, & Oliver, 2015) (Figure 4-figure supplements 1). It has been reported that grpr deficient mice, while exhibiting normal circadian rhythms, show significantly increased locomotor activity in dark conditions (Wada et al., 1997; Zhao et al., 2023). We might therefore speculate that the absence of that gene might in some way benefit the activity of hadal snailfish under complete darkness.”

      Author response image 8.

      (B) Expression of grpr over a 24-hour light/dark cycle in the mouse suprachiasmatic nucleus (SCN). Data source: http://www.wgpembroke.com/shiny/SCNseq.

      Line 253: What is gpr27? G protein coupled receptor?

      Reply: We apologize for the ambiguous description. Gpr27 is a G protein-coupled receptor, belonging to the family of cell surface receptors. We introduced gpr27 in the main text as follows (lines 270-273): “Gpr27 is a G protein-coupled receptor, belonging to the family of cell surface receptors, involved in various physiological processes and expressed in multiple tissues including the brain, heart, kidney, and immune system.”

      Line 253: Fig4 Fig supp 3 is a good example of pseudogenization!

      Reply: Thank you very much for your recognition.

      Line 279: What is bglap? It regulates bone mineralization, but what specifically does that gene do?

      Reply: We apologize for the ambiguous description. The bglap gene encodes a highly abundant bone protein secreted by osteoblasts that binds calcium and hydroxyapatite and regulates bone remodeling and energy metabolism. We introduced bglap in the main text as follows (lines 300-304): “The gene bglap, which encodes a highly abundant bone protein secreted by osteoblasts that binds calcium and hydroxyapatite and regulates bone remodeling and energy metabolism, had been found to be a pseudogene in hadal fish (K. Wang et al., 2019), which may contribute to this phenotype.”

      Line 299: Introduction of another gene without providing an exact function: acaa1.

      Reply: We apologize for the ambiguous description. The acaa1 gene encodes acetyl-CoA acetyltransferase 1, a key regulator of fatty acid β-oxidation in the peroxisome, which plays a controlling role in fatty acid elongation and degradation. We introduced acaa1 in the main text as follows (lines 319-324): “In regard to the effect of cell membrane fluidity, relevant genetic alterations had been identified in previous studies, i.e., the amplification of acaa1 (encoding acetyl-CoA acetyltransferase 1, a key regulator of fatty acid β-oxidation in the peroxisome, which plays a controlling role in fatty acid elongation and degradation) may increase the ability to synthesize unsaturated fatty acids (Fang et al., 2000; K. Wang et al., 2019).”

      Fig 5 legend: The DCFH-DA experiment is not an immunofluorescence assay. It is better described as a redox-sensitive fluorescent probe. Please take note throughout.

      Reply: Thank you for pointing out our mistake. We corrected the wording at lines 1048 and 1151 as follows: “ROS levels were confirmed by redox-sensitive fluorescent probe using DCFH-DA molecular probe in 293T cell culture medium with or without fthl27-overexpression plasmid added with H2O2 or FAC for 4 hours.”

      Line 326: Manuscript notes that ROS levels in transfected cells are "significantly lower" than the control group, but there is no quantification or statistical analysis of ROS levels. In the methods, I noticed the mention of flow cytometry, but do not see any data from that experiment. Proportion of cells with DCFH-DA fluorescence above a threshold would be a good statistic for the experiment... Another could be average fluorescence per cell. Figure 5B shows some images with green dots and it looks like more green in the "control" (which could better be labeled as "mock-transfection") than in the fthl27 overexpression, but this could certainly be quantified by flow cytometry. I recommend that data be added.

      Reply: Thank you for your suggestions. We apologize for the error in the main text: we used a fluorescence microscope to observe fluorescence in our experiments, not a flow cytometer. We have corrected it in the methods section as follows (lines 651-653): “ROS levels were measured using a DCFH-DA molecular probe, and fluorescence was observed through a fluorescence microscope with an optional FITC filter, with the background removed to observe changes in fluorescence.” Meanwhile, we processed the images with ImageJ to obtain the respective mean fluorescence intensities (MFI) and found that the MFI of the fthl27-overexpression cells was lower than that of the control group, which indicated that the ROS levels of the fthl27-overexpression cells were significantly lower than those of the control group. MFI has been added to Figure 5B.

      Author response image 9.

      ROS levels were confirmed by redox-sensitive fluorescent probe using DCFH-DA molecular probe in 293T cell culture medium with or without fthl27-overexpression plasmid added with H2O2 or FAC for 4 hours. Images are merged from bright field images with fluorescent images using ImageJ, while the mean fluorescence intensity (MFI) is also calculated using ImageJ. Green, cellular ROS. Scale bars equal 100 μm.

      Regarding the ROS experiment: Transfection of HEK293T cells should be reasonably straightforward, and the experiment was controlled appropriately with a mock transfection, but some additional parameters are still needed to help interpret the results. Those include: Direct evidence that the transfection worked, like qPCR, western blots (is the fthl27 tagged with an antigen?), coexpression of a fluorescent protein. Then transfection efficiency should be calculated and reported.

      Reply: Thank you for your suggestions. To assess the success of the transfection, we randomly selected a subset of fthl27-transfected HEK293T cells for transcriptome sequencing. This approach allowed us to examine the gene expression profiles and confirm the efficacy of the transfection process. As control samples, we obtained transcriptome data from two untreated HEK293T cells (SRR24835259 and SRR24835265) from NCBI. Subsequently, we extracted the fthl27 gene sequence of the hadal snailfish, along with 1,000 bp upstream and downstream regions, as a separate scaffold. This scaffold was then merged with the human genome to assess the expression levels of each gene in the three transcriptome datasets. The results demonstrated that the fthl27 gene exhibited the highest expression in fthl27-transfected HEK293T cells, while in the control group, the expression of the fthl27 gene was negligible (TPM = 0). Additionally, the expression patterns of other highly expressed genes were similar to those observed in the control group, confirming the successful fthl27 transfection. These findings have been incorporated into Figure 5-figure supplements 3.

      Author response image 10.

      (B) Read depth of the fthl27 gene in fthl27-transfected HEK293T cells and 2 untreated HEK293T cells (SRR24835259 and SRR24835265) transcriptome data. (C) Expression of each gene in the transcriptome data of fthl27-transfected HEK293T cells and 2 untreated HEK293T cells (SRR24835259 and SRR24835265), where the genes shown are the 4 most highly expressed genes in each sample.
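
      As a reminder of what the TPM statistic quoted in the reply above measures, the following is a minimal sketch rather than the pipeline used for the response: the function name, gene labels, read counts, and gene lengths are all hypothetical, and the actual quantification tool is not stated in the text. Each gene's read count is divided by its length, and the resulting rates are rescaled to sum to one million, so a gene with no mapped reads has TPM = 0.

```python
def tpm(read_counts, lengths_bp):
    """Transcripts per million from raw read counts and gene lengths in bp.

    Both arguments are dicts keyed by gene name (hypothetical inputs).
    """
    # Length-normalised read rate for each gene.
    rates = {gene: read_counts[gene] / lengths_bp[gene] for gene in read_counts}
    total = sum(rates.values())
    if total == 0:
        raise ValueError("no reads mapped to any gene")
    # Rescale so that the values in one sample sum to one million.
    return {gene: rate / total * 1e6 for gene, rate in rates.items()}


# Hypothetical numbers, chosen only to illustrate the calculation.
lengths = {"fthl27": 1000, "GENE_A": 1000, "GENE_B": 3000}
control_counts = {"fthl27": 0, "GENE_A": 500, "GENE_B": 1500}
transfected_counts = {"fthl27": 4000, "GENE_A": 500, "GENE_B": 1500}

control_tpm = tpm(control_counts, lengths)
transfected_tpm = tpm(transfected_counts, lengths)
```

      Because the values are normalised within each sample, TPM lends itself to the kind of within-sample ranking of highly expressed genes shown in panel (C).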

      Lines 383-386: expression of DNA repair genes is mentioned, but not shown anywhere in the results?

      Reply: Thank you for your suggestions. Accordingly, we added a description of this finding in the results section (lines 337-343): “Next, we identified 34 genes that are significantly more highly expressed in all organs of hadal snailfish in comparison to Tanaka’s snailfish and zebrafish, while only seven genes were found to be significantly more highly expressed in Tanaka’s snailfish using the same criterion (Figure 5-figure supplements 1). The 34 genes are enriched in only one GO category, GO:0000077: DNA damage checkpoint (Adjusted P-value: 0.0177). Moreover, five of the 34 genes are associated with DNA repair.” We also added this information to Figure 5-figure supplements 1C.

      Author response image 11.

      (C) Genes were significantly more highly expressed in all tissues of the hadal snailfish compared to Tanaka's snailfish, and 5 genes (purple) were associated with DNA repair.

    1. Author response:

      The following is the authors’ response to the previous reviews.

      eLife assessment

      This important study explores infants' attention patterns in real-world settings using advanced protocols and cutting-edge methods. The presented evidence for the role of EEG theta power in infants' attention is currently incomplete. The study will be of interest to researchers working on the development and control of attention.

      Public Reviews:

      Reviewer #1 (Public Review):

      Summary:

      The paper investigates the physiological and neural processes that relate to infants' attention allocation in a naturalistic setting. Contrary to experimental paradigms that are usually employed in developmental research, this study investigates attention processes while letting the infants be free to play with three toys in the vicinity of their caregiver, which is closer to a common, everyday life context. The paper focuses on infants at 5 and 10 months of age and finds differences in what predicts attention allocation. At 5 months, attention episodes are shorter and their duration is predicted by autonomic arousal. At 10 months, attention episodes are longer, and their duration can be predicted by theta power. Moreover, theta power predicted the proportion of looking at the toys, as well as a decrease in arousal (heart rate). Overall, the authors conclude that attentional systems change across development, becoming more driven by cortical processes.

      Strengths:

      I enjoyed reading the paper, I am impressed with the level of detail of the analyses, and I am strongly in favour of the overall approach, which tries to move beyond in-lab settings. The collection of multiple sources of data (EEG, heart rate, looking behaviour) at two different ages (5 and 10 months) is a key strength of this paper. The original analyses, which build onto robust EEG preprocessing, are an additional feat that improves the overall value of the paper. The careful consideration of how theta power might change before, during, and in the prediction of attention episodes is especially remarkable. However, I have a few major concerns that I would like the authors to address, especially on the methodological side.

      Points of improvement

      (1) Noise

      The first concern is the level of noise across age groups, periods of attention allocation, and metrics. Starting with EEG, I appreciate the analysis of noise reported in supplementary materials. The analysis focuses on a broad level (average noise in 5-month-olds vs 10-month-olds) but variations might be more fine-grained (for example, noise in 5mos might be due to fussiness and crying, while at 10 months it might be due to increased movements). More importantly, noise might even be the same across age groups, but correlated to other aspects of their behaviour (head or eye movements) that are directly related to the measures of interest. Is it possible that noise might co-vary with some of the behaviours of interest, thus leading to either spurious effects or false negatives? One way to address this issue would be for example to check if noise in the signal can predict attention episodes. If this is the case, noise should be added as a covariate in many of the analyses of this paper. 

      We thank the reviewer for this comment. We certainly have evidence that even the most state-of-the-art cleaning procedures (such as machine-learning trained ICA decompositions, as we applied here) are unable to remove eye movement artifact entirely from EEG data (Haresign et al., 2021; Phillips et al., 2023). (This applies to our data but also to others’ where confounding effects of eye movements are generally not considered.) Importantly, however, our analyses have been designed very carefully with this explicit challenge in mind. All of our analyses compare changes in the relationship between brain activity and attention as a function of age, and there is no evidence to suggest that different sources of noise (e.g. crying vs. movement) would associate differently with attention durations nor change their interactions with attention over developmental time. And figures 5 and 7, for example, both look at the relationship of EEG data at one moment in time to a child’s attention patterns hundreds or thousands of milliseconds before and after that moment, for which there is no possibility that head or eye movement artifact can have systematically influenced the results.

      Moving onto the video coding, I see that inter-rater reliability was not very high. Is this due to the fine-grained nature of the coding (20ms)? Is it driven by differences in expertise among the two coders? Or because coding this fine-grained behaviour from video data is simply too difficult? The main dependent variable (looking duration) is extracted from the video coding, and I think the authors should be confident they are maximising measurement accuracy.

      We appreciate the concern. To calculate IRR we used this function (Cardillo G. (2007) Cohen's kappa: compute the Cohen's kappa ratio on a square matrix. http://www.mathworks.com/matlabcentral/fileexchange/15365). Our “Observed agreement” was 0.7 (std = 0.15). However, we decided to report Cohen's kappa coefficient, which is generally thought to be a more robust measure as it takes into account the agreement occurring by chance. We conducted the training meticulously (refer to response to Q6, R3), and we have confidence that our coders performed to the best of their abilities.
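
      To make the difference between observed agreement and Cohen's kappa concrete, here is a minimal sketch. It assumes that the two coders' frame-by-frame labels are available as equal-length lists; the labels, variable names, and function name are hypothetical, and this is not the MATLAB function cited above. Kappa discounts the agreement that two coders would reach by chance given how often each of them uses each label.

```python
from collections import Counter


def cohens_kappa(coder_a, coder_b):
    """Return (observed agreement, Cohen's kappa) for two label sequences."""
    if len(coder_a) != len(coder_b) or not coder_a:
        raise ValueError("label sequences must be non-empty and of equal length")
    n = len(coder_a)
    # Proportion of frames on which the two coders gave the same label.
    observed = sum(a == b for a, b in zip(coder_a, coder_b)) / n
    # Agreement expected by chance from each coder's label frequencies.
    freq_a = Counter(coder_a)
    freq_b = Counter(coder_b)
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if expected == 1.0:
        # Degenerate case: both coders used a single identical label throughout.
        return observed, 1.0
    return observed, (observed - expected) / (1 - expected)


# Hypothetical frame-by-frame codes, for illustration only.
coder_a = ["toy", "toy", "away", "toy", "away", "toy", "toy", "away", "toy", "toy"]
coder_b = ["toy", "toy", "away", "away", "away", "toy", "toy", "toy", "toy", "toy"]
observed, kappa = cohens_kappa(coder_a, coder_b)
```

      In this toy example the coders agree on 8 of 10 frames, yet kappa is considerably lower than 0.8, because both coders label most frames "toy" and would therefore agree often by chance alone.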

      (2) Cross-correlation analyses

      I would like to raise two issues here. The first is the potential problem of using auto-correlated variables as input for cross-correlations. I am not sure whether theta power was significantly autocorrelated. If it is, could it explain the cross-correlation result? The fact that the cross-correlation plots in Figure 6 peak at zero, and are significant (but lower) around zero, makes me think that it could be a consequence of periods around zero being autocorrelated. Relatedly: how does the fact that the significant lag includes zero, and a bit before, affect the interpretation of this effect? 

      Just to clarify this analysis, we did include a plot showing autocorrelation of theta activity in the original submission (Figs 7A and 7B in the revised paper). These indicate that theta shows little to no autocorrelation, and we can see no way in which this might have influenced our results. From their comments, the reviewer seems rather to be thinking of phasic changes in the autocorrelation, and of the possibility that greater stability in theta during the time period around looks might have caused the cross-correlation result shown in 7E. Again, though, we can see no way in which this might be true, as the cross-correlation indicates that greater theta power is associated with a greater likelihood of looking, and this would not have been affected by changes in the autocorrelation.

      A second issue with the cross-correlation analyses is the coding of the looking behaviour. If I understand correctly, if an infant looked for a full second at the same object, they would get a maximum score (e.g., 1) while if they looked for 500ms at the object and 500ms away from the object, they would receive a score of e.g., 0.5. However, if they looked at one object for 500ms and another object for 500ms, they would receive a maximum score (e.g., 1). The reason seems unclear to me because these are different attention episodes, but they would be treated as one. In addition, the authors also show that within an attentional episode theta power changes (for 10mos). What is the reason behind this scoring system? Wouldn't it be better to adjust by the number of attention switches, e.g., with the formula: looking-time/(1+N_switches), so that if infants looked for a full second, but made 1 switch from one object to the other, the score would be .5, thus reflecting that attention was terminated within that episode?

      We appreciate this suggestion. This is something we did not consider, and we thank the reviewer for raising it. In response to their comment, we have now rerun the analyses using the new measure (looking-time/(1+N_switches)), and we are reassured to find that the results remain highly consistent. Please see Author response image 1 below, where you can see the original results in orange and the new measure in blue at 5 and 10 months.

      Author response image 1.
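
      The measure suggested by the reviewer, looking-time/(1+N_switches), can be sketched as follows. This is an illustration only: the segment representation, the names, and the rule that a switch is counted whenever two successive object looks land on different objects are assumptions, not details taken from the manuscript.

```python
def adjusted_looking_score(segments, window_s=1.0):
    """Score one time window as looking-time / (1 + N_switches).

    `segments` is a list of (target, duration_s) pairs within the window;
    a target of None means the infant was looking away from all objects.
    """
    # Proportion of the window spent looking at any object.
    looking_time = sum(dur for target, dur in segments if target is not None)
    # Assumption: a switch is a change of object between successive object looks.
    looked = [target for target, _ in segments if target is not None]
    n_switches = sum(1 for prev, cur in zip(looked, looked[1:]) if prev != cur)
    return (looking_time / window_s) / (1 + n_switches)


# The three cases described in the reviewer's comment (hypothetical labels).
full_look = adjusted_looking_score([("toyA", 1.0)])
half_away = adjusted_looking_score([("toyA", 0.5), (None, 0.5)])
one_switch = adjusted_looking_score([("toyA", 0.5), ("toyB", 0.5)])
```

      Under these assumptions a full second on one object scores 1, whereas half a second away or a switch between two objects both score 0.5, which is the behaviour the reviewer asked for.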

      (3) Clearer definitions of variables, constructs, and visualisations

      The second issue is the overall clarity and systematicity of the paper. The concept of attention appears with many different names. In the abstract alone, it is described as attention control, attentional behaviours, attentiveness, attention durations, attention shifts and attention episode. More names are used elsewhere in the paper. Although some of them are indeed meant to describe different aspects, others are overlapping. As a consequence, the main results also become more difficult to grasp. For example, it is stated that autonomic arousal predicts attention, but it's harder to understand what specific aspect (duration of looking, disengagement, etc.) it is predictive of. Relatedly, the cognitive process under investigation (e.g., attention) and its operationalization (e.g., duration of consecutive looking toward a toy) are used interchangeably. I would want to see more demarcation between different concepts and between concepts and measurements.

      We appreciate the comment and we have clarified the concepts and their operationalisation throughout the revised manuscript.

      General Remarks

      In general, the authors achieved their aim in that they successfully showed the relationship between looking behaviour (as a proxy of attention), autonomic arousal, and electrophysiology. Two aspects are especially interesting. First, the fact that at 5 months, autonomic arousal predicts the duration of subsequent attention episodes, but at 10 months this effect is not present. Conversely, at 10 months, theta power predicts the duration of looking episodes, but this effect is not present in 5-month-old infants. This pattern of results suggests that younger infants have less control over their attention, which mostly depends on their current state of arousal, but older infants have gained cortical control of their attention, which in turn impacts their looking behaviour and arousal.

      We thank the reviewer for the close attention that they have paid to our manuscript, and for their insightful comments.

      Reviewer #2 (Public Review):

      Summary:

      This manuscript explores infants' attention patterns in real-world settings and their relationship with autonomic arousal and EEG oscillations in the theta frequency band. The study included 5- and 10-month-old infants during free play. The results showed that, in the 5-month-old group, a decline in HR forward-predicted attentional behaviors, while the 10-month-old group exhibited increased theta power following shifts in gaze, indicating the start of a new attention episode. Additionally, this increase in theta power predicted the duration of infants' looking behavior.

      Strengths:

      The study's strengths lie in its utilization of advanced protocols and cutting-edge techniques to assess infants' neural activity and autonomic arousal associated with their attention patterns, as well as the extensive data coding and processing. Overall, the findings have important theoretical implications for the development of infant attention.

      Weaknesses:

      Certain methodological procedures require further clarification, e.g., details on EEG data processing. Additionally, it would be beneficial to eliminate possible confounding factors and consider alternative interpretations, e.g., whether the differences observed between the two age groups were partly due to varying levels of general arousal and engagement during the free play.

      We thank the reviewer for their suggestions and have addressed them in our point-by-point responses below.

      Reviewer #3 (Public Review):

      Summary:

      Much of the literature on attention has focused on static, non-contingent stimuli that can be easily controlled and replicated--a mismatch with the actual day-to-day deployment of attention. The same limitation is evident in the developmental literature, which is further hampered by infants' limited behavioral repertoires and the general difficulty in collecting robust and reliable data in the first year of life. The current study engages young infants as they play with age-appropriate toys, capturing visual attention, cardiac measures of arousal, and EEG-based metrics of cognitive processing. The authors find that the temporal relations between measures are different at age 5 months vs. age 10 months. In particular, at 5 months of age, cardiac arousal appears to precede attention, while at 10 months of age attention processes lead to shifts in neural markers of engagement, as captured in theta activity.

      Strengths:

      The study brings to the forefront sophisticated analytical and methodological techniques to bring greater validity to the work typically done in the research lab. By using measures in the moment, they can more closely link biological measures to actual behaviors and cognitive stages. Often, we are forced to capture these measures in separate contexts and then infer in-the-moment relations. The data and techniques provide insights for future research work.

      Weaknesses:

      The sample is relatively modest, although this is somewhat balanced by the sheer number of data points generated by the moment-to-moment analyses. In addition, the study is cross-sectional, so the data cannot capture true change over time. Larger samples, followed over time, will provide a stronger test for the robustness and reliability of the preliminary data noted here. Finally, while the method certainly provides for a more active and interactive infant in testing, we are a few steps removed from the complexity of daily life and social interactions.

      We thank the reviewer for their suggestions and have addressed them in our point-by-point responses below.

      Reviewer #1 (Recommendations For The Authors):

      Here are some specific ways in which clarity can be improved:

      A. Regarding the distinction between constructs, or measures and constructs:

      i. In the results section, I would prefer to mention looking duration and heart rate as metrics that have been measured, while in the introduction and discussion, a clear 1-to-1 link between construct/cognitive process and behavioural or (neuro)psychophysical measure can be made (e.g., sustained attention is measured via looking durations; autonomic arousal is measured via heart-rate).

      The way attention and arousal were operationalised is now clarified throughout the text, especially in the results.

      ii. Relatedly, the "attention" variable is not really measuring attention directly. It is rather measuring looking time (proportion of looking time to the toys?), which is the operationalisation, which is hypothesised to be related to attention (the construct/cognitive process). I would make the distinction between the two stronger.

      This distinction between looking and paying attention is clearer now in the revised manuscript as per R1 and R3’s suggestions. We have also added a paragraph in the Introduction to clarify it and pointed out its limitations (see pg. 5).

      B. Each analysis should be set out to address a specific hypothesis. I would rather see hypotheses in the introduction (without direct reference to the details of the models that were used), and how a specific relation between variables should follow from such hypotheses. This would also solve the issue that some analyses did not seem directly necessary to the main goal of the paper. For example:

      i. Are ACF and survival probability analyses aimed at proving different points, or are they different analyses to prove the same point? Consider either making clearer how they differ or moving one to supplementary materials.

      We clarified this in pg. 4 of the revised manuscript.

      ii. The autocorrelation results are not mentioned in the introduction. Are they aiming to show that the variables can be used for cross-correlation? Please clarify their role or remove them.

      We clarified this in pg. 4 of the revised manuscript.

      C. Clarity of cross-correlation figures. To ensure clarity when presenting a cross-correlation plot, it's important to provide information on the lead-lag relationships and which variable is considered X and which is Y. This could be done by labelling the axes more clearly (e.g., the left-hand side of the x-axis specifies x leads y, right hand specifies y leads x) or adding a legend (e.g., dashed line indicates x leading y, solid line indicates y leading x). Finally, the limits of the x-axis are consistent across plots, but the limits of the y-axis differ, which makes it harder to visually compare the different plots. More broadly, the plots could have clearer labels, and their resolution could also be improved.

      This information on which variable precedes/follows was in the figure captions. However, we have edited the figures as per the reviewer’s suggestion and added this information in the figures themselves. We have also uploaded all the figures in higher resolution.

      D. Figure 7 was extremely helpful for understanding the paper, and I would rather have it as Figure 1 in the introduction. 

      We have moved figure 7 to figure 1 as per this request.

      E. Statistics should always be reported, and effects should always be described. For example, results of autocorrelation are not reported, and from the plot, it is also not clear if the effects are significant (the caption states that red dots indicate significance, but there are no red dots. Does this mean there is no autocorrelation?).

      We apologise – this was hard to read in the original. We have clarified that there is no autocorrelation present in Fig 7A and 7D.

      And if so, given that theta is a wave, how is it possible that there is no autocorrelation (connected to point 1)? 

      We thank the reviewer for raising this point. Theta power measures oscillatory activity in the EEG within the 3-6Hz window (i.e. 3 to 6 oscillations per second), whereas our autocorrelation analysis looked at changes in theta power between consecutive 1-second-long windows. To say that there is no autocorrelation in the data means that, if there is more 3-6Hz activity within one particular 1-second window, there tends not to be significantly more 3-6Hz activity within the 1-second windows immediately before and after.
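
      The distinction can be illustrated with a minimal sketch of a lagged autocorrelation computed over per-window power values. It assumes that theta power has already been reduced to one value per 1-second window; the function name and the numbers are hypothetical, and the estimator shown (lagged covariance divided by the total sum of squares) is one common choice, not necessarily the one used in the paper.

```python
def autocorrelation(values, lag):
    """Sample autocorrelation of a series at a given non-negative lag."""
    n = len(values)
    if not 0 <= lag < n:
        raise ValueError("lag must satisfy 0 <= lag < len(values)")
    mean = sum(values) / n
    total_ss = sum((v - mean) ** 2 for v in values)
    if total_ss == 0:
        raise ValueError("series is constant; autocorrelation is undefined")
    # Covariance between each window and the window `lag` steps later.
    lagged = sum((values[i] - mean) * (values[i + lag] - mean)
                 for i in range(n - lag))
    return lagged / total_ss


# Hypothetical theta power, one value per consecutive 1-second window.
theta_power = [0.30, 0.42, 0.28, 0.35, 0.40, 0.31, 0.37, 0.29]
lag1 = autocorrelation(theta_power, 1)

# A steadily rising series, by contrast, is positively autocorrelated at lag 1.
trend_lag1 = autocorrelation([1, 2, 3, 4, 5], 1)
```

      The oscillation itself happens within each window and determines that window's power value; the autocorrelation only asks whether a window with high power tends to be followed by another window with high power.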

      F. Alpha power is introduced later on, and in the discussion, it is mentioned that the effects that were found go against the authors' expectations. However, alpha power and the authors' expectations about it are not mentioned in the introduction. 

      We thank the reviewer for this comment. We have added a paragraph on alpha in the introduction (pg.4).

      Minor points:

      (1) At the end of the 1st page of the introduction, the authors state that:

      “How children allocate their attention in experimenter-controlled, screen-based lab tasks differs, however, from actual real-world attention in several ways (32-34). For example, the real-world is interactive and manipulable, and so how we interact with the world determines what information we, in turn, receive from it: experiences generate behaviours (35).”

      I think there's more to this though - Lab-based studies can be made interactive too (e.g., Meyer et al., 2023, Stahl & Feigenson, 2015). What remains unexplored is how infants actively and freely initiate and self-structure their attention, rather than how they respond to experimental manipulations.

      Meyer, M., van Schaik, J. E., Poli, F., & Hunnius, S. (2023). How infant‐directed actions enhance infants' attention, learning, and exploration: Evidence from EEG and computational modeling. Developmental Science, 26(1), e13259.

      Stahl, A. E., & Feigenson, L. (2015). Observing the unexpected enhances infants' learning and exploration. Science, 348(6230), 91-94.

      We thank the reviewer for this suggestion and have added their point on pg. 4.

      (2) Regarding analysis 4:

      a. In analysis 1 you showed that the duration of attentional episodes changes with age. Is it fair to keep the same start, middle, and termination ranges across age groups? Is 3-4 seconds "middle" for 5-month-olds? 

      We appreciate the comment. There are many ways we could have run these analyses and, in fact, in other papers we have done it differently, for example by splitting each look in 3, irrespective of its duration (Phillips et al., 2023).

      However, one aspect we took into account was the observation that 5-month-old infants exhibited more short looks than older infants. We recognized that dividing each look into 3 parts, regardless of its duration, might have impacted the results. Presumably, the activity during the middle and termination phases of a 1.5-second look differs from that of a look lasting over 7 seconds.

      Two additional factors that provided us with confidence in our approach were: 1) while the definition of "middle" was somewhat arbitrary, it allowed us to maintain consistency in our analyses across different age points; and 2) we obtained a comparable number of observations across the two time points (e.g. for “middle” we had 172 events at 5 months and 194 events at 10 months).

      b. It is recommended not to interpret lower-level interactions if more complex interactions are not significant. How are the interaction effects in a simpler model in which the 3-way interaction is removed? 

      We appreciate the comment. We tried to follow the same steps as in (Xie et al., 2018). However, we have re-analysed the data removing the 3-way interaction and the significance of the results stayed the same. Please see Author response image 2 below (first: new analyses without the 3-way interactions, second: original analyses that included the 3-way interaction).

      Author response image 2.

      (3) Figure S1: there seems to be an outlier in the bottom-right panel. Do results hold excluding it? 

      We re-ran these analyses as per this suggestion and the results stayed the same (refer to SM pg. 2).

      (4) Figure S2 should refer to 10 months instead of 12.

      We thank the reviewer for noticing this typo; we have changed it in the revised manuscript (see SM pg. 3).

      (5) In the 2nd paragraph of the discussion, I found this sentence unclear: "From Analysis 1 we found that infants at both ages showed a preferred modal reorientation rate". 

      We clarified this in the revised manuscript on pg. 10.

      (6) Discussion: many (infant) studies have used theta in anticipation of receiving information (Begus et al., 2016) surprising events (Meyer et al., 2023), and especially exploration (Begus et al., 2015). Can you make a broader point on how these findings inform our interpretation of theta in the infant population (go more from description to underlying mechanisms)? 

      We have expanded on this point on interpreting frequency bands on pg. 13 of the revised manuscript and thank the reviewer for bringing it up.

      Begus, K., Gliga, T., & Southgate, V. (2016). Infants' preferences for native speakers are associated with an expectation of information. Proceedings of the National Academy of Sciences, 113(44), 12397-12402.

      Meyer, M., van Schaik, J. E., Poli, F., & Hunnius, S. (2023). How infant‐directed actions enhance infants' attention, learning, and exploration: Evidence from EEG and computational modeling. Developmental Science, 26(1), e13259.

      Begus, K., Southgate, V., & Gliga, T. (2015). Neural mechanisms of infant learning: differences in frontal theta activity during object exploration modulate subsequent object recognition. Biology letters, 11(5), 20150041.

      (7) 2nd page of discussion, last paragraph: "preferred modal reorientation timer" is not a neural/cognitive mechanism, just a resulting behaviour. 

      We agree with this comment and thank the reviewer for bringing it to our attention. We clarified this on pg. 12 and pg. 13 of the revised manuscript.

      Reviewer #2 (Recommendations For The Authors):

      I have a few comments and questions that I think the authors should consider addressing in a revised version. Please see below:

      (1) During preprocessing (steps 5 and 6), it seems like the "noisy channels" were rejected using the pop_rejchan.m function and then interpolated. This procedure is common in infant EEG analysis, but a concern arises: was there no upper limit for channel interpolation? Did the authors still perform bad channel interpolation even when more than 30% or 40% of the channels were identified as "bad" at the beginning with the continuous data? 

      We did state in the original manuscript that “participants with fewer than 30% channels interpolated at 5 months and 25% at 10 months made it to the final step (ICA) and final analyses”. In the revised version we have re-written this section in order to make this clearer (pg. 17).

      (2) I am also perplexed about the sequencing of the ICA pruning step. If the intention of ICA pruning is to eliminate artificial components, would it be more logical to perform this procedure before the conventional artifacts' rejection (i.e., step 7), rather than after? In addition, what was the methodology employed by the authors to identify the artificial ICA components? Was it done through manual visual inspection or utilizing specific toolboxes? 

      We agree that the ICA is often run before, however, the decision to reject continuous data prior to ICA was to remove the very worst sections of data (where almost all channels were affected), which can arise during times when infants fuss or pull the caps. Thus, this step was applied at this point in the pipeline so that these sections of really bad data were not inputted into the ICA. This is fairly widespread practice in cleaning infant data.

      Concerning the reviewer’s second question, on how ICA components were removed: this is described in considerable detail in the paper that we refer to in that section of the manuscript. This was done by training a classifier specially designed to clean naturalistic infant EEG data (Haresign et al., 2021), which has since been employed in similar studies (e.g. Georgieva et al., 2020; Phillips et al., 2023).

      (3) Please clarify how the relative power was calculated for the theta (3-6Hz) and alpha (6-9Hz) bands. Were they calculated by dividing the ratio of theta or alpha power to the power between 3 and 9Hz, or the total power between 1 (or 3) and 20 Hz? In other words, what does the term "all frequency bands" refer to in section 4.3.7? 

      We thank the reviewer for this comment; we have now clarified this on pg. 22.
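
      To make the two normalisations raised in this comment concrete, the following is a minimal sketch, not the pipeline used in the manuscript (whose definition is given on pg. 22 and is not reproduced here). It computes relative theta power on a synthetic signal using either the 3-9 Hz range or the 1-20 Hz range as the denominator. The function names, the synthetic signal, and the plain FFT periodogram are all assumptions made for illustration.

```python
import numpy as np


def band_power(freqs, psd, lo, hi):
    """Sum the spectral power of all frequency bins in [lo, hi) Hz."""
    mask = (freqs >= lo) & (freqs < hi)
    return psd[mask].sum()


def relative_power(freqs, psd, band, reference):
    """Power in `band` divided by power in the `reference` range."""
    return band_power(freqs, psd, *band) / band_power(freqs, psd, *reference)


# Hypothetical synthetic signal: 10 s at 250 Hz with a strong 4 Hz component
# (theta), a weaker 7 Hz component (alpha) and a weaker 12 Hz component.
fs = 250
n_samples = 10 * fs
t = np.arange(n_samples) / fs
signal = (2.0 * np.sin(2 * np.pi * 4 * t)
          + 1.0 * np.sin(2 * np.pi * 7 * t)
          + 1.0 * np.sin(2 * np.pi * 12 * t))

freqs = np.fft.rfftfreq(n_samples, d=1 / fs)
# Unnormalised periodogram; the scaling cancels when taking a ratio.
psd = np.abs(np.fft.rfft(signal)) ** 2

theta = (3, 6)
# Option 1: theta power relative to the theta + alpha range only (3-9 Hz).
theta_rel_narrow = relative_power(freqs, psd, theta, (3, 9))
# Option 2: theta power relative to a broad range (1-20 Hz).
theta_rel_broad = relative_power(freqs, psd, theta, (1, 20))
```

      In this synthetic case the same theta power yields a relative power of 0.8 against the 3-9 Hz reference and about 0.67 against the 1-20 Hz reference, which is why the choice of denominator needs to be stated.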

      (4) One of the key discoveries presented in this paper is the observation that attention shifts are accompanied by a subsequent enhancement in theta band power shortly after the shifts occur. Is it possible that this effect or alteration might be linked to infants' saccades, which are used as indicators of attention shifts? Would it be feasible to analyze the disparities in amplitude between the left and right frontal electrodes (e.g., Fp1 and Fp2, which could be viewed as virtual horizontal EOG channels) in relation to theta band power, in order to eliminate the possibility that the augmentation of theta power was attributable to the intensity of the saccades? 

      We appreciate the concern. Average saccade duration in infants is about 40ms (Garbutt et al., 2007). The positive cross-correlation between theta and look duration is present not only at zero lag but also when we examine how theta forward-predicts attention 1-2 seconds later; it therefore seems unlikely to be directly attributable to saccade-related artifact. Concerning the reviewer’s suggestion: this is something that we have tried in the past. Unfortunately, however, our experience is that identifying saccades based on the disparity between Fp1 and Fp2 is much too unreliable to be of any use in analysing data. Even if specially positioned HEOG electrodes are used, we still find the saccade detection to be insufficiently reliable. In ongoing work we are tracking eye movements separately, in order to be able to address this point more satisfactorily.
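
      The distinction drawn above between a zero-lag relationship and a forward-predictive one can be illustrated with a small lagged-correlation sketch. This is an illustration on synthetic data only, not the analysis used in the manuscript; the helper name `lagged_correlation`, the white-noise signals, and the two-sample delay are assumptions.

```python
import numpy as np


def lagged_correlation(x, y, lag):
    """Pearson correlation between x[t] and y[t + lag].

    A positive lag asks whether x predicts later values of y.
    """
    if lag > 0:
        x, y = x[:-lag], y[lag:]
    elif lag < 0:
        x, y = x[-lag:], y[:lag]
    return np.corrcoef(x, y)[0, 1]


rng = np.random.default_rng(0)
# Hypothetical theta-power time series (white noise for simplicity).
theta_power = rng.standard_normal(500)
# Hypothetical attention measure that follows theta by 2 samples, plus noise.
attention = np.roll(theta_power, 2) + 0.5 * rng.standard_normal(500)

# Correlation profile across negative (attention leads) and positive
# (theta leads) lags.
profile = {lag: lagged_correlation(theta_power, attention, lag)
           for lag in range(-5, 6)}
best_lag = max(profile, key=profile.get)
```

      Because the synthetic attention signal is a delayed copy of theta, the profile peaks at a positive lag rather than at zero. A brief artifact that is time-locked to a single event would instead be expected to concentrate its contribution around zero lag.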

      (5) The following question is related to my previous comment. Why is the duration of the relationship between theta power and moment-to-moment changes in attention so short? If theta is indeed associated with attention and information processing, shouldn't the relationship between the two variables strengthen as the attention episode progresses? Given that the authors themselves suggest that "One possible interpretation of this is that neural activity associates with the maintenance more than the initiation of attentional behaviors," it raises the question of (is in contradiction to) why the duration of the relationship is not longer but declines drastically (Figure 6). 

      We thank the reviewer for raising this excellent point. Certainly, we argue that this, together with the low autocorrelation values for theta documented in Fig 7A and 7D, challenges many conventional ways of interpreting theta. We are continuing to investigate this question in ongoing work.

      (6) Have the authors conducted a comparison of alpha relative power and HR deceleration durations between 5 and 10-month-old infants? This analysis could provide insights into whether the differences observed between the two age groups were partly due to varying levels of general arousal and engagement during free play.

      We thank the reviewer for this suggestion. Indeed, this is an aspect we investigated, but ultimately, given that our primary emphasis was on the theta frequency, and considering the length of the manuscript, we decided not to incorporate it. However, we have attached Author response image 3 below, showing that there was no significant interaction between HR and alpha band.

      Author response image 3.

      Reviewer #3 (Recommendations For The Authors):

      (1) In reading the manuscript, the language used seems to imply longitudinal data or at the very least the ability to detect change or maturation. Given the cross-sectional nature of the data, the language should be tempered throughout. The data are illustrative but not definitive. 

      We thank the reviewer for this comment. We have now clarified that “Data was analysed in a cross-sectional manner” in pg15.

      (2) The sample size is quite modest, particularly in the specific age groups. This is likely tempered by the sheer number of data points available. This latter argument is implied in the text, but not as explicitly noted. (However, I may have missed this as the text is quite dense). I think more notice is needed on the reliability and stability of the findings given the sample. 

      We have clarified this in pg16.

      (3) On a related note, how was the sample size determined? Was there a power analysis to help guide decision-making for both recruitment and choosing which analyses to proceed with? Again, the analytic approach is quite sophisticated and the questions are of central interest to researchers, but I was left feeling maybe these two aspects of the study were out-sprinting the available data. The general impression is that the sample is small, but it is not until looking at table s7, that it is in full relief. I think this should be more prominent in the main body of the study.

      We have clarified this in pg16.

      (4) The manuscript devotes a few sentences to the relation between looking and attention. However, this distinction is central to the design of the study, and any philosophical differences regarding what take-away points can be generated. In my reading, I think this point needs to be more heavily interrogated. 

      This distinction between looking and paying attention is clearer now in the revised manuscript as per R1 and R3’s suggestions. We have also added a paragraph in the Introduction to clarify it and point out its limitations (see pg. 5).

      (5) I would temper the real-world attention language. This study is certainly a great step forward, relative to static faces on a computer screen. However, there are still a great number of artificial constraints that have been added. That is not to say that the constraints are bad--they are necessary to carry out the work. However, it should be acknowledged that it constrains the external validity. 

      We have added a paragraph to acknowledge the limitations of the setup on pg. 14.

      (6) The kappa on the coding is not strong. The authors chose to proceed nonetheless. Given that, I think more information is needed on how coders were trained, how they were standardized, and what parameters were used to decide they were ready to code independently. Again, with the sample size and the kappa presented, I think more discussion is needed regarding the robustness of the findings. 

      We appreciate the concern. As per our answer to R1, we chose to report the most stringent measure of inter-rater reliability, but other calculation methods (i.e., percent agreement) return higher scores (see response to R1).
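
      The gap between a chance-corrected statistic and raw percent agreement can be shown with a short worked example. This is a sketch on made-up codes, assuming Cohen's kappa for two coders (the response does not specify which kappa variant was used); the label names and sequences are hypothetical.

```python
from collections import Counter


def percent_agreement(a, b):
    """Proportion of items on which the two coders gave the same label."""
    return sum(x == y for x, y in zip(a, b)) / len(a)


def cohens_kappa(a, b):
    """Cohen's kappa: observed agreement corrected for chance agreement."""
    n = len(a)
    p_observed = percent_agreement(a, b)
    count_a, count_b = Counter(a), Counter(b)
    # Chance agreement: product of each coder's marginal label proportions.
    p_chance = sum(count_a[c] * count_b[c] for c in set(a) | set(b)) / n ** 2
    return (p_observed - p_chance) / (1 - p_chance)


# Hypothetical frame-by-frame look codes from two coders (not real data).
coder_1 = ["object"] * 8 + ["partner"] * 2
coder_2 = ["object"] * 7 + ["partner", "object", "partner"]

agreement = percent_agreement(coder_1, coder_2)
kappa = cohens_kappa(coder_1, coder_2)
```

      Here the coders agree on 80% of frames, yet kappa is only 0.375, because one label dominates and much of the raw agreement would be expected by chance.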

      Regarding the training, we wrote an extensively detailed coding scheme, describing exactly how to code each look, that was handed to our coders. Throughout the initial months of training, we met with the coders on a weekly basis to discuss questions and individual frames that looked ambiguous. After each session, we would revise the coding scheme to incorporate additional details, aiming to make the coding process progressively less subjective. During this period, every coder analysed the same interactions, and inter-rater reliability (IRR) was assessed weekly, comparing their evaluations with mine (Marta). With time, the coders had fewer questions and IRR increased. At that point, we deemed them sufficiently trained, and began assigning them different interactions from each other. Periodically, though, we all assessed the same interaction and met to review and discuss our coding outputs.

    1. Author Response

      The following is the authors’ response to the original reviews.

      eLife assessment

      These ingenious and thoughtful studies present important findings concerning how people represent and generalise abstract patterns of sensory data. The issue of generalisation is a core topic in neuroscience and psychology, relevant across a wide range of areas, and the findings will be of interest to researchers across areas in perception, learning, and cognitive science. The findings have the potential to provide compelling support for the outlined account, but there appear other possible explanations, too, that may affect the scope of the findings but could be considered in a revision.

      Thank you for sending the feedback from the three peer reviewers regarding our paper. Please find below our detailed responses addressing the reviewers' comments. We have incorporated these suggestions into the paper and provided explanations for the modifications made.

      We have specifically addressed the point of uncertainty highlighted in eLife's editorial assessment, which concerned alternative explanations for the reported effect. In response to Reviewer #1, we have clarified how Exp. 2c and Exp. 3c address the potential alternative explanation related to "attention to dimensions." Further, we present a supplementary analysis to account for differences in asymptotic learning, as noted by Reviewer #2. We have also clarified how our control experiments address effects associated with general cognitive engagement in the task. Lastly, we have further clarified the conceptual foundation of our paper, addressing concerns raised by Reviewers #2 and #3.

      Reviewer #1 (Public Review):

      Summary:

      This manuscript reports a series of experiments examining category learning and subsequent generalization of stimulus representations across spatial and nonspatial domains. In Experiment 1, participants were first trained to make category judgments about sequences of stimuli presented either in nonspatial auditory or visual modalities (with feature values drawn from a two-dimensional feature manifold, e.g., pitch vs timbre), or in a spatial modality (with feature values defined by positions in physical space, e.g., Cartesian x and y coordinates). A subsequent test phase assessed category judgments for 'rotated' exemplars of these stimuli: i.e., versions in which the transition vectors are rotated in the same feature space used during training (near transfer) or in a different feature space belonging to the same domain (far transfer). Findings demonstrate clearly that representations developed for the spatial domain allow for representational generalization, whereas this pattern is not observed for the nonspatial domains that are tested. Subsequent experiments demonstrate that if participants are first pre-trained to map nonspatial auditory/visual features to spatial locations, then rotational generalization is facilitated even for these nonspatial domains. It is argued that these findings are consistent with the idea that spatial representations form a generalized substrate for cognition: that space can act as a scaffold for learning abstract nonspatial concepts.
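
      The rotation manipulation described in this summary can be sketched in a few lines. The code below is an illustration only: the coordinates, the 90-degree angle, and the function names are assumptions, not the stimuli or parameters used in the study. It shows that rotating a four-element sequence in a two-dimensional feature space changes every coordinate while leaving the relations between successive elements (step lengths and turning angles) intact.

```python
import numpy as np


def rotate(points, angle_deg):
    """Rotate 2D points (one per row) counter-clockwise about the origin."""
    theta = np.deg2rad(angle_deg)
    rotation = np.array([[np.cos(theta), -np.sin(theta)],
                         [np.sin(theta), np.cos(theta)]])
    return points @ rotation.T


def transitions(points):
    """Transition vectors between successive elements of the sequence."""
    return np.diff(points, axis=0)


def step_lengths(points):
    """Length of each transition vector."""
    return np.linalg.norm(transitions(points), axis=1)


def turn_angles(points):
    """Signed angle between each pair of successive transition vectors."""
    d = transitions(points)
    cross = d[:-1, 0] * d[1:, 1] - d[:-1, 1] * d[1:, 0]
    dot = (d[:-1] * d[1:]).sum(axis=1)
    return np.arctan2(cross, dot)


# Hypothetical quadruplet in a 2D feature space (e.g. x/y, or pitch/timbre).
quadruplet = np.array([[0.0, 0.0], [1.0, 0.0], [1.0, 1.0], [2.0, 1.0]])
rotated = rotate(quadruplet, 90)
```

      A learner who encodes only one feature dimension sees a different sequence after rotation, whereas a learner who encodes the two-dimensional relations sees the same pattern; this is the contrast that the 1D and 2D models are meant to capture.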

      Strengths:

      I enjoyed reading this manuscript, which is extremely well-written and well-presented. The writing is clear and concise throughout, and the figures do a great job of highlighting the key concepts. The issue of generalization is a core topic in neuroscience and psychology, relevant across a wide range of areas, and the findings will be of interest to researchers across areas in perception and cognitive science. It's also excellent to see that the hypotheses, methods, and analyses were pre-registered.

      The experiments that have been run are ingenious and thoughtful; I particularly liked the use of stimulus structures that allow for disentangling of one-dimensional and two-dimensional response patterns. The studies are also well-powered for detecting the effects of interest. The model-based statistical analyses are thorough and appropriate throughout (and it's good to see model recovery analysis too). The findings themselves are clear-cut: I have little doubt about the robustness and replicability of these data.

      Weaknesses:

      I have only one significant concern regarding this manuscript, which relates to the interpretation of the findings. The findings are taken to suggest that "space may serve as a 'scaffold', allowing people to visualize and manipulate nonspatial concepts" (p13). However, I think the data may be amenable to an alternative possibility. I wonder if it's possible that, for the visual and auditory stimuli, participants naturally tended to attend to one feature dimension and ignore the other - i.e., there may have been a (potentially idiosyncratic) difference in salience between the feature dimensions that led to participants learning the feature sequence in a one-dimensional way (akin to the 'overshadowing' effect in associative learning: e.g., see Mackintosh, 1976, "Overshadowing and stimulus intensity", Animal Learning and Behaviour). By contrast, we are very used to thinking about space as a multidimensional domain, in particular with regard to two-dimensional vertical and horizontal displacements. As a result, one would naturally expect to see more evidence of two-dimensional representation (allowing for rotational generalization) for spatial than nonspatial domains.

      In this view, the impact of spatial pre-training and (particularly) mapping is simply to highlight to participants that the auditory/visual stimuli comprise two separable (and independent) dimensions. Once they understand this, during subsequent training, they can learn about sequences on both dimensions, which will allow for a 2D representation and hence rotational generalization - as observed in Experiments 2 and 3. This account also anticipates that mapping alone (as in Experiment 4) could be sufficient to promote a 2D strategy for auditory and visual domains.

      This "attention to dimensions" account has some similarities to the "spatial scaffolding" idea put forward in the article, in arguing that experience of how auditory/visual feature manifolds can be translated into a spatial representation helps people to see those domains in a way that allows for rotational generalization. Where it differs is that it does not propose that space provides a scaffold for the development of the nonspatial representations, i.e., that people represent/learn the nonspatial information in a spatial format, and this is what allows them to manipulate nonspatial concepts. Instead, the "attention to dimensions" account anticipates that ANY manipulation that highlights to participants the separable-dimension nature of auditory/visual stimuli could facilitate 2D representation and hence rotational generalization. For example, explicit instruction on how the stimuli are constructed may be sufficient, or pre-training of some form with each dimension separately, before they are combined to form the 2D stimuli.

      I'd be interested to hear the authors' thoughts on this account - whether they see it as an alternative to their own interpretation, and whether it can be ruled out on the basis of their existing data.

      We thank the Reviewer for their comments. We agree with the Reviewer that the “attention to dimensions” hypothesis is an interesting alternative explanation. However, we believe that the results of our control experiments Exp. 2c and Exp. 3c are incompatible with this alternative explanation.

      In Exp. 2c, participants are pre-trained in the visual modality and then tested in the auditory modality. In the multimodal association task, participants have to associate the auditory stimuli and the visual stimuli: on each trial, they hear a sound and then have to click on the corresponding visual stimulus. It is thus necessary to pay attention to both auditory dimensions and both visual dimensions to perform the task. To give an example, the task might involve mapping the fundamental frequency and the amplitude modulation of the auditory stimulus to the colour and the shape of the visual stimulus, respectively. If participants pay attention to only one dimension, this would lead to a maximum of 25% accuracy on average (because they would be at chance on the other dimension, with four possible options). We observed that 30/50 participants reached an accuracy > 50% in the multimodal association task in Exp. 2c. This means that we know for sure that at least 60% of the participants paid attention to both dimensions of the stimuli. Nevertheless, there was a clear difference between participants that received a visual pre-training (Exp. 2c) and those who received a spatial pre-training (Exp. 2a) (frequency of 1D vs 2D models between conditions, BF > 100 in near transfer and far transfer). In fact, only 3/50 participants were best fit by a 2D model when vision was the pre-training modality compared to 29/50 when space was the pre-training modality. Thus, the benefit of the spatial pre-training cannot be due solely to a shift in attention toward both dimensions.

      This effect was replicated in Exp. 3c. Similarly, 33/48 participants reached an accuracy > 50% in the multimodal association task in Exp. 3c, meaning that we know for sure that at least 68% of the participants actually paid attention to both dimensions of the stimuli. Again, there was a clear difference between participants who received a visual pre-training (Exp. 3c) and those who received a spatial pre-training (Exp. 3a) (frequency of 1D vs 2D models between conditions, BF > 100 in near transfer and far transfer).

      Thus, we believe that the alternative explanation raised by the Reviewer is not supported by our data. We have added a paragraph in the discussion:

      “One alternative explanation of this effect could be that the spatial pre-training encourages participants to attend to both dimensions of the non-spatial stimuli. By contrast, pretraining in the visual or auditory domains (where multiple dimensions of a stimulus may be relevant less often naturally) encourages them to attend to a single dimension. However, data from our control experiments Exp. 2c and Exp. 3c are incompatible with this explanation. Around 65% of the participants show a level of performance in the multimodal association task (>50%) which could only be achieved if they were attending to both dimensions (performance attending to a single dimension would yield 25% and chance performance is at 6.25%). This suggests that participants are attending to both dimensions even in the visual and auditory mapping case.”
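
      The percentages quoted above follow from simple arithmetic, reproduced here as a check. The sketch assumes two stimulus dimensions with four options each, which is inferred from the stated chance level of 6.25% and the "four possible options" mentioned in the response; the participant counts are those reported for Exp. 2c and Exp. 3c.

```python
from fractions import Fraction

# Assumed task structure: 2 dimensions x 4 options each = 16 response options.
options_per_dimension = 4
n_dimensions = 2

# Guessing on both dimensions.
chance = Fraction(1, options_per_dimension) ** n_dimensions
# Perfect on the attended dimension, guessing on the other.
one_dimension_attended = Fraction(1, options_per_dimension)

# Participants exceeding 50% accuracy, as reported in the response.
above_threshold = {"Exp. 2c": (30, 50), "Exp. 3c": (33, 48)}
proportions = {name: Fraction(k, n) for name, (k, n) in above_threshold.items()}
pooled = Fraction(30 + 33, 50 + 48)

for name, p in proportions.items():
    print(f"{name}: {float(p):.1%}")
print(f"chance: {float(chance):.2%}, one dimension: {float(one_dimension_attended):.0%}")
print(f"pooled: {float(pooled):.1%}")
```

      The 50% threshold sits well above the 25% ceiling for a single-dimension strategy, and pooling the two experiments gives 63/98, roughly 64%, which is the basis for the "around 65%" figure in the quoted paragraph.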

      Reviewer #2 (Public Review):

      Summary:

      In this manuscript, L&S investigate the important general question of how humans achieve invariant behavior over stimuli belonging to one category given the widely varying input representation of those stimuli and more specifically, how they do that in arbitrary abstract domains. The authors start with the hypothesis that this is achieved by invariance transformations that observers use for interpreting different entries and furthermore, that these transformations in an arbitrary domain emerge with the help of the transformations (e.g. translation, rotation) within the spatial domain by using those as "scaffolding" during transformation learning. To provide the missing evidence for this hypothesis, L&S used behavioral category learning studies within and across the spatial, auditory, and visual domains, where participants had to learn to categorize rotated and translated 4-element token sequences and then apply the learned transformation in new feature dimensions within the given domain. Through single- and multiple-day supervised training and unsupervised tests, L&S demonstrated by standard computational analyses that in such setups, space and spatial transformations can, indeed, help with developing and using appropriate rotational mapping whereas the visual domain cannot fulfill such a scaffolding role.

      Strengths:

      The overall problem definition and the context of spatial mapping-driven solution to the problem is timely. The general design of testing the scaffolding effect across different domains is more advanced than any previous attempts clarifying the relevance of spatial coding to any other type of representational codes. Once the formulation of the general problem in a specific scientific framework is done, the following steps are clearly and logically defined and executed. The obtained results are well interpretable, and they could serve as a good stepping stone for deeper investigations. The analytical tools used for the interpretations are adequate. The paper is relatively clearly written.

      Weaknesses:

      Some additional effort to clarify the exact contribution of the paper, the link between analyses and the claims of the paper, and its link to previous proposals would be necessary to better assess the significance of the results and the true nature of the proposed mechanism of abstract generalization.

      (1) Insufficient conceptual setup: The original theoretical proposal (the Tolman-Eichenbaum-Machine, Whittington et al., Cell 2020) that L&S relate their work to proposes that just as in the case of memory for spatial navigation, humans and animals create their flexible relational memory system of any abstract representation by a conjunction code that combines on the one hand, sensory representation and on the other hand, a general structural representation or relational transformation. The TEM also suggests that the structural representation could contain any graph-interpretable spatial relations, albeit in their demonstration 2D neighbor relations were used. The goal of L&S's paper is to provide behavioral evidence for this suggestion by showing that humans use representational codes that are invariant to relational transformations of non-spatial abstract stimuli and moreover, that humans obtain these invariances by developing invariance transformers with the help of available spatial transformers. To obtain such evidence, L&S use the rotational transformation. However, the actual procedure they use actually solved an alternative task: instead of interrogating how humans develop generalizations in abstract spaces, they demonstrated that if one defines rotation in an abstract feature space embedded in a visual or auditory modality that is similar to the 2D space (i.e. has two independent dimensions that are clearly segregable and continuous), humans cannot learn to apply rotation of 4-piece temporal sequences in those spaces while they can do it in 2D space, and with co-associating a one-to-one mapping between locations in those feature spaces with locations in the 2D space an appropriate shaping mapping training will lead to the successful application of rotation in the given task (and in some other feature spaces in the given domain). 
      While this is an interesting and challenging demonstration, it does not shed light on how humans learn and generalize, only that humans CAN do learning and generalization in this highly constrained scenario. This result is a demonstration of how a stepwise learning regimen can make use of one structure for mapping a complex input into a desired output. The results neither clarify how generalizations would develop in abstract spaces nor the question of whether this generalization uses transformations developed in the abstract space. The specific training procedure ensures success in the presented experiments but the availability and feasibility of an equivalent procedure in a natural setting is a crucial part of validating the original claim and that has not been done in the paper.

      We thank the Reviewer for their detailed comments on our manuscript. We reply to the three main points in turn.

      First, concerning the conceptual grounding of our work, we would point out that the TEM model (Whittington et al., 2020), however interesting, is not our theoretical starting point. Rather, as we hope the text and references make clear, we ground our work in theoretical work from the 1990s/2000s proposing that space acts as a scaffold for navigating abstract spaces (such as Gärdenfors, 2000). We acknowledge that the TEM model and other experimental work on the implication of the hippocampus, the entorhinal cortex and the parietal cortex in relational transformations of nonspatial stimuli provide evidence for this general theory. However, our work is designed to test a more basic question: whether there is behavioural evidence that space scaffolds learning in the first place. To achieve this, we perform behavioural experiments with a causal manipulation (spatial pre-training vs no spatial pre-training), which have the potential to provide such direct evidence. This is why we claim that:

      “This theory is backed up by proof-of-concept computational simulations [13], and by findings that brain regions thought to be critical for spatial cognition in mammals (such as the hippocampal-entorhinal complex and parietal cortex) exhibit neural codes that are invariant to relational transformations of nonspatial stimuli. However, whilst promising, this theory lacks direct empirical evidence. Here, we set out to provide a strong test of the idea that learning about physical space scaffolds conceptual generalisation.”

      Second, we agree with the Reviewer that we do not provide an explicit model for how generalisation occurs, and how precisely space acts as a scaffold for building representations and/or applying the relevant transformations to non-spatial stimuli to solve our task. Rather, we investigate in our Exp. 2-4 which aspects of the training are necessary for rotational generalisation to happen (and conclude that a simple training with the multimodal association task is sufficient for ~20% of participants). We now acknowledge in the discussion the fact that we do not provide an explicit model and leave that for future work:

      “We acknowledge that our study does not provide a mechanistic model of spatial scaffolding but rather delineates which aspects of the training are necessary for generalisation to happen.”

      Finally, we also agree with the Reviewer that our task is non-naturalistic. As is common in experimental research, one must sacrifice the naturalistic elements of the task in exchange for experimental control and the absence of prior knowledge in the participants. We decided to minimise participants' prior knowledge as far as possible, to make sure that our task involved learning something completely new and that the pre-training was really what caused the better learning/generalisation. The effects we report are consistent across the experiments, so we feel confident about them, but we agree with the Reviewer that an external validation with more naturalistic stimuli/tasks would be a nice addition to this work. We have included a sentence in the discussion:

      “All the effects observed in our experiments were consistent across near transfer conditions (rotation of patterns within the same feature space), and far transfer conditions (rotation of patterns within a different feature space, where features are drawn from the same modality). This shows the generality of spatial training for conceptual generalisation. We did not test transfer across modalities nor transfer in a more natural setting; we leave this for future studies.”

      (2) Missing controls: The asymptotic performance in experiment 1 after training in the three tasks was quite different in the three tasks (intercepts 2.9, 1.9, 1.6 for spatial, visual, and auditory, respectively; p. 5. para. 1, Fig 2BFJ). It seems that the statement "However, our main question was how participants would generalise learning to novel, rotated exemplars of the same concept." assumes that learning and generalization are independent. Wouldn't it be possible, though, that the level of generalization depends on the level of acquiring a good representation of the "concept" and after obtaining an adequate level of this knowledge, generalization would kick in without scaffolding? If so, a missing control is to equate the levels of asymptotic learning and see whether there is a significant difference in generalization. A related issue is that we have no information on what kind of learning in the three different domains was performed, albeit we probably suspect that in space the 2D representation was dominant while in the auditory and visual domains not so much. Thus, a second missing piece of evidence is the model-fitting results of the ⦰ condition that would show which way the original sequences were encoded (similar to Fig 2 CGK and DHL). If the reason for lower performance is not individual stimulus difficulty but the natural tendency to encode the given stimulus type by a combo of random + 1D strategy that would clarify that the result of the cross-training is, indeed, transferring the 2D-mapping strategy.

      We agree with the Reviewer that a good further control is to equate performance during training. Thus, we have run a complementary analysis where we select only the participants that reach > 90% accuracy in the last block of training in order to equate asymptotic performance after training in Exp. 1. The results (see Author response image 1) replicate those that we report in the main text: there is a large difference between groups (relative likelihood of 1D vs. 2D models, all BF > 100 in favour of a difference between the auditory and the spatial modalities, between the visual and the spatial modalities, in both near and far transfer, “decisive” evidence). We prefer not to include this figure in the paper for clarity, and because we believe this result is expected given the fact that 0/50 and 0/50 of the participants in the auditory and visual conditions used a 2D strategy; thus, selecting subgroups of these participants cannot change our conclusions.

      Author response image 1.

      Results of Exp. 1 when selecting participants that reached > 90% accuracy in the last block of training. Captions are the same as Figure 2 of the main text.

      Second, the Reviewer suggested that we run the model fitting analysis only on the ⦰ condition (training) in Exp. 1 to reveal whether participants use a 1D or a 2D strategy already during training. Unfortunately, we cannot provide the model fits only in the ⦰ condition in Exp. 1 because all models make the same predictions for this condition (see Fig S4). However, note that this is by design: participants were free to apply whatever strategy they wanted during training; we then used the generalisation phase with the rotated stimuli precisely to reveal this strategy. Further, we do believe that the strategy used by the participants during training and the strategy during transfer are the same, partly because, starting from block #4, participants have no idea whether the current trial is a training trial or a transfer trial, as both trial types are randomly interleaved with no cue signalling the trial type. We have made this clear in the methods:

      “They subsequently performed 105 trials (with trialwise feedback) and 105 transfer trials including rotated and far transfer quadruplets (without trialwise feedback) which were presented in mixed blocks of 30 trials. Training and transfer trials were randomly interleaved, and no clue indicated whether participants were currently on a training trial or a transfer trial before feedback (or absence of feedback in case of a transfer trial).”

      Reviewer #3 (Public Review):

      Summary:

      Pesnot Lerousseau and Summerfield aimed to explore how humans generalize abstract patterns of sensory data (concepts), focusing on whether and how spatial representations may facilitate the generalization of abstract concepts (rotational invariance). Specifically, the authors investigated whether people can recognize rotated sequences of stimuli in both spatial and nonspatial domains and whether spatial pre-training and multi-modal mapping aid in this process.

      Strengths:

      The study innovatively examines a relatively underexplored but interesting area of cognitive science, the potential role of spatial scaffolding in generalizing sequences. The experimental design is clever and covers different modalities (auditory, visual, spatial), utilizing a two-dimensional feature manifold. The findings are backed by strong empirical data, good data analysis, and excellent transparency (including preregistration) adding weight to the proposition that spatial cognition can aid abstract concept generalization.

      Weaknesses:

      The examples used to motivate the study (such as "tree" = oak tree, family tree, taxonomic tree) may not effectively represent the phenomena being studied, possibly confusing linguistic labels with abstract concepts. This potential confusion may also extend to doubts about the real-life applicability of the generalizations observed in the study and raises questions about the nature of the underlying mechanism being proposed.

      We thank the Reviewer for their comments. We agree that we could have explained more clearly how these examples motivate our study. The similarity between “oak tree” and “family tree” is not just the verbal label. Rather, it is the arrangement of the parts (nodes and branches) in a nested hierarchy. Oak trees and family trees share the same relational structure. The reason that invariance is relevant here is that the similarity in relational structure is retained under rigid body transformations such as rotation or translation. For example, an upside-down tree can still be recognised as a tree, just as a family tree can be plotted with the oldest ancestors at either top or bottom. Similarly, in our study, the quadruplets are defined by the relations between stimuli: all quadruplets use the same basic stimuli, but the categories are defined by the relations between successive stimuli. In our task, generalising means recognising that relations between stimuli are the same despite changes in the surface properties (for example in far transfer). We have clarified this in the introduction:

      “For example, the concept of a “tree” implies an entity whose structure is defined by a nested hierarchy, whether this is a physical object whose parts are arranged in space (such as an oak tree in a forest) or a more abstract data structure (such as a family tree or taxonomic tree). [...] Despite great changes in the surface properties of oak trees, family trees and taxonomic trees, humans perceive them as different instances of a more abstract concept defined by the same relational structure.”

      Next, the study does not explore whether scaffolding effects could be observed with other well-learned domains, leaving open the question of whether spatial representations are uniquely effective or simply one instance of a familiar 2D space, again questioning the underlying mechanism.

      We would like to mention that Reviewer #2 had a similar comment. We agree with both Reviewers that our task is non-naturalistic. As is common in experimental research, one must sacrifice the naturalistic elements of the task in exchange for control and the absence of prior knowledge in the participants. We decided to minimise the participants’ prior knowledge as far as possible, to make sure that our task involved learning a completely new task and that the pre-training was really causing the better learning/generalisation. The effects we report are consistent across the experiments, so we feel confident about them, but we agree with the Reviewer that an external validation with more naturalistic stimuli/tasks would be a nice addition to this work. We have included a sentence in the discussion:

      “All the effects observed in our experiments were consistent across near transfer conditions (rotation of patterns within the same feature space), and far transfer conditions (rotation of patterns within a different feature space, where features are drawn from the same modality). This shows the generality of spatial training for conceptual generalisation. We did not test transfer across modalities nor transfer in a more natural setting; we leave this for future studies.”

      Further doubt on the underlying mechanism is cast by the possibility that the observed correlation between mapping task performance and the adoption of a 2D strategy may reflect general cognitive engagement rather than the spatial nature of the task. Similarly, the surprising finding that a significant number of participants benefited from spatial scaffolding without seeing spatial modalities may further raise questions about the interpretation of the scaffolding effect, pointing towards potential alternative interpretations, such as shifts in attention during learning induced by pre-training without changing underlying abstract conceptual representations.

      The Reviewer is concerned that the spatial pre-training could benefit the participants by increasing global cognitive engagement rather than providing a scaffold for learning invariances. It is correct that the participants in the control group in Exp. 2c perform worse on average than participants that benefit from the spatial pre-training in Exp. 2a and 2b. The better performance of the participants in Exp. 2a and 2b could be due to either the spatial nature of the pre-training (as we claim) or a difference in general cognitive engagement.

      However, if we look closely at the results of Exp. 3, we can see that the general cognitive engagement hypothesis is not well supported by the data. Indeed, the participants in the control condition (Exp. 3c) perform similarly to the other groups during training. Rather, the difference is in the strategy they use, as revealed by the transfer condition. The majority of them are using a 1D strategy, contrary to the participants that benefited from a spatial pre-training (Exp. 3a and 3b). We have included a sentence in the results:

      “Further, the results show that participants who did not experience spatial pre-training were still engaged in the task, but were not using the same strategy as the participants who experienced spatial pre-training (1D rather than 2D). Thus, the benefit of the spatial pre-training is not simply to increase the cognitive engagement of the participants. Rather, spatial pre-training provides a scaffold to learn rotation-invariant representation of auditory and visual concepts even when rotation is never explicitly shown during pre-training.”

      Finally, Reviewer #1 had a related concern about a potential alternative explanation that involved a shift in attention. We reproduce our response here: we agree with the Reviewer that the “attention to dimensions” hypothesis is an interesting (and potentially concerning) alternative explanation. However, we believe that the results of our control experiments Exp. 2c and Exp. 3c are not compatible with this alternative explanation.

      Indeed, in Exp. 2c, participants are pre-trained in the visual modality and then tested in the auditory modality. In the multimodal association task, participants have to associate the auditory stimuli and the visual stimuli: on each trial, they hear a sound and then have to click on the corresponding visual stimulus. It is necessary to pay attention to both auditory dimensions and both visual dimensions to perform well in the task. To give an example, the task might involve mapping the fundamental frequency and the amplitude modulation of the auditory stimulus to the colour and the shape of the visual stimulus, respectively. If participants pay attention to only one dimension, this would lead to a maximum of 25% accuracy on average (because they would be at chance on the other dimension, with four possible options). We observed that 30/50 participants reached an accuracy > 50% in the multimodal association task in Exp. 2c. This means that we know for sure that at least 60% of the participants actually paid attention to both dimensions of the stimuli. Nevertheless, there was a clear difference between participants that received a visual pre-training (Exp. 2c) and those who received a spatial pre-training (Exp. 2a) (frequency of 1D vs 2D models between conditions, BF > 100 in near transfer and far transfer). In fact, only 3/50 participants were best fit by a 2D model when vision was the pre-training modality compared to 29/50 when space was the pre-training modality. Thus, the benefit of the spatial pre-training cannot be due solely to a shift in attention toward both dimensions.

      This effect was replicated in Exp. 3c. Similarly, 33/48 participants reached an accuracy > 50% in the multimodal association task in Exp. 3c, meaning that we know for sure that at least 68% of the participants actually paid attention to both dimensions of the stimuli. Again, there was a clear difference between participants who received a visual pre-training (frequency of 1D vs 2D models between conditions, Exp. 3c) and those who received a spatial pre-training (Exp. 3a) (BF > 100 in near transfer and far transfer).
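As a rough illustration of how model frequencies such as 3/50 versus 29/50 can translate into a Bayes factor, the sketch below compares two binomial proportions. It is not necessarily the test used in the paper: the uniform Beta(1, 1) priors and the function names `log_beta` and `bayes_factor_different_rates` are assumptions made for this example.

```python
from math import exp, lgamma


def log_beta(a: float, b: float) -> float:
    """Natural log of the Beta function B(a, b)."""
    return lgamma(a) + lgamma(b) - lgamma(a + b)


def bayes_factor_different_rates(k1: int, n1: int, k2: int, n2: int) -> float:
    """BF in favour of two different rates over one shared rate.

    Assumes uniform Beta(1, 1) priors on every rate (an illustrative choice).
    The binomial coefficients cancel in the ratio, so they are omitted.
    """
    # H1: each group has its own rate.
    log_m1 = log_beta(k1 + 1, n1 - k1 + 1) + log_beta(k2 + 1, n2 - k2 + 1)
    # H0: both groups share a single rate.
    log_m0 = log_beta(k1 + k2 + 1, n1 + n2 - k1 - k2 + 1)
    return exp(log_m1 - log_m0)


# Counts of participants best fit by a 2D model, taken from the response:
# 3/50 after visual pre-training (Exp. 2c), 29/50 after spatial pre-training (Exp. 2a).
bf = bayes_factor_different_rates(3, 50, 29, 50)
print(f"BF (different rates vs. shared rate): {bf:.3g}")
```

Under these assumptions the Bayes factor for 3/50 versus 29/50 lies far above the cutoff of 100, whereas two identical proportions give a value below 1.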

      Thus, we believe that the alternative explanation raised by the Reviewer is not supported by our data. We have added a paragraph in the discussion:

      “One alternative explanation of this effect could be that the spatial pre-training encourages participants to attend to both dimensions of the non-spatial stimuli. By contrast, pretraining in the visual or auditory domains (where multiple dimensions of a stimulus may be relevant less often naturally) encourages them to attend to a single dimension. However, data from our control experiments Exp. 2c and Exp. 3c, are incompatible with this explanation. Around ~65% of the participants show a level of performance in the multimodal association task (>50%) which could only be achieved if they were attending to both dimensions (performance attending to a single dimension would yield 25% and chance performance is at 6.25%). This suggests that participants are attending to both dimensions even in the visual and auditory mapping case.”
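The accuracy thresholds in this argument follow from simple counting, which can be sketched as follows. The sketch assumes two stimulus dimensions with four levels each (16 response options), that an attended dimension is always matched correctly, and that an unattended dimension is guessed uniformly at random; `expected_accuracy` is a name introduced here for illustration.

```python
from fractions import Fraction


def expected_accuracy(levels_per_dim: int, n_dims: int, n_attended: int) -> Fraction:
    """Expected accuracy when attended dimensions are always correct and
    each unattended dimension is guessed uniformly at random."""
    if not 0 <= n_attended <= n_dims:
        raise ValueError("n_attended must be between 0 and n_dims")
    return Fraction(1, levels_per_dim) ** (n_dims - n_attended)


chance = expected_accuracy(4, 2, 0)   # no dimension attended: 1/16
one_dim = expected_accuracy(4, 2, 1)  # one dimension attended: 1/4
print(float(chance), float(one_dim))  # 0.0625 0.25

# Proportion of participants above the 50% threshold, from the response.
exp_2c = Fraction(30, 50)
exp_3c = Fraction(33, 48)
print(float(exp_2c), float(exp_3c))   # 0.6 0.6875
```

Because 50% exceeds the 25% ceiling for single-dimension attention, any participant above that threshold must have used both dimensions under these assumptions.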

      Conclusions:

      The authors successfully demonstrate that spatial training can enhance the ability to generalize in nonspatial domains, particularly in recognizing rotated sequences. The results for the most part support their conclusions, showing that spatial representations can act as a scaffold for learning more abstract conceptual invariances. However, the study leaves room for further investigation into whether the observed effects are unique to spatial cognition or could be replicated with other forms of well-established knowledge, as well as further clarifications of the underlying mechanisms.

      Impact:

      The study's findings are likely to have a valuable impact on cognitive science, particularly in understanding how abstract concepts are learned and generalized. The methods and data can be useful for further research, especially in exploring the relationship between spatial cognition and abstract conceptualization. The insights could also be valuable for AI research, particularly in improving models that involve abstract pattern recognition and conceptual generalization.

      In summary, the paper contributes valuable insights into the role of spatial cognition in learning abstract concepts, though it invites further research to explore the boundaries and specifics of this scaffolding effect.

      Reviewer #1 (Recommendations For The Authors):

      Minor issues / typos:

      P6: I think the example of the "signed" mapping here should be "e.g., ABAB maps to one category and BABA maps to another", rather than "ABBA maps to another" (since ABBA would always map to another category, whether the mapping is signed or unsigned).

      Done.

      P11: "Next, we asked whether pre-training and mapping were systematically associated with 2Dness...". I'd recommend changing to: "Next, we asked whether accuracy during pre-training and mapping were systematically associated with 2Dness...", just to clarify what the analyzed variables are.

      Done.

      P13, paragraph 1: "only if the features were themselves are physical spatial locations" either "were" or "are" should be removed.

      Done.

      P13, paragraph 1: should be "neural representations of space form a critical substrate" (not "for").

      Done.

      Reviewer #2 (Recommendations For The Authors):

      The authors use in multiple places in the manuscript the phrases "learn invariances" (Abstract), "formation of invariances" (p. 2, para. 1), etc. It might be just me, but this feels a bit like 'sloppy' wording: we do not learn or form invariances, rather we learn or form representations or transformations by which we can perform tasks that require invariance over particular features or transformation of the input such as the case of object recognition and size- translation- or lighting-invariance. We do not form size invariance, we have representations of objects and/or size transformations allowing the recognition of objects of different sizes. The authors might change this way of referring to the phenomenon.

      We respectfully disagree with this comment. An invariance occurs when neurons make the same response under different stimulation patterns. The objects or features to which a neuron responds are shaped by its inputs. Those inputs are in turn determined by experience-dependent plasticity. This process is often called “representation learning”. We think that our language here is consistent with this status quo view in the field.

      Reviewer #3 (Recommendations For The Authors):

      • I understand that the objective of the present experiment is to study our ability to generalize abstract patterns of sensory data (concepts). In the introduction, the authors present examples like the concept of a "tree" (encompassing a family tree, an oak tree, and a taxonomic tree) and "ring" to illustrate the idea. However, I am sceptical as to whether these examples effectively represent the phenomena being studied. From my perspective, these different instances of "tree" do not seem to relate to the same abstract concept that is translated or rotated but rather appear to share only a linguistic label. For instance, the conceptual substance of a family tree is markedly different from that of an oak tree, lacking significant overlap in meaning or structure. Thus, to me, these examples do not demonstrate invariance to transformations such as rotations.

      To elaborate further, typically, generalization involves recognizing the same object or concept through transformations. In the case of abstract concepts, this would imply a shared abstract representation rather than a mere linguistic category. While I understand the objective of the experiments and acknowledge their potential significance, I find myself wondering about the real-world applicability and relevance of such generalizations in everyday cognitive functioning. This, in turn, casts some doubt on the broader relevance of the study's results. A more fitting example, or an explanation that addresses my concerns about the suitability of the current examples, would be beneficial to further clarify the study's intent and scope.

      Response in the public review.

      • Relatedly, the manuscript could benefit from greater clarity in defining key concepts and elucidating the proposed mechanism behind the observed effects. Is it plausible that the changes observed are primarily due to shifts in attention induced by the spatial pre-training, rather than a change in the process of learning abstract conceptual invariances (i.e., modifications to the abstract representations themselves)? While the authors conclude that spatial pre-training acts as a scaffold for enhancing the learning of conceptual invariances, it raises the question: does this imply participants simply became more focused on spatial relationships during learning, or might this shift in attention represent a distinct strategy, and an alternative explanation? A more precise definition of these concepts and a clearer explanation of the authors' perspective on the mechanism underlying these effects would reduce any ambiguity in this regard.

      Response in the public review.

      • I am wondering whether the effectiveness of spatial representations in generalizing abstract concepts stems from their special nature or simply because they are a familiar 2D space for participants. It is well-established that memory benefits from linking items to familiar locations, a technique used in memory training (method of loci). This raises the question: Are we observing a similar effect here, where spatial dimensions are the only tested familiar 2D spaces, while the other 2 spaces are simply unfamiliar, as also suggested by the lower performance during training (Fig.2)? Would the results be replicable with another well-learned, robustly encoded domain, such as auditory dimensions for professional musicians, or is there something inherently unique about spatial representations that aids in bootstrapping abstract representations?

      On the other side of the same coin, are spatial representations qualitatively different, or simply more efficient because they are learned more quickly and readily? This leads to the consideration that if visual pre-training and visual-to-auditory mapping were continued until a similar proficiency level as in spatial training is achieved, we might observe comparable performance in aiding generalization. Thus, the conclusion that spatial representations are a special scaffold for abstract concepts may not be exclusively due to their inherent spatial nature, but rather to the general characteristic of well-established representations. This hypothesis could be further explored by either identifying alternative 2D representations that are equally well-learned or by extending training in visual or auditory representations before proceeding with the mapping task. At the very least I believe this potential explanation should be explored in the discussion section.

      Response in the public review.

      I had some difficulty in following an important section of the introduction: "... whether participants can learn rotationally invariant concepts in nonspatial domains, i.e., those that are defined by sequences of visual and auditory features (rather than by locations in physical space, defined in Cartesian or polar coordinates) is not known." This was initially puzzling to me as the paragraph preceding it mentions: "There is already good evidence that nonspatial concepts are represented in a translation invariant format." While I now understand that the essential distinction here is between translation and rotation, this was not immediately apparent upon first reading. This crucial distinction, especially in the context of conceptual spaces, was not clearly established before this point in the manuscript. For better clarity, it would be beneficial to explicitly contrast and define translation versus rotation in this particular section and stress that the present study concerns rotations in abstract spaces.

      Done.

      • The multi-modal association is crucial for the study, however to my knowledge, it is not depicted or well explained in the main text or figures (Results section). In my opinion, the details of this task should be explained and illustrated before the details of the associated results are discussed.

      We have included an illustration of a multimodal association trial in Fig. S3B.

      Author response image 2.

      • The observed correlation between the mapping task performance and the adoption of a 2D strategy is logical. However, this correlation might not exclusively indicate the proposed underlying mechanism of spatial scaffolding. Could it also be reflective of more general factors like overall performance, attention levels, or the effort exerted by participants? This alternative explanation suggests that the correlation might arise from broader cognitive engagement rather than specifically from the spatial nature of the task. Addressing this possibility could strengthen the argument for the unique role of spatial representations in learning abstract concepts, or at least this alternative interpretation should be mentioned.

      Response in the public review.

      • To me, the finding that ~30% of participants benefited from the spatial scaffolding effect for example in the auditory condition merely through exposure to the mapping (Fig 4D), without needing to see the quadruplets in the spatial modality, was somewhat surprising. This is particularly noteworthy considering that only ~60% of participants adopted the 2D strategy with exposure to rotated contingencies in Experiment 3 (Fig 3D). How do the authors interpret this outcome? It would be interesting to understand their perspective on why such a significant effect emerged from mere exposure to the mapping task.

      • I appreciate the clarity Fig.1 provides in explaining a challenging experimental setup. Is it possible to provide example trials, including an illustration that shows which rotations produce the trial and an intuitive explanation of which response maps onto the 1D vs 2D strategies respectively, to aid the reader in better understanding this core manipulation?

      • I like that the authors provide transparency by depicting individual subject's data points in their results figures (e.g. Figs. 2 B, F, J). However, with an n=~50 per condition, it becomes difficult to intuit the distribution, especially for conditions with higher variance (e.g., Auditory). The figures might be more easily interpretable with alternative methods of displaying variances, such as violin plots per data point, conventional error shading using 95%CIs, etc.

      • Why are the authors not reporting exact BFs in the results sections at least for the most important contrasts?

      • While I understand why the authors report the frequencies for the best model fits, this may become difficult to interpret in some sections, given the large number of reported values. Alternatives or additional summary statistics supporting inference could be beneficial.

      As the Reviewer states, there are a large number of figures that we can report in this study. We have chosen to keep this number at a minimum to be as clear as possible. Rather than plotting the distribution of individual data points, we have opted to display only the group's mean and standard error (the standard errors are included, but the substantial number of participants per condition provides precise estimates, resulting in error bars that can be smaller than the mean point). This decision stems from our concern that including additional details could lead to a cluttered representation with unnecessary complexity. Finally, we report what we believe to be the critical BFs for the comprehension of the reader in the main text, and choose a cutoff of 100 when BFs are high (corresponding to the label “decisive” evidence; some BFs are larger than 10<sup>12</sup>). All the exact BFs are in the supplementary material for interested readers.

    1. Author response:

      The following is the authors’ response to the original reviews.

      Reviewer #1 (Public Review):

      Summary:

      The manuscript considers a mechanistic extension of MacArthur's consumer-resource model to include chasing down food and potential encounters between the chasers (consumers) that lead to less efficient feeding in the form of negative feedback. After developing the model, a deterministic solution and two forms of stochastic solutions are presented, in agreement with each other. Finally, the model is applied to explain observed coexistence and rank-abundance data.

      We thank the reviewer for the accurate summary of our manuscript.

      Strengths:

      The application of the theory to natural rank-abundance curves is impressive. The comparison with the experiments that reject the competitive exclusion principle is promising. It would be fascinating to see if in, e.g. insects, the specific interference dynamics could be observed and quantified and whether they would agree with the model.

      The results are clearly presented; the methods adequately described; the supplement is rich with details.

      There is much scope to build upon this expansion of the theory of consumer-resource models. This work can open up new avenues of research.

      We appreciate the reviewer for the very positive comments. We have followed many of the suggestions raised by the reviewer, and the manuscript is much improved as a result.

      Following the reviewer’s suggestions, we have now used Shannon entropies to quantify the model comparison with experiments that reject the Competitive Exclusion Principle (CEP). Specifically, for each time point of each experimental or model-simulated community, we calculated the Shannon entropies using the formula:

      H(t) = −Σ<sub>i</sub> P<sub>i</sub>(t) log P<sub>i</sub>(t), where P<sub>i</sub>(t) is the probability that a consumer individual belongs to species C<sub>i</sub> at time t. The comparison of the Shannon entropy time series between the experimental data and the SSA results shown in Fig. 2D-E is presented in Appendix-fig. 7C-D. The time averages and standard deviations (δH) of the Shannon entropies for these experimental or SSA model-simulated communities are as follows:

      [...]

      Meanwhile, we have calculated the time averages and standard deviations (δC<sub>i</sub>) of the species’ relative/absolute abundances for the experimental or SSA model-simulated communities shown in Fig. 2D-E, which are as follows:

      [...], where the superscript “(R)” represents relative abundances.

      From the results of Shannon entropies shown in Author response image 1 (which are identical to those of Appendix-fig. 7C-D) and the quantitative comparison of the time average and standard deviation between the model and experiments presented above, it is evident that the model results in Fig. 2D-E exhibit good consistency with the experimental data. They share roughly identical time averages and standard deviations in both Shannon entropies and the species' relative/absolute abundances for most of the comparisons. All these analyses are included in the appendices and mentioned in the main text.

      Author response image 1.

      Shannon Entropies of the experimental data and SSA results in Fig. 2D-E, redrawn from Appendix-fig. 7C-D.
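The entropy comparison described here can be sketched in a few lines. This is a minimal illustration, not the authors' analysis code: the logarithm base (2 by default), the use of the population standard deviation, and the abundance values in `series` are all assumptions made for the example.

```python
import math
from statistics import mean, pstdev


def shannon_entropy(abundances, base=2.0):
    """Shannon entropy of one community snapshot.

    P_i is the fraction of consumer individuals belonging to species i.
    Species with zero abundance contribute nothing to the sum.
    """
    total = sum(abundances)
    if total <= 0:
        raise ValueError("total abundance must be positive")
    h = 0.0
    for c in abundances:
        if c > 0:
            p = c / total
            h -= p * math.log(p, base)
    return h


def entropy_summary(time_series, base=2.0):
    """Time average and standard deviation of H(t) over a series of snapshots."""
    values = [shannon_entropy(snapshot, base) for snapshot in time_series]
    return mean(values), pstdev(values)


# Hypothetical two-species abundances at three time points (not real data).
series = [[50, 50], [80, 20], [65, 35]]
h_mean, h_std = entropy_summary(series)
print(f"mean H = {h_mean:.3f}, delta H = {h_std:.3f}")
```

Applying `entropy_summary` to the experimental series and to the simulated series gives the two pairs of numbers (time average and δH) that the comparison rests on.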

      Weaknesses:

      I am questioning the use of carrying capacity (Eq. 4) instead of using nutrient limitation directly through Monod consumption (e.g. Posfai et al. who the authors cite). I am curious to see how these results hold or are changed when Monod consumption is used.

      We thank the reviewer for raising this question. To explain it more clearly, the equation combining the third equation in Eq. 1 and Eq. 4 of our manuscript is presented below as Eq. R1:

      where x<sub>il</sub> represents the population abundance of the chasing pair C<sub>i</sub><sup>(P)</sup> ∨ R<sub>l</sub><sup>(P)</sup>, and κ<sub>l</sub> stands for the steady-state population abundance of species R<sub>l</sub> (the carrying capacity) in the absence of consumer species. With no consumer species, x<sub>il</sub> = 0 since C<sub>i</sub> = 0 (i = 1,…,S<sub>C</sub>); thus R<sub>l</sub> = κ<sub>l</sub> at steady state (dR<sub>l</sub>/dt = 0).

      Eq. R1 for the case of abiotic resources is comparable to Eq. (1) in Posfai et al., which we present below as Eq. R2:

      where c<sub>i</sub> represents the concentration of nutrient i, and thus corresponds to our R<sub>l</sub>; n<sub>σ</sub>(t) is the population of species σ, which corresponds to our C<sub>i</sub>; s<sub>i</sub> stands for the nutrient supply rate, which corresponds to our ζ<sub>l</sub>; µ<sub>i</sub> denotes the nutrient loss rate, corresponding to our [...]; [...] is the coefficient of the rate of species σ for consuming nutrient i, which corresponds to our [...]; and [...] in Posfai et al. is the consumption rate of nutrient i by the population of species σ, which corresponds to our x<sub>il</sub>.

      In Posfai et al., [...] is the Monod function: [...], and thus [...]

      In our model, however, since predator interference is not involved in Posfai et al., we need to analyze the form of x<sub>il</sub> presented in the functional form of x<sub>il</sub> ({R<sub>l</sub>},{C<sub>i</sub>}) in the case involving only chasing pairs. Specifically, for the case of abiotic resources, the population dynamics can be described by Eq. 1 combined with Eq. R1:

      where [...] and [...]. For convenience, we consider the case of S<sub>R</sub> = 1, for which the Monod form was derived (Monod, J. (1949). Annu. Rev. Microbiol., 3, 371-394.). From [...], we have

      where [...], and l = 1. If the population abundance of the resource species is much larger than that of all consumer species (i.e., [...]), then,

      and R<sub>l</sub><sup>(F)</sup> ≈ R<sub>l</sub>. Combined with Eq. R5, and noting that C<sub>i</sub> = C<sub>i</sub><sup>(F)</sup> + x<sub>il</sub>, we can solve for x<sub>il</sub>:

      with l = 1 since S<sub>R</sub> = 1. Comparing Eq. R6 with Eq. R3, and considering the symbol correspondence explained in the text above, it is now clear that our model reduces to the Monod consumption form in the case of S<sub>R</sub> = 1, for which the Monod form was derived.
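The reduction to the Monod form can be sketched with the standard quasi-steady-state argument for chasing pairs, for S<sub>R</sub> = 1. The encounter rate a<sub>i</sub>, the escape rate d<sub>i</sub> and the half-saturation constant K<sub>i</sub> are symbols assumed here for illustration and may differ from the manuscript's notation; only k, x, C, R and the superscripts (F) and (P) are taken from the response.

```latex
\begin{align}
% Pair formation balances pair dissociation plus consumption at steady state
\frac{dx_{i1}}{dt} &= a_{i}\, C_i^{(F)} R_1^{(F)} - (d_{i} + k_{i1})\, x_{i1} = 0, \\
% Consumers are either free or in a pair; resources are abundant
C_i &= C_i^{(F)} + x_{i1}, \qquad R_1^{(F)} \approx R_1
      \quad \Bigl(R_1 \gg \textstyle\sum_i C_i\Bigr), \\
% Solving for the pair abundance gives a Monod-type saturation in R_1
\Rightarrow\quad x_{i1} &= \frac{C_i\, R_1}{R_1 + K_{i}},
      \qquad K_{i} \equiv \frac{d_{i} + k_{i1}}{a_{i}} .
\end{align}
```

Under these assumptions the consumption rate k<sub>i1</sub>x<sub>i1</sub> saturates with resource abundance in the same way as the Monod function.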

      Following on the previous comment, I am confused by the fact that the nutrient consumption term in Eq. 1 and how growth is modeled (Eq. 4) are not obviously compatible and would be hard to match directly to experimentally accessible quantities such as yield (nutrient to biomass conversion ratio). Ultimately, there is a conservation of mass ("flux balance"), and therefore the dynamics must obey it. I don't quite see how conservation of mass is imposed in this work.

      We thank the reviewer for raising this question. Indeed, the population dynamics of our model must adhere to flux balance, with the most pertinent equation restated here as Eq. R7:

      Below is the explanation of how Eq. R7, and thus Eqs. 1 and 4 of our manuscript, adhere to the constraint of flux balance. The interactions and fluxes between consumer and resource species occur solely through chasing pairs. At the population level, the scenario of chasing pairs among consumer species C<sub>i</sub> and resource species R<sub>l</sub> is presented in the follow expression:

      where the superscripts "(F)" and "(P)" represent the freely wandering individuals and those involved in chasing pairs, respectively, "(+)" stands for the gaining biomass of consumer C<sub>i</sub> from resource R<sub>l</sub>. In our manuscript, we use x<sub>l</sub> to represent the population abundance (or equivalently, the concentration, for a well-mixed system with a given size) of the chasing pair C<sub>i</sub><sup>(P)</sup> ∨ R<sub>l</sub><sup>(P)</sup>, and thus, the net flow from resource species R<sub>l</sub> to consumer species C<sub>i</sub> per unit time is k<sub>il</sub>x<sub>il</sub>. Noting that there is only one R<sub>l</sub> individual within the chasing pair C<sub>i</sub><sup>(P)</sup> ∨ R<sub>l</sub><sup>(P)</sup>, then the net effect on the population dynamics of species is −k<sub>il</sub>x<sub>il</sub>. However, since a consumer individual from species C<sub>i</sub> could be much heavier than a species R<sub>l</sub> individual, and energy dissipation would be involved from nutrient conversion into biomass, we introduce a mass conversion ratio w<sub>l</sub> in our manuscript. For example, if a species C<sub>i</sub> individual is ten times the weight of a species R<sub>l</sub> individual, without energy dissipation, the mass conversion ratio wil should be 1/10 (i.e., wil \= 0.1 ), however, if half of the chemical energy is dissipated into heat from nutrient conversion into biomass, then w<sub>l</sub> \= 0.1 0.5× = 0.05. Consequently, the net effect of the flux from resource species _R_l to consumer species C<sub>i</sub> per unit time on the population dynamics is , and flux balance is clearly satisfied.

      For the population dynamics of a consumer species C<sub>i</sub>, we need to consider all the biomass influx from different resource species, and thus there is a summation over all species of resources, which leads to the term Σ<sub>l</sub> w<sub>il</sub>k<sub>il</sub>x<sub>il</sub> in Eq. R7. Similarly, for the population dynamics of a resource species R<sub>l</sub>, we need to sum all the biomass outflow into different consumer species, resulting in the term −Σ<sub>i</sub> k<sub>il</sub>x<sub>il</sub> in Eq. R7.

      Consequently, Eq. R7 and our model satisfy the constraint of flux balance.
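The bookkeeping described above can be sketched numerically. This is a minimal illustration only: the nested lists `k`, `x`, and `w` (rows indexed by consumer species i, columns by resource species l) and all function names are hypothetical stand-ins for k<sub>il</sub>, x<sub>il</sub>, and w<sub>il</sub>, and the numbers are arbitrary rather than taken from the manuscript.

```python
from typing import List

# Rows index consumer species i, columns index resource species l.
Matrix = List[List[float]]


def mass_conversion_ratio(weight_ratio: float, dissipated_fraction: float) -> float:
    """Illustrative w_il: (resource weight / consumer weight) times the
    fraction of energy that is NOT dissipated during conversion."""
    return weight_ratio * (1.0 - dissipated_fraction)


def consumer_influx(w: Matrix, k: Matrix, x: Matrix) -> List[float]:
    """Biomass gained per unit time by each consumer i: sum over l of w*k*x."""
    return [
        sum(w[i][l] * k[i][l] * x[i][l] for l in range(len(x[i])))
        for i in range(len(x))
    ]


def resource_outflow(k: Matrix, x: Matrix) -> List[float]:
    """Individuals lost per unit time by each resource l: sum over i of k*x."""
    n_resources = len(x[0])
    return [
        sum(k[i][l] * x[i][l] for i in range(len(x)))
        for l in range(n_resources)
    ]


# Hypothetical example: two consumer species sharing one resource species.
w_value = mass_conversion_ratio(weight_ratio=0.1, dissipated_fraction=0.5)
k = [[0.1], [0.2]]
x = [[10.0], [5.0]]
w = [[w_value], [w_value]]

influx = consumer_influx(w, k, x)
outflow = resource_outflow(k, x)
```

With a single shared conversion ratio, the total consumer influx equals that ratio times the total resource outflow, which is the flux-balance statement in this simplified setting.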

      These models could be better constrained by more data, in principle, thereby potential exists for a more compelling case of the relevance of this interference mechanism to natural systems.

      We thank the reviewer for raising this question. Indeed, our model could benefit from the inclusion of more experimental data. In our manuscript, we primarily set the parameters by estimating their reasonable range. Following the reviewer's suggestions, we have now specified the data we used to set the parameters. For example, in Fig. 2D, we set 𝐷<sub>2</sub> = 0.01 with τ = 0.4 days, resulting in an expected lifespan of Drosophila serrata in our model setting of 𝜏⁄𝐷<sub>2</sub> = 40 days, which roughly agrees with experimental data showing that the average lifespan of D. serrata is 34 days for males and 54 days for females (lines 321-325 in the appendices; reference: Narayan et al. J Evol Biol. 35: 657–663 (2022)). To explain biodiversity and quantitatively illustrate the rank-abundance curves across diverse communities, the competitive differences across consumer species, exemplified by the coefficient of variation of the mortality rates - a key parameter influencing the rank-abundance curve, were estimated from experimental data in the reference article (Patricia Menon et al., Water Research (2003) 37, 4151) using the two-sigma rule (lines 344-347 in the appendices).
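As a quick arithmetic check of the lifespan estimate quoted above, the following short sketch assumes that the mortality rate 𝐷<sub>2</sub> is expressed per time unit τ, so that the expected lifespan is τ/𝐷<sub>2</sub>. The function and variable names are illustrative and do not come from the manuscript's code.

```python
def expected_lifespan_days(tau_days: float, mortality_rate: float) -> float:
    """Expected lifespan in days, assuming mortality is a rate per time unit tau."""
    return tau_days / mortality_rate


# Values quoted in the response for Fig. 2D.
lifespan = expected_lifespan_days(tau_days=0.4, mortality_rate=0.01)

# Reported average lifespans of D. serrata (males, females), in days.
reported_male, reported_female = 34.0, 54.0
within_reported_range = reported_male <= lifespan <= reported_female
```

The model value falls between the reported male and female averages, which is the sense in which the response describes the agreement as rough.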

      Still, we admit that many factors other than intraspecific interference, such as temporal variation, spatial heterogeneity, etc., are involved in breaking the limits of CEP in natural systems, and it is still challenging to differentiate each contribution in wild systems. However, for the two classical experiments that break CEP (Francisco Ayala, 1969; Thomas Park, 1954), intraspecific interference could probably be the most relevant mechanism, since factors such as temporal variation, spatial heterogeneity, cross-feeding, and metabolic tradeoffs are not involved in those two experimental systems.

      The underlying frameworks, B-D and MacArthur are not properly exposed in the introduction, and as a result, it is not obvious what is the specific contribution in this work as opposed to existing literature. One needs to dig into the literature a bit for that.

      The specific contribution exists, but it might be more clearly separated and better explained. In the process, the introduction could be expanded a bit to make the paper more accessible, by reviewing key features from the literature that are used in this manuscript.

      We thank the reviewer for these very insightful suggestions. Following these suggestions, we have now added a new paragraph and revised the introduction part of our manuscript (lines 51-67 in the main text) to address the relevant issues. Our paper is much improved as a result.

      Reviewer #2 (Public Review):

      Summary:

      The manuscript by Kang et al investigates how the consideration of pairwise encounters (consumer-resource chasing, intraspecific consumer pair, and interspecific consumer pair) influences the community assembly results. To explore this, they presented a new model that considers pairwise encounters and intraspecific interference among consumer individuals, which is an extension of the classical Beddington-DeAngelis (BD) phenomenological model, incorporating detailed considerations of pairwise encounters and intraspecific interference among consumer individuals. Later, they connected with several experimental datasets.

      Strengths:

      They found that the negative feedback loop created by the intraspecific interference allows a diverse range of consumer species to coexist with only one or a few types of resources. Additionally, they showed that some patterns of their model agree with experimental data, including time-series trajectories of two small in-lab community experiments and the rank-abundance curves from several natural communities. The presented results here are interesting and present another way to explain how the community overcomes the competitive exclusion principle.

      We appreciate the reviewer for the positive comments and the accurate summary of our manuscript.

      Weaknesses:

      The authors only explore the case with interspecific interference or intraspecific interference exists. I believe they need to systematically investigate the case when both interspecific and intraspecific interference exists. In addition, the text description, figures, and mathematical notations have to be improved to enhance the article's readability. I believe this manuscript can be improved by addressing my comments, which I describe in more detail below.

      We thank the reviewer for these valuable suggestions. We have followed many of the suggestions raised by the reviewer, and the manuscript is much improved as a result.

      (1) In nature, it is really hard for me to believe that only interspecific interference or intraspecific interference exists. I think a hybrid between interspecific interference and intraspecific interference is very likely. What would happen if both the interspecific and intraspecific interference existed at the same time but with different encounter rates? Maybe the authors can systematically explore the hybrid between the two mechanisms by changing their encounter rates. I would appreciate it if the authors could explore this route.

      We thank the reviewer for raising this question. Indeed, interspecific interference and intraspecific interference simultaneously exist in real cases. To differentiate the separate contributions of inter- and intra-specific interference on biodiversity, we considered different scenarios involving inter- or intra-specific interference. In fact, we have also considered the scenario involving both inter- and intra-specific interference in our old version for the case of S<sub>C</sub> = 2 and S<sub>R</sub> = 1, where two consumer species compete for one resource species (Appendix-fig. 5, and lines 147-148, 162-163 in the main text of the old version, or lines 160-161, 175-177 in the new version).

      Following the reviewer’s suggestions, we have now systematically investigated the cases of S<sub>C</sub> = 6, S<sub>R</sub> = 1, and S<sub>C</sub> = 20, S<sub>R</sub> = 1, where six or twenty consumer species compete for one resource species in scenarios involving chasing pairs and both inter- and intra-specific interference using both ordinary differential equations (ODEs) and stochastic simulation algorithm (SSA). These newly added ODE and SSA results are shown in Appendix-fig. 5 F-H, and we have added a new paragraph to describe these results in our manuscript (lines 212-215 in the main text). Consistent with our findings in the case of S<sub>C</sub> = 2 and S<sub>R</sub> = 1, the species coexistence behavior in the cases of both S<sub>C</sub> = 6, S<sub>R</sub> = 1, and S<sub>C</sub> = 20, S<sub>R</sub> = 1 is very similar to those without interspecific interference: all consumer species coexist with one type of resources at constant population densities in the ODE studies, and the SSA results fluctuate around the population dynamics of the ODEs.

      As for the encounter rates of interspecific and intraspecific interference, in fact, in a well-mixed system, these encounter rates can be derived from the mobility rates of the consumer species using the mean field method. For a system with a size of L<sup>2</sup>, the interspecific encounter rate between consumer species C<sub>i</sub> and C<sub>j</sub> (i ≠ j) is a’<sub>ij</sub> (please refer to lines 100-102, 293-317 in the main text, and see also Appendix-fig. 1), where r<sup>(I)</sup> is the upper distance for interference, while v<sub>C<sub>i</sub></sub> and v<sub>C<sub>j</sub></sub> represent the mobility rates of species C<sub>i</sub> and C<sub>j</sub>, respectively. Meanwhile, the intraspecific encounter rates within species C<sub>i</sub> and species C<sub>j</sub> are a’<sub>ii</sub> and a’<sub>jj</sub>, respectively.

      Thus, once the intraspecific encounter rates a’<sub>ii</sub> and a’<sub>jj</sub> are given, the interspecific encounter rate between species C<sub>i</sub> and C<sub>j</sub> is determined. Consequently, we could not tune the encounter rates of interspecific and intraspecific interference at will in our study, especially since, for clarity, we have used the mortality rate as the only parameter that varies among the consumer species throughout this study. Instead, we have systematically analyzed the influence of varying the separation rate and escape rate on species coexistence in the case of two consumers competing for a single type of resource (see Appendix-fig. 5A).

      (2) In the first two paragraphs of the introduction, the authors describe the competitive exclusion principle (CEP) and past attempts to overcome the CEP. Moving on from the first two paragraphs to the third paragraph, I think there is a gap that needs to be filled to make the transition smoother and help readers understand the motivations. More specifically, I think the authors need to add one more paragraph dedicated to explaining why predator interference is important, how considering the mechanism of predator interference may help overcome the CEP, and whether predator interference has been investigated or under-investigated in the past. Then building upon the more detailed introduction and movement of predator interference, the authors may briefly introduce the classical B-D phenomenological model and what are the conventional results derived from the classical B-D model as well as how they intend to extend the B-D model to consider the pairwise encounters.

      We thank the reviewer for these very insightful suggestions. Following these suggestions, we have added a new paragraph and revised the introduction part of our paper (lines 51-67 in the main text). Our manuscript is significantly improved as a result.

      (3) The notations for the species abundances are not very informative. I believe some improvements can be made to make them more meaningful. For example, I think using Greek letters for consumers and English letters for resources might improve readability. Some sub-scripts are not necessary. For instance, R^(l)_0 can be simplified to g_l to denote the intrinsic growth rate of resource l. Similarly, K^(l)_0 can be simplified to K_l. Another example is R^(l)_a, which can be simplified to s_l to denote the supply rate. In addition, right now, it is hard to find all definitions across the text. I would suggest adding a separate illustrative box with all mathematical equations and explanations of symbols.

      We thank the reviewer for these very useful suggestions. We have now followed many of the suggestions to improve the readability of our manuscript. Given that we have used many English letters for consumers and there are already many symbols of English and Greek letters for different variables and parameters in the appendices, we have opted to use Greek letters for parameters specific to resource species and English letters for those specific to consumer species. Additionally, we have now added Appendix-tables 1-2 in the appendices (pages 16-17 in the appendices) to illustrate the symbols used throughout our manuscript.

      (4) What is the f_i(R^(F)) on line 131? Does it refer to the growth rate of C_i? I noticed that f_i(R^(F)) is defined in the supplementary information. But please ensure that readers can understand it even without reading the supplementary information. Otherwise, please directly refer to the supplementary information when f_i(R^(F)) occurs for the first time. Similarly, I don't think the readers can understand \Omega^\prime_i and G^\prime_i on lines 135-136.

      We thank the reviewer for raising these questions. We apologize for not illustrating those symbols and functions clearly enough in our previous version of the manuscript. f<sub>i</sub>(R<sup>(F)</sup>) is a function of the variable R<sup>(F)</sup> with the index i, which is defined as and for i=2. Following the reviewer’s suggestions, we have now added clear definitions for symbols and functions and resolved these issues. The definitions of \Omega_i, \Omega^\prime_i, G, and G^\prime are overly complex, and hence we directly refer to the Appendices when they occur for the first time in the main text.

      Reviewer #3 (Public Review):

      Summary:

      A central question in ecology is: Why are there so many species? This question gained heightened interest after the development of influential models in theoretical ecology in the 1960s, demonstrating that under certain conditions, two consumer species cannot coexist on the same resource. Since then, several mechanisms have been shown to be capable of breaking the competitive exclusion principle (although, we still lack a general understanding of the relative importance of the various mechanisms in promoting biodiversity).

      One mechanism that allows for breaking the competitive exclusion principle is predator interference. The Beddington-DeAngelis is a simple model that accounts for predator interference in the functional response of a predator. The B-D model is based on the idea that when two predators encounter one another, they waste some time engaging with one another which could otherwise be used to search for resources. While the model has been influential in theoretical ecology, it has also been criticized at times for several unusual assumptions, most critically, that predators interfere with each other regardless of whether they are already engaged in another interaction. However, there has been considerable work since then which has sought either to find sets of assumptions that lead to the B-D equation or to derive alternative equations from a more realistic set of assumptions (Ruxton et al. 1992; Cosner et al. 1999; Broom et al. 2010; Geritz and Gyllenberg 2012). This paper represents another attempt to more rigorously derive a model of predator interference by borrowing concepts from chemical reaction kinetics (the approach is similar to previous work: Ruxton et al. 1992). The main point of difference is that the model in the current manuscript allows for 'chasing pairs', where a predator and prey engage with one another to the exclusion of other interactions, a situation Ruxton et al. (1992) do not consider. While the resulting functional response is quite complex, the authors show that under certain conditions, one can get an analytical expression for the functional response of a predator as a function of predator and resource densities. They then go on to show that including intraspecific interference allows for the coexistence of multiple species on one or a few resources, and demonstrate that this result is robust to demographic stochasticity.

      We thank the reviewer for carefully reading our manuscript and for the positive comments on the rigorously derived model of predator interference presented in our paper. We also appreciate the reviewer for providing a thorough introduction to the research background of our study, especially the studies related to the Beddington-DeAngelis model. We apologize for our oversight in not fully appreciating the related study by Ruxton et al. (1992) at the time of our first submission. Indeed, as suggested by the reviewer, Ruxton et al. (1992) is relevant to our study in that we both borrowed concepts from chemical reaction kinetics. We have now reworked the introduction and discussion sections of our manuscript, and cited and acknowledged the contributions of related works, including Ruxton et al. (1992).

      Strengths:

      I appreciate the effort to rigorously derive interaction rates from models of individual behaviors. As currently applied, functional responses (FRs) are estimated by fitting equations to feeding rate data across a range of prey or predator densities. In practice, such experiments are only possible for a limited set of species. This is problematic because whether a particular FR allows stability or coexistence depends on not just its functional form, but also its parameter values. The promise of the approach taken here is that one might be able to derive the functional response parameters of a particular predator species from species traits or more readily measurable behavioral data.

      We appreciate the reviewer's positive comments regarding the rigorous derivation of our model. Indeed, all parameters of our model can be derived from measurable behavioral data for a specific set of predator species.

      Weaknesses:

      The main weakness of this paper is that it devotes the vast majority of its length to demonstrating results that are already widely known in ecology. We have known for some time that predator interference can relax the CEP (e.g., Cantrell, R. S., Cosner, C., & Ruan, S. 2004).

      While the model presented in this paper differs from the functional form of the B-D in some cases, it would be difficult to formulate a model that includes intraspecific interference (that increases with predator density) that does not allow for coexistence under some parameter range. Thus, I find it strange that most of the main text of the paper deals with demonstrating that predator interference allows for coexistence, given that this result is already well known. A more useful contribution would focus on the extent to which the dynamics of this model differ from those of the B-D model.

      We appreciate the reviewer for raising this question and apologize for not sufficiently clarifying the contribution of our manuscript in the context of existing knowledge upon our initial submission. We have now significantly revised the introduction part of our manuscript (lines 51-67 in the main text) to make this clearer. Indeed, with the application of the Beddington-DeAngelis (B-D) model, several studies (e.g., Cantrell, R. S., Cosner, C., & Ruan, S. 2004) have already shown that intraspecific interference promotes species coexistence, and it is certain that the mechanism of intraspecific interference could lead to species coexistence if modeled correctly. However, while we acknowledge that the B-D model is a brilliant phenomenological model of intraspecific interference, for the specific research topic of our manuscript on breaking the CEP and explaining the paradox of the plankton, the validity of applying the B-D model to obtain compelling results is highly questionable.

      Specifically, the functional response in the B-D model of intraspecific interference can be formally derived from the scenario involving only chasing pairs without consideration of pairwise encounters between consumer individuals (Eq. S8 in Appendices; related references: Gert Huisman, Rob J De Boer, J. Theor. Biol. 185, 389 (1997) and Xin Wang and Yang-Yu Liu, iScience 23, 101009 (2020)). Since we have demonstrated that the scenario involving only chasing pairs is under the constraint of CEP (see lines 139-144 in the main text and Appendix-fig. 3A-C; related references: Xin Wang and Yang-Yu Liu, iScience 23, 101009 (2020)), and given the identical functional response mentioned above, the validity of the studies relying on the B-D model to break CEP or explain the paradox of the plankton is thus highly questionable.

      Consequently, one of the major objectives of our manuscript is to resolve whether the mechanism of intraspecific interference can truly break CEP and explain the paradox of the plankton in a rigorous manner. By modeling intraspecific predator interference from a mechanistic perspective and applying rigorous mathematical analysis and numerical simulations, our work resolves these issues and demonstrates that intraspecific interference enables a wide range of consumer species to coexist with only one or a handful of resource species. This naturally breaks CEP, explains the paradox of plankton, and quantitatively illustrates a broad spectrum of experimental results.

      For intuitive understanding, we introduced a functional response in our model (presented as Eq. 5 in the main text), which indeed involves approximations. However, to rigorously break the CEP or explain the paradox of plankton, all simulation results in our study were directly derived from equations 1 to 4 (main text), without relying on the approximate functional response presented in Eq. 5.

      The formulation of chasing-pair engagements assumes that prey being chased by a predator are unavailable to other predators. For one, this seems inconsistent with the ecology of most predator-prey systems. In the system in which I work (coral reef fishes), prey under attack by one predator are much more likely to be attacked by other predators (whether it be a predator of the same species or otherwise). I find it challenging to think of a mechanism that would give rise to chased prey being unavailable to other predators. The authors also critique the B-D model: "However, the functional response of the B-D model involving intraspecific interference can be formally derived from the scenario involving only chasing pairs without predator interference (Wang and Liu, 2020; Huisman and De Boer, 1997) (see Eqs. S8 and S24). Therefore, the validity of applying the B-D model to break the CEP is questionable.".

      We appreciate the reviewer for raising this question. We fully agree with the reviewer that in many predator-prey systems (e.g., coral reef fishes as mentioned by the reviewer, wolves, and even microbial species such as Myxococcus xanthus; related references: Berleman et al., FEMS Microbiol. Rev. 33, 942-957 (2009)), prey under attack by one predator can be targeted by another predator (which we term as a chasing triplet) or even by additional predator individuals (which we define as higher-order terms). However, since we have already demonstrated in a previous study (Xin Wang, Yang-Yu Liu, iScience 23, 101009 (2020)) from a mechanistic perspective that a scenario involving chasing triplets or higher-order terms can naturally break the CEP, while our manuscript focuses on whether pairwise encounters between individuals can break the CEP and explain the paradox of plankton, we deliberately excluded confounding factors that are already known to promote biodiversity, just as we excluded prevalent factors such as cross-feeding and temporal variations in our model.

      However, the way "chasing pairs" are formulated does result in predator interference because a predator attacking prey interferes with the ability of other predators to encounter the prey. I don't follow the author's logic that B-D isn't a valid explanation for coexistence because a model incorporating chasing pairs engagements results in the same functional form as B-D.

      We thank the reviewer for raising this question, and we apologize for not making this point clear enough at the time of our initial submission. We have now revised the related part of our manuscript (lines 56-62 in the main text) to make this clearer.

      In our definition, predator interference means the pairwise encounter between consumer individuals, while a chasing pair is formed by a pairwise encounter between a consumer individual and a resource individual. Thus, in these definitions, a scenario involving only chasing pairs does not involve pairwise encounters between consumer individuals (which is our definition of predator interference).

      We acknowledge that there can be different definitions of predator interference, and the reviewer's interpretation is based on a definition of predator interference that incorporates indirect interference without pairwise encounters between consumer individuals. We do not wish to argue about the appropriateness of definitions. However, since we have proven that scenarios involving only chasing pairs are under the constraint of CEP (see lines 139-144 in the main text and Appendix-fig. 3A-C; related references: Xin Wang and Yang-Yu Liu, iScience 23, 101009 (2020)), while the functional response of the B-D model can be derived from the scenario involving only chasing pairs without consideration of pairwise encounters between consumer individuals (Eq. S8 in Appendices; related references: Gert Huisman, Rob J De Boer, J. Theor. Biol. 185, 389 (1997) and Xin Wang and Yang-Yu Liu, iScience 23, 101009 (2020)), the validity of applying the B-D model to break CEP is thus highly questionable.

      More broadly, the specific functional form used to model predator interference is of secondary importance to the general insight that intraspecific interference (however it is modeled) can allow for coexistence. Mechanisms of predator interference are complex and vary substantially across species. Thus it is unlikely that any one specific functional form is generally applicable.

      We thank the reviewer for raising this issue. We agree that the general insight that intraspecific predator interference can facilitate species coexistence is of great importance. We also acknowledge that any functional form of a functional response is unlikely to be universally applicable, as explicit functional responses inevitably involve approximations. However, we must reemphasize the importance of verifying whether intraspecific predator interference can truly break CEP and explain the paradox of plankton, which is one of the primary objectives of our study. As mentioned above, since the B-D model can be derived from the scenario involving only chasing pairs (Eq. S8 in Appendices; related references: Gert Huisman, Rob J De Boer, J. Theor. Biol. 185, 389 (1997) and Xin Wang and Yang-Yu Liu, iScience 23, 101009 (2020)), while we have demonstrated that scenarios involving only chasing pairs are subject to the constraint of CEP (see lines 139-144 in the main text and Appendix-fig. 3A-C; related references: Xin Wang and Yang-Yu Liu, iScience 23, 101009 (2020)), the validity of applying the B-D model to break CEP is highly questionable.

      Recommendations for the authors:

      Reviewer #1 (Recommendations For The Authors):

      I do not see any code or data sharing. They should exist in a prominent place. The authors should make their simulations and the analysis scripts freely available to download, e.g. by GitHub. This is always true but especially so in a journal like eLife.

      We appreciate the reviewer for these recommendations. We apologize for our oversight regarding the unsuccessful upload of the data in our initial submission, as the data size was considerable and we neglected to double-check for this issue. Following the reviewer’s recommendation, we have now uploaded the code and dataset to GitHub (accessible at https://github.com/SchordK/Intraspecific-predator-interference-promotesbiodiversity-in-ecosystems), where they are freely available for download.

      The introduction section should include more background, including about BD but also about consumer-resource models. Part of the results section could be moved/edited to the introduction. You should try that the results section should contain only "new" stuff whereas the "old" stuff should go in the introduction.

      We thank the reviewer for these recommendations. Following these suggestions, we have now reorganized our manuscript by adding a new paragraph to the introduction section (lines 51-62 in the main text) and revising related content in both the introduction and results sections (lines 63-67, 81-83 in the main text).

      I found myself getting a little bogged down in the general/formal description of the model before you go to specific cases. I found the most interesting part of the paper to be its second half. This is a dangerous strategy, a casual reader may miss out on the most interesting part of the paper. It's your paper and do what you think is best, but my opinion is that you could improve the presentation of the model and background to get to the specific contribution and specific use case quickly and easily, then immediately to the data. You can leave the more general formulation and the details to later in the paper or even the appendix. Ultimately, you have a simple idea and a beautiful application on interesting data-that is your strength I think, and so, I would focus on that.

      We appreciate the reviewer for the positive comments and valuable suggestions. Following these recommendations, we have revised the presentation of the background information to clarify the contribution of our manuscript, and we have refined our model presentation to enhance clarity. Meanwhile, as we need to address the concerns raised by other reviewers, we continue to maintain systematic investigations for scenarios involving different forms of pairwise encounters in the case of S<sub>C</sub> = 2 and S<sub>R</sub> = 1 before applying our model to the experimental data.

      Reviewer #2 (Recommendations For The Authors):

      (1) I believe the surfaces in Figs. 1F-H corresponds to the zero-growth isoclines. The authors should directly point it out in the figure captions and text descriptions.

      We thank the reviewer for this suggestion, and we have followed it to address the issue.

      (2) After showing equations 1 or 2, I believe it will help readers understand the mechanism of equations by adding text such as "(see Fig. 1B)" to the sentences following the equations.

      We appreciate the reviewer's suggestion, and we have implemented it to address the issue.

      (3) Lines 12, 129 143 & 188: "at steady state" -> "at a steady state"

      (4) Line 138: "is doom to extinct" -> "is doomed to extinct"

      (5) Line 170: "intraspecific interference promotes species coexistence along with stochasticity" -> "intraspecific interference still robustly promotes species coexistence when stochasticity is considered"

      (6) Line 190: "The long-term coexistence behavior are exemplified" -> "The long-term coexistence behavior is exemplified"

      (7) Line 227: "the coefficient of variation was taken round 0.3" -> "the coefficient of variation was taken around 0.3"?

      (8) Line 235: "tend to extinct" -> "tend to be extinct"

      We thank the reviewer for all these suggestions, and we have implemented each of them to revise our manuscript.

      Reviewer #3 (Recommendations For The Authors):

      I think this would be a much more useful paper if the authors focused on how the behavior of this model differs from existing models rather than showing that the new formation also generates the same dynamics as the existing theory.

      We thank the reviewer for this suggestion, and we apologize for not explaining the limitations of the B-D model and the related studies on the topic of CEP clearly enough at the time of our initial submission. As we have explained in the responses above, we have now revised the introduction part of our manuscript (lines 51-67 in the main text) to make it clear that, since the functional response in the B-D model can be derived from the scenario involving only chasing pairs without consideration of pairwise encounters between consumer individuals, while we have demonstrated that a scenario involving only chasing pairs is under the constraint of CEP, the validity of the studies relying on the B-D model to break CEP or explain the paradox of the plankton is thus highly questionable. Consequently, one of the major objectives of our manuscript is to resolve whether the mechanism of intraspecific interference can truly break CEP and explain the paradox of the plankton in a rigorous manner. By modeling from a mechanistic perspective, we resolve the above issues and quantitatively illustrate a broad spectrum of experimental results, including two classical experiments that violate CEP and the rank-abundance curves across diverse ecological communities.

      Things that would be of interest:

      What are the conditions for coexistence in this model? Presumably, it depends heavily on the equilibrium abundances of the consumers and resources as well as the engagement times/rates.

      We thank the reviewer for raising this question. We have shown that there is a wide range of parameter space for species coexistence in our model. Specifically, for the case involving two consumer species and one resource species (S<sub>C</sub> = 2 and S<sub>R</sub> = 1), we have conducted a systematic study of the parameter region that promotes species coexistence. For clarity, we set the mortality rate 𝐷<sub>i</sub> (i = 1, 2) as the only parameter that varies with the consumer species, and the order of magnitude of all model parameters was estimated from behavioral data. The results for scenarios involving intraspecific predator interference are shown in Appendix-figs. 4B-D, 5A, 6C-D, and we redraw some of them here as Fig. R2, including both ODE and SSA results, wherein Δ = (𝐷<sub>1</sub>-𝐷<sub>2</sub>)/𝐷<sub>2</sub> represents the competitive difference between the two consumer species. For example, Δ = 1 means that species C<sub>2</sub> is twice as competitive as species C<sub>1</sub>. In Fig. R2 (see also Appendix-figs. 4B-D, 5A, 6C-D), we see that the two consumer species can coexist with a large competitive difference in both ODE and SSA simulation studies.

      Author response image 2.

      The parameter region for two consumer species coexisting with one type of abiotic resource species (S<sub>C</sub> = 2 and S<sub>R</sub> = 1). (A) The region below the blue surface and above the red surface represents stable coexistence of the three species at constant population densities. (B) The blue region represents stable coexistence at a steady state for the three species. (C) The color indicates (refer to the color bar) the coexisting fraction for long-term coexistence of the three species. Figure redrawn from Appendix-figs. 4B, 6C-D.

      For systems shown in Fig. 3A-D, where the number of consumer species is much larger than that of the resource species, we set each consumer species with unique competitiveness through a distinctive 𝐷<sub>i</sub> (i =1,…, S<sub>C</sub>). In Fig. 3A-D (see also Appendix fig. 10), we see that hundreds of consumer species may coexist with one or three types of resources when the coefficient of variation (CV) of the consumer species’ competitiveness was taken around 0.3, which indicates a large parameter region for promoting species coexistence.

      Is there existing data to estimate the parameters in the model directly from behavioral data? Do these parameter ranges support the hypothesis that predator interference is significant enough to allow for the coexistence of natural predator populations?

      We thank the reviewer for raising this question. Indeed, the parameters in our model were primarily determined by estimating their reasonable range from behavioral data. Following the reviewer's suggestions, we have now specified the data we used to set the parameters. For instance, in Fig. 2D, we set 𝐷<sub>2</sub> = 0.01 with τ = 0.4 day, resulting in an expected lifespan of Drosophila serrata in our model setting of 𝜏⁄𝐷<sub>2</sub> = 40 days, which roughly agrees with experimental behavioral data showing that the average lifespan of D. serrata is 34 days for males and 54 days for females (lines 321-325 in the appendices; reference: Narayan et al. J Evol Biol. 35: 657–663 (2022)). To account for competitive differences, we set the mortality rate as the only parameter that varies among the consumer species. As specified in the Appendices, the CV of the mortality rate is the only parameter that was used to fit the experiments within the range of 0.15-0.43. This parameter range (i.e., 0.15-0.43) was directly estimated from experimental data in the reference article (Patricia Menon et al., Water Research 37, 4151 (2003)) using the two-sigma rule (lines 344-347 in the appendices).
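
      The lifespan check quoted above is a one-line calculation, sketched here in Python. The function name and the comparison against the reported 34-54 day range are illustrative choices, not part of the authors' code; only the values τ = 0.4 day and 𝐷<sub>2</sub> = 0.01 come from the response.

```python
def expected_lifespan_days(tau_days, mortality_rate):
    """Expected lifespan tau / D, as used in the response for D. serrata."""
    if mortality_rate <= 0:
        raise ValueError("mortality rate must be positive")
    return tau_days / mortality_rate


# Values quoted in the response: tau = 0.4 day, D_2 = 0.01.
lifespan = expected_lifespan_days(tau_days=0.4, mortality_rate=0.01)

# Reported average lifespans (males, females), also quoted in the response.
reported_min_days, reported_max_days = 34, 54
within_reported_range = reported_min_days <= lifespan <= reported_max_days
print(f"model lifespan: {lifespan:.1f} days, within reported range: {within_reported_range}")
```

      The same helper could be reused to sanity-check other choices of 𝐷<sub>i</sub> against lifespan data, under the assumption that τ is held fixed.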

      Given the high consistency between the model results and experiments shown in Figs. 2D-E and 3C-D, where all the key model parameters were estimated from experimental data in references, and considering that the rank-abundance curves shown in Fig. 3C-D include a wide range of ecological communities, there is no doubt that predator interference is significant enough to allow for the coexistence of natural predator populations within the parameter ranges estimated from experimental references.

      Bifurcation analyses for the novel parameters of this model. Does the fact that prey can escape lead to qualitatively different model behaviors?

      Author response image 3.

      Bifurcation analyses for the separate rate d’<sub>i</sub> and escape rate d<sub>i</sub> (i = 1, 2) of our model in the case of two consumer species competing for one abiotic resource species (S<sub>C</sub> = 2 and S<sub>R</sub> = 1). (A) A 3D representation: the region above the blue surface signifies competitive exclusion, where species C<sub>1</sub> goes extinct, while the region below the blue surface and above the red surface represents stable coexistence of the three species at constant population densities. (B) A 2D representation: the blue region represents stable coexistence at a steady state for the three species. Figure redrawn from Appendix-fig. 4C-D.

      We thank the reviewer for this suggestion. Following this suggestion, we have conducted bifurcation analyses for the separate rate d’<sub>i</sub> and escape rate d<sub>i</sub> of our model in the case where two consumer species compete for one resource species (S<sub>C</sub> = 2 and S<sub>R</sub> = 1). Both 2D and 3D representations of these results have been included in Appendix-fig. 4, and we redraw them here as Fig. R3. In Fig. R3, we set the mortality rate 𝐷<sub>i</sub> (i = 1, 2) as the only parameter that varies between the consumer species, and thus Δ = (𝐷<sub>1</sub>-𝐷<sub>2</sub>)/𝐷<sub>2</sub> represents the competitive difference between the two species.

      As shown in Fig. R3A-B, the smaller the escape rate d<sub>i</sub>, the larger the competitive difference Δ tolerated for species coexistence at steady state. A similar trend is observed for the separate rate d’<sub>i</sub>. However, there is an abrupt change in both the 2D and 3D representations at d’<sub>i</sub> = 0, since if d’<sub>i</sub> = 0, all consumer individuals would be trapped in interference pairs, and then no consumer species could exist. By contrast, there is no abrupt change in either representation at d<sub>i</sub> = 0, since even if d<sub>i</sub> = 0, the consumer individuals could still leave the chasing pair through the capture process.

      Figures: I found the 3D plots especially Appendix Figure 2 very difficult to interpret. I think 2D plots with multiple lines to represent predator densities would be more clear.

      We thank the reviewer for this suggestion. Following this suggestion, we have added a 2D diagram to Appendix-fig. 2.

    1. Author response:

      The following is the authors’ response to the original reviews.

      eLife Assessment 

      The work introduces a valuable new method for depleting the ribosomal RNA from bacterial single-cell RNA sequencing libraries and shows that this method is applicable to studying the heterogeneity in microbial biofilms. The evidence for a small subpopulation of cells at the bottom of the biofilm which upregulates PdeI expression is solid. However, more investigation into the unresolved functional relationship between PdeI and c-di-GMP levels with the help of other genes co-expressed in the same cluster would have made the conclusions more significant. 

      Many thanks for eLife’s assessment of our manuscript and the constructive feedback. We are encouraged by the recognition of our bacterial single-cell RNA-seq methodology as valuable and its efficacy in studying bacterial population heterogeneity. We appreciate the suggestion for additional investigation into the functional relationship between PdeI and c-di-GMP levels. We concur that such an exploration could substantially enhance the impact of our conclusions. To address this, we have implemented the following revisions: We have expanded our data analysis to identify and characterize genes co-expressed with PdeI within the same cellular cluster (Fig. 3F, G, Response Fig. 10); We conducted additional experiments to validate the functional relationships between PdeI and c-di-GMP, followed by detailed phenotypic analyses (Response Fig. 9B). Our analysis reveals that while other marker genes in this cluster are co-expressed, they do not significantly impact biofilm formation or directly relate to c-di-GMP or PdeI. We believe these revisions have substantially enhanced the comprehensiveness and context of our manuscript, thereby reinforcing the significance of our discoveries related to microbial biofilms. The expanded investigation provides a more thorough understanding of the PdeI-associated subpopulation and its role in biofilm formation, addressing the concerns raised in the initial assessment.

      Public Reviews: 

      Reviewer #1 (Public Review): 

      Summary: 

      In this manuscript, Yan and colleagues introduce a modification to the previously published PETRI-seq bacterial single-cell protocol to include a ribosomal depletion step based on a DNA probe set that selectively hybridizes with ribosome-derived (rRNA) cDNA fragments. They show that their modification of the PETRI-seq protocol increases the fraction of informative non-rRNA reads from ~4-10% to 54-92%. The authors apply their protocol to investigating heterogeneity in a biofilm model of E. coli, and convincingly show how their technology can detect minority subpopulations within a complex community. 

      Strengths: 

      The method the authors propose is a straightforward and inexpensive modification of an established split-pool single-cell RNA-seq protocol that greatly increases its utility, and should be of interest to a wide community working in the field of bacterial single-cell RNA-seq. 

      Weaknesses: 

      The manuscript is written in a very compressed style and many technical details of the evaluations conducted are unclear and processed data has not been made available for evaluation, limiting the ability of the reader to independently judge the merits of the method. 

      Thank you for your thoughtful and constructive review of our manuscript. We appreciate your recognition of the strengths of our work and the potential impact of our modified PETRI-seq protocol on the field of bacterial single-cell RNA-seq. We are grateful for the opportunity to address your concerns and improve the clarity and accessibility of our manuscript.

      We acknowledge your feedback regarding the compressed writing style and lack of technical details, which are constrained by the requirements of the Short Report format in eLife. We have addressed these issues in our revised manuscript as follows:

      (1) Expanded methodology section: We have provided a more comprehensive description of our experimental procedures, including detailed protocols for the ribosomal depletion step (lines 435-453) and data analysis pipeline (lines 471-528). This will enable readers to better understand and potentially replicate our methods.

      (2) Clarification of technical evaluations: We have elaborated on the specifics of our evaluations, including the criteria used for assessing the efficiency of ribosomal depletion (lines 99-120), and the methods employed for identifying and characterizing subpopulations (lines 155-159, 161-163 and 163-167).

      (3) Data availability: We apologize for the oversight in not making our processed data readily available. We have deposited all relevant datasets, including raw and source data, in appropriate public repositories (GEO: GSE260458) and provide clear instructions for accessing this data in the revised manuscript.

      (4) Supplementary information: To maintain the concise nature of the main text while providing necessary details, we have included additional supplementary information. This will cover extended methodology (lines 311-318, 321-323, 327-340, 450-453, 533, and 578-589), detailed statistical analyses (lines 492-493, 499-501 and 509-528), and comprehensive data tables to support our findings.

      We believe these changes significantly improved the clarity and reproducibility of our work, allowing readers to better evaluate the merits of our method.

      Reviewer #2 (Public Review): 

      Summary: 

      This work introduces a new method of depleting the ribosomal reads from the single-cell RNA sequencing library prepared with one of the prokaryotic scRNA-seq techniques, PETRI-seq. The advance is very useful since it allows broader access to the technology by lowering the cost of sequencing. It also allows more transcript recovery with fewer sequencing reads. The authors demonstrate the utility and performance of the method for three different model species and find a subpopulation of cells in the E.coli biofilm that express a protein, PdeI, which causes elevated c-di-GMP levels. These cells were shown to be in a state that promotes persister formation in response to ampicillin treatment. 

      Strengths: 

      The introduced rRNA depletion method is highly efficient, with the depletion for E.coli resulting in over 90% of reads containing mRNA. The method is ready to use with existing PETRI-seq libraries which is a large advantage, given that no other rRNA depletion methods were published for split-pool bacterial scRNA-seq methods. Therefore, the value of the method for the field is high. There is also evidence that a small number of cells at the bottom of a static biofilm express PdeI which is causing the elevated c-di-GMP levels that are associated with persister formation. Given that PdeI is a phosphodiesterase, which is supposed to promote hydrolysis of c-di-GMP, this finding is unexpected. 

      Weaknesses: 

      With the descriptions and writing of the manuscript, it is hard to place the findings about the PdeI into existing context (i.e. it is well known that c-di-GMP is involved in biofilm development and is heterogeneously distributed in several species' biofilms; it is also known that E.coli diesterases regulate this second messenger, i.e. https://journals.asm.org/doi/full/10.1128/jb.00604-15). 

      There is also no explanation for the apparently contradictory upregulation of c-di-GMP in cells expressing higher PdeI levels. Perhaps the examination of the rest of the genes in cluster 2 of the biofilm sample could be useful to explain the observed association. 

      Thank you for your thoughtful and constructive review of our manuscript. We are pleased that the reviewer recognizes the value and efficiency of our rRNA depletion method for PETRI-seq, as well as its potential impact on the field. We would like to address the points raised by the reviewer and provide additional context and clarification regarding the function of PdeI in c-di-GMP regulation.

      We acknowledge that c-di-GMP’s role in biofilm development and its heterogeneous distribution in bacterial biofilms are well studied. We appreciate the reviewer's observation regarding the seemingly contradictory relationship between increased PdeI expression and elevated c-di-GMP levels. This is indeed an intriguing finding that warrants further explanation.

      PdeI is predicted to function as a phosphodiesterase involved in c-di-GMP degradation, based on sequence analysis demonstrating the presence of an intact EAL domain, which is known for this function. However, it is important to note that PdeI also harbors a divergent GGDEF domain, typically associated with c-di-GMP synthesis. This dual-domain structure indicates that PdeI may play complex regulatory roles. Previous studies have shown that knocking out the major phosphodiesterase PdeH in E. coli results in the accumulation of c-di-GMP. Moreover, introducing a point mutation (G412S) in PdeI's divergent GGDEF domain within this PdeH knockout background led to decreased c-di-GMP levels (ref. 2). This finding implies that the wild-type GGDEF domain in PdeI contributes to maintaining or increasing cellular c-di-GMP levels.

      Importantly, our single-cell experiments demonstrated a positive correlation between PdeI expression levels and c-di-GMP levels (Figure 4D). In this revision, we also constructed a PdeI(G412S)-BFP mutation strain. Notably, our observations of this strain revealed that c-di-GMP levels remained constant despite an increase in BFP fluorescence, which serves as a proxy for PdeI(G412S) expression levels (Figure 4D). This experimental evidence, coupled with domain analyses, suggests that PdeI may also contribute to c-di-GMP synthesis, rebutting the notion that it acts solely as a phosphodiesterase. HPLC LC-MS/MS analysis further confirmed that the overexpression of PdeI, induced by arabinose, resulted in increased c-di-GMP levels (Fig. 4E). These findings strongly suggest that PdeI plays a pivotal role in upregulating c-di-GMP levels.

      Our further analysis indicated that PdeI contains a CHASE (cyclases/histidine kinase-associated sensory) domain. Combined with our experimental results showing that PdeI is a membrane-associated protein, we hypothesize that PdeI acts as a sensor, integrating environmental signals with c-di-GMP production under complex regulatory mechanisms.

      We understand your interest in the other genes present in cluster 2 of the biofilm and their potential relationship to PdeI and c-di-GMP. Upon careful analysis, we have determined that the other marker genes in this cluster do not significantly impact biofilm formation, nor have we identified any direct relationship between these genes, c-di-GMP, or PdeI. Our focus on PdeI within this cluster is justified by its unique and significant role in c-di-GMP regulation and biofilm formation, as demonstrated by our experimental results. While other genes in this cluster may be co-expressed, their functions appear unrelated to the PdeI-c-di-GMP pathway we are investigating. Therefore, we opted not to elaborate on these genes in our main discussion, as they do not contribute directly to our understanding of the PdeI-c-di-GMP association. However, we can include a brief mention of these genes in the manuscript, indicating their lack of relevance to the PdeI-c-di-GMP pathway. This addition will provide a more comprehensive view of the cluster's composition while maintaining our focus on the key findings related to PdeI and c-di-GMP.

      We have also included the aforementioned explanations and supporting experimental data within the manuscript to clarify this important point (lines 193-217). Thank you for highlighting this apparent contradiction, allowing us to provide a more detailed explanation of our findings.

      Recommendations for the authors:

      Reviewer #1 (Recommendations For The Authors): 

      Overall, I found the main text of the manuscript well written and easy to understand, though too compressed in parts to fully understand the details of the work presented, some examples are outlined below. The materials and methods appeared to be less carefully compiled and could use some careful proof-reading for spelling (e.g. repeated use of "minuts" for minutes, "datas" for data) and grammar and sentence fragments (e.g. "For exponential period E. coli data." Line 333). In general, the meaning is still clear enough to be understood. I also was unable to find figure captions for the supplementary figures, making these difficult to understand. 

      We appreciate your careful review, which has helped us improve the clarity and quality of our manuscript. We acknowledge that some parts of the main text may have been overly compressed due to the Short Report format in eLife. We have thoroughly reviewed the manuscript and expanded on key areas to provide more comprehensive explanations. We have carefully revised the Materials and Methods section: we corrected all spelling errors, including "minuts" to "minutes" and "datas" to "data", and we corrected grammatical issues and sentence fragments throughout the section. We sincerely apologize for the omission of captions for the supplementary figures. We have now added detailed captions for all supplementary figures to ensure they are easily understandable. We believe these revisions address your concerns and enhance the overall readability and comprehension of our work.

      General comments: 

      (1) To evaluate the performance of RiboD-PETRI, it would be helpful to have more details in general, particularly to do with the development of the sequencing protocol and the statistics shown. Some examples: How many reads were sequenced in each experiment? Of these, how many are mapped to the bacterial genome? How many reads were recovered per cell? Have the authors performed some kind of subsampling analysis to determine if their sequencing has saturated the detection of expressed genes? The authors show e.g. correlations between classic PETRI-seq and RiboD-PETRI for E. coli in Figure 1, but also have similar data for C. crescentus and S. aureus - do these data behave similarly? These are just a few examples, but I'm sure the authors have asked themselves many similar questions while developing this project; more details, hard numbers, and comparisons would be very much appreciated. 

      Thank you for your valuable feedback. To address your concerns, we have added a table in the supplementary material that clarifies the details of sequencing.

      The PETRI-seq and RiboD-PETRI data correlate relatively well in C. crescentus. The correlation in S. aureus (SA) is lower: the sequencing depths of the RiboD-PETRI and PETRI-seq libraries differ, resulting in much higher gene expression counts in the RiboD-PETRI results than in PETRI-seq, and the calculated correlation coefficient is only about 0.47. This indicates a positive but not particularly strong correlation between the two data sets. However, the comparison covers the expression of 2763 genes in total, so even this relatively low correlation coefficient shows some consistency between the two groups of samples.

      Author response image 1.

      Assessment of the effect of rRNA depletion on transcriptional profiles of (A) C. crescentus (CC) and (B) S. aureus (SA). The Pearson correlation coefficient (r) of UMI counts per gene (log2 UMIs) between RiboD-PETRI and PETRI-seq was calculated for 4097 genes (A) and 2763 genes (B). The "ΔΔ" label represents the RiboD-PETRI protocol; the "Ctrl" label represents the classic PETRI-seq protocol we performed. Each point represents a gene.
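
      The comparison described in this caption, a Pearson correlation of log2-transformed per-gene UMI counts between two libraries, can be sketched as follows. This is a minimal illustration rather than the authors' pipeline: the function name, the pseudocount of 1 (added so that genes with zero UMIs can be log-transformed), and the toy counts are all assumptions.

```python
import math


def pearson_log2_umis(umis_a, umis_b, pseudocount=1.0):
    """Pearson r between log2-transformed per-gene UMI totals of two libraries.

    umis_a and umis_b list UMI totals for the same genes in the same order.
    The pseudocount is an assumption; the response does not say how zeros are handled.
    """
    if len(umis_a) != len(umis_b):
        raise ValueError("both libraries must list the same genes in the same order")
    xs = [math.log2(u + pseudocount) for u in umis_a]
    ys = [math.log2(u + pseudocount) for u in umis_b]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Toy per-gene UMI totals (made up for illustration, not data from the study).
ctrl = [0, 3, 7, 15, 31]
ribod = [1, 7, 15, 31, 63]
r = pearson_log2_umis(ctrl, ribod)
print(f"Pearson r on log2 UMIs: {r:.3f}")
```

      The toy values are chosen so that the two log2 profiles differ only by a constant offset, which makes the result easy to check by hand.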

      (2) Additionally, I think it is critical that the authors provide processed read counts per cell and gene in their supplementary information to allow others to investigate the performance of their method without going back to raw FASTQ files, as this can represent a significant hurdle for reanalysis. 

      Thank you for your suggestion. However, it's important to clarify that reads and UMIs (Unique Molecular Identifiers) are distinct concepts in single-cell RNA sequencing. Reads can be influenced by PCR amplification during library construction, making their quantity less stable. In contrast, UMIs serve as a more reliable indicator of the number of mRNA molecules detected after PCR amplification. Throughout our study, we primarily utilized UMI counts for quantification. To address your concern about data accessibility, we have included the UMI counts per cell and gene in our supplementary materials (Tables S7-15; some of the files are too large and are therefore stored in GEO: GSE260458). This approach provides a more accurate representation of gene expression levels and allows for robust reanalysis without the need to process raw FASTQ files.

      (3) Finally, the authors should also discuss other approaches to ribosomal depletion in bacterial scRNA-seq. One of the figures appears to contain such a comparison, but it is never mentioned in the text that I can find, and one could read this manuscript and come away believing this is the first attempt to deplete rRNA from bacterial scRNA-seq. 

      We have addressed this concern by including a comparison of different methods for depleting rRNA from bacterial scRNA-seq in Table S4 and by adding a short comparison to the text as follows: “Additionally, we compared our findings with other reported methods (Fig. 1B; Table S4). The original PETRI-seq protocol, which does not include an rRNA depletion step, exhibited an mRNA detection rate of approximately 5%. The MicroSPLiT-seq method, which utilizes Poly A Polymerase for mRNA enrichment, achieved a detection rate of 7%. Similarly, M3-seq and BacDrop-seq, which employ RNase H to digest rRNA post-DNA probe hybridization in cells, reported mRNA detection rates of 65% and 61%, respectively. MATQ-DASH, which utilizes Cas9-mediated targeted rRNA depletion, yielded a detection rate of 30%. Among these, RiboD-PETRI demonstrated superior performance in mRNA detection while requiring the least sequencing depth.” We have added this content to the main text (lines 110-120), specifically in relation to Figure 1B and Table S4. This addition provides context for our method and clarifies its position among existing techniques.

      Detailed comments: 

      Line 78: the authors describe the multiplet frequency, but it is not clear to me how this was determined, for which experiments, or where in the SI I should look to see this. Often this is done by mixing cultures of two distinct bacteria, but I see no evidence of this key experiment in the manuscript. 

      The multiplet frequency we discuss in the manuscript was not determined through experimental mixing of distinct bacterial cultures. The PETRI-seq and microSPLiT articles have already performed experiments mixing two libraries to determine the single-cell rate, and both gave good results. Our technique is derived from these two articles (mainly PETRI-seq), and the main difference lies in the later RiboD step, so we did not repeat this experiment separately. The multiplet frequencies reported here are therefore theoretical predictions based on our sequencing results, calculated using a Poisson distribution. We have made this distinction clearer in our manuscript (lines 93-97). The method is described in the Materials and Methods section (lines 520-528). The data are available in Table S2. To elaborate:

      To assess the efficiency of single-cell capture in RiboD-PETRI, we calculated the multiplet frequency using a Poisson distribution based on our sequencing results.

      (1) Definition: In our study, multiplet frequency is defined as the probability of a non-empty barcode corresponding to more than one cell.

      (2) Calculation Method: We use a Poisson distribution-based approach to calculate the predicted multiplet frequency. The process involves several steps:

      We first calculate the proportion of barcodes corresponding to zero cells: P(0) = e^(-λ). Then, we calculate the proportion corresponding to one cell: P(1) = λe^(-λ). We derive the proportion for more than zero cells: P(≥1) = 1 - P(0). And for more than one cell: P(≥2) = 1 - P(1) - P(0). Finally, the multiplet frequency is calculated as: multiplet frequency = P(≥2) / P(≥1).

      (3) Parameter λ: This is the ratio of the number of cells to the total number of possible barcode combinations. For instance, when detecting 10,000 cells, λ equals 10,000 divided by the total number of barcode combinations.
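
      The Poisson calculation outlined in points (1)-(3) can be sketched in a few lines of Python. It uses the standard Poisson probabilities P(0) = e^(-λ) and P(1) = λe^(-λ) and takes the multiplet frequency as P(≥2) / P(≥1), following the definition in point (1). The function name is illustrative, and the barcode count of 96³ (three rounds of 96 barcodes) is an assumption made for the example, because the response does not state the total number of barcode combinations.

```python
import math


def multiplet_frequency(n_cells, n_barcodes):
    """Predicted fraction of non-empty barcodes that hold more than one cell."""
    if n_cells <= 0 or n_barcodes <= 0:
        raise ValueError("n_cells and n_barcodes must be positive")
    lam = n_cells / n_barcodes      # lambda: mean cells per barcode combination
    p0 = math.exp(-lam)             # P(0): barcode with no cell
    p1 = lam * math.exp(-lam)       # P(1): barcode with exactly one cell
    p_ge1 = 1.0 - p0                # P(>=1): non-empty barcode
    p_ge2 = 1.0 - p0 - p1           # P(>=2): barcode with more than one cell
    return p_ge2 / p_ge1


# Assumption for illustration: three rounds of 96 barcodes, i.e. 96**3 combinations.
N_BARCODES = 96 ** 3
freq = multiplet_frequency(10_000, N_BARCODES)
print(f"lambda = {10_000 / N_BARCODES:.4f}, predicted multiplet frequency = {freq:.2%}")
```

      For small λ the result is close to λ/2, so halving the number of cells loaded per barcode space roughly halves the predicted multiplet frequency.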

      Line 94: the concept of "percentage of gene expression" is never clearly defined. Does this mean the authors detect 99.86% of genes expressed in some cells? How is "expressed" defined - is this just detecting a single UMI? 

      The term "percentage gene expression" refers to the proportion of genes in the bacterial strain that were detected as expressed in the sequenced cell population. Specifically, in this context, it means that 99.86% of all genes in the bacterial strain were detected as expressed in at least one cell in our sequencing results. To define "expressed" more clearly: a gene is considered expressed if at least one UMI (Unique Molecular Identifier) is detected for it in any cell in the population. This definition allows for the detection of even low-level gene expression. To enhance clarity in the manuscript, we have rephrased the sentence as “transcriptome-wide gene coverage across the cell population”.
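
      The coverage definition above (a gene counts as expressed if at least one UMI is detected for it in any cell) can be sketched as follows. The nested-dictionary layout, the function names, and the toy counts are assumptions for illustration and do not reflect the authors' data format; the per-cell gene count is included because it is the complementary sensitivity measure.

```python
def gene_coverage(umi_counts, n_genes_total):
    """Fraction of annotated genes with >= 1 UMI in at least one cell.

    umi_counts maps cell barcode -> {gene name: UMI count}.
    """
    detected = {
        gene
        for per_cell in umi_counts.values()
        for gene, n_umis in per_cell.items()
        if n_umis >= 1
    }
    return len(detected) / n_genes_total


def genes_per_cell(umi_counts):
    """Number of genes with >= 1 UMI in each cell."""
    return {
        cell: sum(1 for n_umis in per_cell.values() if n_umis >= 1)
        for cell, per_cell in umi_counts.items()
    }


# Toy matrix: 3 cells, 4 annotated genes (made up for illustration).
toy_counts = {
    "cell_1": {"geneA": 2, "geneB": 0},
    "cell_2": {"geneB": 1},
    "cell_3": {"geneA": 1, "geneC": 5},
}
coverage = gene_coverage(toy_counts, n_genes_total=4)
per_cell = genes_per_cell(toy_counts)
print(f"population-level gene coverage: {coverage:.0%}; genes per cell: {per_cell}")
```

      Population-level coverage can be high even when each individual cell captures only a small number of genes, which is why both numbers are worth reporting.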

      Line 98: The authors discuss the number of recovered UMIs throughout this paragraph, but there is no clear discussion of the number of detected expressed genes per cell. Could the authors include a discussion of this as well, as this is another important measure of sensitivity? 

      We appreciate your suggestion to include a discussion on the number of detected expressed genes per cell, as this is indeed another important measure of sensitivity. We would like to clarify that we have actually included statistics on the number of genes detected across all cells in the main text of our paper. This information is presented as percentages. However, we understand that you may be looking for a more detailed representation, similar to the UMI statistics we provided. To address this, we have now added a new analysis showing the number of genes detected per cell (lines 132-133, 138-139, 144-145 and 184-186, Fig. 2B, 3B and S2B). This additional result complements our existing UMI data and provides a more comprehensive view of the sensitivity of our method. We have included this new gene-per-cell statistical graph in the supplementary materials.

      Figure 1B: I presume ctrl and delta delta represent the classic PETRI-seq and RiboD protocols, respectively, but this is not specified. This should be clarified in the figure caption, or the names changed. 

      We appreciate you bringing this to our attention. We acknowledge that the labeling in the figure could have been clearer. We have now clarified this information in the figure caption. To provide more specificity: the "ΔΔ" label represents the RiboD-PETRI protocol; the "Ctrl" label represents the classic PETRI-seq protocol we performed. We have updated the figure caption to include these details, which should help readers better understand the protocols being compared in the figure.

      Line 104: the authors claim "This performance surpassed other reported bacterial scRNA-seq methods" with a long number of references to other methods. "Performance" is not clearly defined, and it is unclear what the exact claim being made is. The authors should clarify what they're claiming, and further discuss the other methods and comparisons they have made with them in a thorough and fair fashion. 

      We appreciate your request for clarification, and we acknowledge that our definition of "performance" should have been more explicit. We would like to clarify that in this context, we define performance primarily in terms of the proportion of mRNA captured. Our improved method demonstrates a significantly higher rate of rRNA removal compared to other bacterial single-cell library construction methods. This results in a higher proportion of mRNA in our sequencing data, which we consider a key performance metric for single-cell RNA sequencing in bacteria. Additionally, when compared to our previous method, PETRI-seq, our improved approach not only enhances rRNA removal but also reduces library construction costs. This dual improvement in both data quality and cost-effectiveness is what we intended to convey with our performance claim.

      We recognize that a more thorough and fair discussion of other methods and their comparisons would be beneficial. We have summarized the comparison in Table S4 and added a brief discussion to the main text (lines 106-120). This addition provides context for our method and clarifies its position among existing techniques.

      Figure 1D: Do the authors have any explanation for the relatively lower performance of their C. crescentus depletion? 

      We appreciate your attention to detail and the opportunity to address this point. The lower efficiency of rRNA removal in C. crescentus compared to other species can be attributed to inherent differences between species. It's important to note that a single method for rRNA depletion may not be universally effective across all bacterial species due to variations in their genetic makeup and rRNA structures. Different bacterial species can have unique rRNA sequences, secondary structures, or associated proteins that may affect the efficiency of our depletion method. This species-specific variation highlights the challenges in developing a one-size-fits-all approach for bacterial rRNA depletion. While our method has shown high efficiency across several species, the results with C. crescentus underscore the need for continued refinement and possibly species-specific optimizations in rRNA depletion techniques. We thank you for bringing attention to this point, as it provides valuable insight into the complexities of bacterial rRNA depletion and areas for future improvement in our method.

      Line 118: The authors claim RiboD-PETRI has a "consistent ability to unveil within-population heterogeneity", however the preceding paragraph shows it detects potential heterogeneity, but provides no evidence this inferred heterogeneity reflects the reality of gene expression in individual cells. 

      We appreciate your careful reading and the opportunity to clarify this point. We acknowledge that our wording may have been too assertive given the evidence presented. We acknowledge that the subpopulations of cells identified in other species have not undergone experimental verification. Our intention in presenting these results was to demonstrate RiboD-PETRI's capability to detect “potential” heterogeneity consistently across different bacterial species, showcasing the method's sensitivity and potential utility in exploring within-population diversity. However, we agree that without further experimental validation, we cannot definitively claim that these detected differences represent true biological heterogeneity in all cases. We have revised this section to reflect the current state of our findings more accurately, emphasizing that while RiboD-PETRI consistently detects potential heterogeneity across species, further experimental validation would be required to confirm the biological significance of the observations (lines 169-171).

      Figure 1 H&I: I'm not entirely sure what I am meant to see in these figures, presumably some evidence for heterogeneity in gene expression. Are there better visualizations that could be used to communicate this? 

      We appreciate your suggestion for improving the visualization of gene expression heterogeneity. We have explored alternative visualization methods in the revised manuscript. Specifically, for the expression levels of marker genes shown in Figure 1H (which is Figure 2D now), we have created violin plots (Supplementary Fig. 4). These plots offer a more comprehensive view of the distribution of expression levels across different cell populations, making it easier to discern heterogeneity. However, due to the number of marker genes and the resulting volume of data, these violin plots are quite extensive and would occupy a significant amount of space. Given the space constraints of the main figure, we propose to include these violin plots as a Fig. S4 immediately following Figure 1 H&I (which is Figure 2D&E now). This arrangement will allow readers to access more detailed information about these marker genes while maintaining the concise style of the main figure.

      Regarding the pathway enrichment figure (Figure 2E), we have also considered your suggestion for improvement. We attempted to use a dot plot to display the KEGG pathway enrichment of the genes. However, our analysis revealed that the genes were only enriched in a single pathway. As a result, the visual representation using a dot plot still did not produce a particularly aesthetically pleasing or informative figure.

      Line 124: The authors state no significant batch effect was observed, but in the methods on line 344 they specify batch effects were removed using Harmony. It's unclear what exactly S2 is showing without a figure caption, but the authors should clarify this discrepancy. 

      We apologize for any confusion caused by the lack of a clear figure caption for Figure S2 (which is Figure S3D now). To address your concern, in addition to adding captions for the supplementary figures, we would also like to provide more context about the batch effect analysis. In Supplementary Fig. S3, Panel C represents the results without using Harmony for batch effect removal, while Panel D shows the results after applying Harmony. In both panels A and B, the distributions of samples one and two do not show substantial differences. Based on this observation, we concluded that there was no significant batch effect between the two samples. However, we acknowledge that even subtle batch effects could potentially influence downstream analyses. Therefore, out of an abundance of caution and to ensure the highest quality of our results, we decided to apply Harmony to remove any potential minor batch effects. This approach aligns with best practices in single-cell analysis, where even small technical variations are often accounted for to enhance the robustness of the results.

      To improve clarity, we have revised our manuscript to better explain this nuanced approach: 1. We have updated the statement to reflect that while no major batch effect was observed, we applied batch correction as a precautionary measure (lines 181-182). 2. We have added a detailed caption to Figure S3, explaining the comparison between non-corrected and batch-corrected data. 3. We have modified the methods section to clarify that Harmony was applied as a precautionary step, despite the absence of obvious batch effects (lines 492-493).

      Figure 2D: I found this panel fairly uninformative, is there a better way to communicate this finding? 

      Thank you for your feedback regarding Figure 2D. We have explored alternative ways to present this information, using a dot plot to display the enrichment pathways, as this is often an effective method for visualizing such data. Meanwhile, we also provided a more detailed textual description of the enrichment results in the main text, highlighting the most significant findings.

      Figure 2I: the figure itself and caption say GFP, but in the text and elsewhere the authors say this is a BFP fusion. 

      We appreciate your careful review of our manuscript and figures. We apologize for any confusion this may have caused. To clarify: Both GFP (Green Fluorescent Protein) and BFP (Blue Fluorescent Protein) were indeed used in our experiments, but for different purposes: 1. GFP was used for imaging to observe the location of PdeI in bacteria and persister cell growth, which is shown in Figure 4C and 4K. 2. BFP was used for cell sorting, imaging of location in the biofilm, and detecting the proportion of persister cells, which is shown in Figure 4D and 4F-J. To address this inconsistency and improve clarity, we have made the following corrections: 1. We have reviewed the main text to ensure that references to GFP and BFP are accurate and consistent with their respective uses in our experiments. 2. We have added a note in the figure caption for Figure 4C to explicitly state that this particular image shows GFP fluorescence for the location of PdeI. 3. In the methods section, we have provided a clear explanation of how both fluorescent proteins were used in different aspects of our study (lines 326-340).

      Line 156: The authors compare prices between RiboD and PETRI-seq. It would be helpful to provide a full cost breakdown, e.g. in supplementary information, as it is unclear exactly how the authors came to these numbers or where the major savings are (presumably in sequencing depth?) 

      We appreciate your suggestion to provide a more detailed cost breakdown, and we agree that this would enhance the transparency and reproducibility of our cost analysis. In response to your feedback, we have prepared a comprehensive cost breakdown that includes all materials and reagents used in the library preparation process. Additionally, we've factored in the sequencing depth (50G) and the unit price for sequencing (25¥/G). These calculations allow us to determine the cost per cell after sequencing. As you correctly surmised, a significant portion of the cost reduction is indeed related to sequencing depth. However, there are also savings in the library preparation steps that contribute to the overall cost-effectiveness of our method. We propose to include this detailed cost breakdown as a supplementary table (Table S6) in our paper. This table will provide a clear, itemized list of all expenses involved, including: 1. Reagents and materials for library preparation 2. Sequencing costs (depth and price per G) 3. Calculated cost per cell.
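
      The arithmetic behind a per-cell cost can be sketched as follows. Only the sequencing depth (50 G) and the unit price (25 ¥/G) come from the response above; the library preparation cost and the cell count are hypothetical placeholders, not values from Table S6.

```python
def sequencing_cost(depth_gb: float, price_per_gb: float) -> float:
    """Total sequencing cost for a library, in the same currency as the unit price."""
    return depth_gb * price_per_gb


def cost_per_cell(library_prep_cost: float, seq_cost: float, n_cells: int) -> float:
    """Combined library preparation and sequencing cost, divided by the cells recovered."""
    if n_cells <= 0:
        raise ValueError("n_cells must be positive")
    return (library_prep_cost + seq_cost) / n_cells


# Values stated in the response: 50 G of sequencing at 25 yuan per G.
seq = sequencing_cost(50, 25)

# Hypothetical placeholders for illustration only (not from the manuscript).
HYPOTHETICAL_PREP_COST = 1000.0
HYPOTHETICAL_CELLS = 30_000

per_cell = cost_per_cell(HYPOTHETICAL_PREP_COST, seq, HYPOTHETICAL_CELLS)
print(f"sequencing: {seq} yuan, per cell: {per_cell:.4f} yuan")
```

      Separating the two cost terms makes it easy to see how much of a saving comes from sequencing depth and how much from library preparation.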

      Line 291: The design and production of the depletion probes are not clearly explained. How did the authors design them? How were they synthesized? Also, it appears the authors have separate probe sets for E. coli, C. crescentus, and S. aureus - this should be clarified, possibly in the main text.

      Thank you for your important questions regarding the design and production of our depletion probes. We included the detailed probe information in Supplementary Table S1; however, we did not describe it in the main text because of the constraints of the Short Report format in eLife. We appreciate the opportunity to provide clarifications.

      The core principle behind our probe design is that the probe sequences are reverse complementary to the r-cDNA sequences. This design allows for specific recognition of r-cDNA. The probes are then bound to magnetic beads, allowing the r-cDNA-probe-bead complexes to be separated from the rest of the library. To address your specific questions: 1. Probe Design: We designed separate probe sets for E. coli, C. crescentus, and S. aureus. Each set was specifically constructed to be reverse complementary to the r-cDNA sequences of its respective bacterial species. This species-specific approach ensures high efficiency and specificity in rRNA depletion for each organism. The hybrid DNA complex was then removed by Streptavidin magnetic beads. 2. Probe Synthesis: The probes were synthesized based on these design principles. 3. Species-Specific Probe Sets: You are correct in noting that we used separate probe sets for each bacterial species. We have clarified this important point in the main text to ensure readers understand the specificity of our approach. To further illustrate this process, we have created a schematic diagram showing the principle of rRNA removal (Fig. 1A) and clarified the design principle in its legend.
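
      The reverse-complement principle described above can be illustrated with a minimal sketch. The sequence, probe length, and tiling step below are made up for illustration; the actual probe sequences are those listed in Supplementary Table S1, and the response does not state how probe windows were chosen.

```python
# Translation table mapping each DNA base to its complement.
_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def tile_probes(r_cdna: str, probe_len: int, step: int) -> list[str]:
    """Slide a window along an r-cDNA sequence and emit the reverse
    complement of each window as a candidate probe (hypothetical tiling)."""
    probes = []
    for start in range(0, len(r_cdna) - probe_len + 1, step):
        window = r_cdna[start:start + probe_len]
        probes.append(reverse_complement(window))
    return probes


# Toy sequence, NOT a real rRNA-derived cDNA fragment.
toy_r_cdna = "ATGCCGTAAGCTTGCA"
probes = tile_probes(toy_r_cdna, probe_len=8, step=4)
for p in probes:
    print(p)
```

      Because each probe is the reverse complement of its target window, it can hybridize to the r-cDNA strand, which is what allows the probe-bound fraction to be pulled out of the library.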

      Line 362: I didn't see a description of the construction of the PdeI-BFP strain, I assume this would be important for anyone interested in the specific work on PdeI. 

      Thank you for your astute observation regarding the construction of the PdeI-BFP strain. We appreciate the opportunity to provide this important information. The PdeI-BFP strain was constructed as follows: 1. We cloned the pdeI gene along with its native promoter region (250bp) into a pBAD vector. 2. The original promoter region of the pBAD vector was removed to avoid any potential interference. 3. This construction enables the expression of the PdeI-BFP fusion protein to be regulated by the native promoter of pdeI, thus maintaining its physiological control mechanisms. 4. The BFP coding sequence was fused to the pdeI gene to create the PdeI-BFP fusion construct. We have added a detailed description of the PdeI-BFP strain construction to our methods section (lines 327-334).

      Reviewer #2 (Recommendations For The Authors): 

      (1) General remarks: 

      Reconsider using 'advanced' in the title. It is highly generic and misleading. Perhaps 'cost-efficient' would be a more precise substitute. 

      Thank you for your valuable suggestion. After careful consideration, we have decided to use "improved" in the title. Firstly, our method presents an efficient solution to a persistent challenge in bacterial single-cell RNA sequencing, specifically addressing rRNA abundance. Secondly, it facilitates precise exploration of bacterial population heterogeneity. We believe our method encompasses more than just cost-effectiveness, justifying the use of the term "improved."

      Consider expanding the introduction. The introduction does not explain the setup of the biological question or basic details such as the organism(s) for which the technique has been developed, or which species biofilms were studied. 

      Thank you for your valuable feedback regarding our introduction. We acknowledge that our writing was compressed because of the constraints of the Short Report format in eLife. We appreciate the opportunity to expand this crucial section, which improves the clarity and impact of our manuscript's introduction.

      We revised our introduction (lines 53-80) according to the following principles:

      (1) Initial Biological Question: We explained the initial biological question that motivated our research—understanding the heterogeneity in E. coli biofilms—to provide essential context for our technological development.

      (2) Limitations of Existing Techniques: We briefly described the limitations of current single-cell sequencing techniques for bacteria, particularly regarding their application in biofilm studies.

      (3) Introduction of Improved Technique: We introduced our improved technique, initially developed for E. coli.

      (4) Research Evolution: We highlighted how our research has evolved, demonstrating that our technique is applicable not only to E. coli but also to Gram-positive bacteria and other Gram-negative species, showcasing the broad applicability of our method.

      (5) Specific Organisms Studied: We provided examples of the specific organisms we studied, encompassing both Gram-positive and Gram-negative bacteria.

      (6) Potential Implications: Finally, we outlined the potential implications of our technique for studying bacterial heterogeneity across various species and contexts, extending beyond biofilms.

      (2) Writing remarks: 

      43-45 Reword: "Thus, we address a persistent challenge in bacterial single-cell RNA-seq regarding rRNA abundance, exemplifying the utility of this method in exploring biofilm heterogeneity.". 

      Thank you for highlighting this sentence and requesting a rewording. We appreciate the opportunity to improve the clarity and impact of our statement. We have reworded the sentence as: "Our method effectively tackles a long-standing issue in bacterial single-cell RNA-seq: the overwhelming abundance of rRNA. This advancement significantly enhances our ability to investigate the intricate heterogeneity within biofilms at unprecedented resolution." (lines 47-50)

      49 "Biofilms, comprising approximately 80% of chronic and recurrent microbial infections in the human body..." - probably meant 'contribute to'. 

      Thank you for catching this imprecision in our statement. We have reworded the sentence as: "Biofilms contribute to approximately 80% of chronic and recurrent microbial infections in the human body..."

      54-55 Please expand on "this". 

      Thank you for your request to expand on the use of "this" in the sentence. You're right that more clarity would be beneficial here. We have revised and expanded this section in lines 54-69.

      81-84 Unclear why these species samples were either at exponential or stationary phases. The growth stage can influence the proportion of rRNA and other transcripts in the population. 

      Thank you for raising this important point about the growth phases of the bacterial samples used in our study. We appreciate the opportunity to clarify our experimental design. To evaluate the performance of RiboD-PETRI, we designed a comprehensive assessment of rRNA depletion efficiency under diverse physiological conditions, specifically contrasting exponential and stationary phases. This approach allows us to understand how these different growth states impact rRNA depletion efficacy. Additionally, we included a variety of bacterial species, encompassing both gram-negative and gram-positive organisms, to ensure that our findings are broadly applicable across different types of bacteria. By incorporating these variables, we aim to provide insights into the robustness and reliability of the RiboD-PETRI method in various biological contexts. We have included this rationale in our result section (lines 99-106), providing readers with a clear understanding of our experimental design choices.

      86 "compared TO PETRI-seq " (typo). 

      We have corrected this typo in our manuscript.

      94 "gene expression collectively" rephrase. Probably this means coverage of the entire gene set across all cells. Same for downstream usage of the phrase. 

      Thank you for pointing out this ambiguity in our phrasing. Your interpretation of our intended meaning is accurate. We have rephrased the sentence as “transcriptome-wide gene coverage across the cell population”.

      97 What were the median UMIs for the 30,000 cell library {greater than or equal to}15 UMIs? Same question for the other datasets. This would reflect a more comparable statistic with previous studies than the top 3% of the cells for example, since the distributions of the single-cell UMIs typically have a long tail. 

      Thank you for this insightful question and for pointing out the importance of providing more comparable statistics. We agree that median values offer a more robust measure of central tendency, especially for datasets with long-tailed distributions, which are common in single-cell studies. The suggestion to include median Unique Molecular Identifier (UMI) counts would indeed provide a more comparable statistic with previous studies. We have analyzed the median UMIs for our libraries as follows and revised our manuscript according to the analysis (lines 126-130, 133-136, 139-142 and 175-180).

      (1) Median UMI count in Exponential Phase E. coli:

      Total: 102 UMIs per cell

      Top 1,000 cells: 462 UMIs per cell

      Top 5,000 cells: 259 UMIs per cell

      Top 10,000 cells: 193 UMIs per cell

      (2) Median UMI count in Stationary Phase S. aureus:

      Total: 142 UMIs per cell

      Top 1,000 cells: 378 UMIs per cell

      Top 5,000 cells: 207 UMIs per cell

      Top 8,000 cells: 167 UMIs per cell

      (3) Median UMI count in Exponential Phase C. crescentus:

      Total: 182 UMIs per cell

      Top 1,000 cells: 2,190 UMIs per cell

      Top 5,000 cells: 662 UMIs per cell

      Top 10,000 cells: 225 UMIs per cell

      (4) Median UMI count in Static E. coli Biofilm:

      Total of Replicate 1: 34 UMIs per cell

      Total of Replicate 2: 52 UMIs per cell

      Top 1,621 cells of Replicate 1: 283 UMIs per cell

      Top 3,999 cells of Replicate 2: 239 UMIs per cell

      104-105 The performance metric should again be the median UMIs of the majority of the cells passing the filter (15 mRNA UMIs is reasonable). The top 3-5% are always much higher in resolution because of the heavy tail of the single-cell UMI distribution. It is unclear if the performance surpasses the other methods using the comparable metric. Recommend removing this line. 

      We appreciate your suggestion regarding the use of median UMIs as a more appropriate performance metric, and we agree that comparing the top 3-5% of cells can be misleading due to the heavy tail of the single-cell UMI distribution. We have removed the line in question (104-105) that compares our method's performance based on the top 3-5% of cells in the revised manuscript. Instead, we focused on presenting the median UMI counts for cells passing the filter (≥15 mRNA UMIs) as the primary performance metric. This will provide a more representative and comparable measure of our method's performance. We have also revised the surrounding text to reflect this change, ensuring that our claims about performance are based on these more robust statistics (lines 126-130, 133-136, 139-142 and 175-180).
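
      The difference between the two metrics discussed here can be sketched as follows: a median over all cells passing the ≥15 UMI filter versus a median over only the top-ranked cells. The UMI counts below are toy values, and the function name and parameters are illustrative rather than taken from the authors' pipeline.

```python
from statistics import median


def median_umis(umis_per_cell, min_umis=15, top_n=None):
    """Median UMI count over cells passing the filter.

    If top_n is given, only the top_n cells with the most UMIs are used,
    which inflates the statistic when the distribution has a long tail.
    """
    kept = sorted((u for u in umis_per_cell if u >= min_umis), reverse=True)
    if top_n is not None:
        kept = kept[:top_n]
    if not kept:
        raise ValueError("no cells pass the filter")
    return median(kept)


# Toy long-tailed distribution (illustrative values only).
toy_counts = [3, 10, 15, 20, 40, 80, 1000]

all_filtered = median_umis(toy_counts)          # all cells with >= 15 UMIs
top_three = median_umis(toy_counts, top_n=3)    # only the 3 richest cells
print(all_filtered, top_three)
```

      On this toy input the top-ranked subset gives a median twice that of the full filtered set, which is the effect the reviewer describes.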

      106-108 The sequencing saturation of the libraries (in %), and downsampling analysis should be added to illustrate this point. 

      Thank you for your valuable suggestion. Your recommendation to add sequencing saturation and downsampling analysis is highly valuable and will help better illustrate our point. Based on your feedback, we have revised our manuscript by adding the following content:

      To provide a thorough evaluation of our sequencing depth and library quality, we performed sequencing saturation analysis on our sequencing samples. The findings reveal that our sequencing saturation is 100% (Fig. 8A & B), indicating that our sequencing depth is sufficient to capture the diversity of most transcripts. To further illustrate the impact of our downstream analysis on the datasets, we have demonstrated the data distribution before and after applying our filtering criteria (Fig. S1B & C). These figures effectively visualized the influence of our filtering process on the data quality and distribution. After filtering, we can have a more refined dataset with reduced noise and outliers, which enhances the reliability of our downstream analyses.

      We have also ensured that a detailed description of the sequencing saturation method is included in the manuscript to provide readers with a comprehensive understanding of our methodology. We appreciate your feedback and believe these additions significantly improve our work.
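
      For readers unfamiliar with the metric, one common definition of sequencing saturation is one minus the ratio of unique (cell barcode, UMI, gene) combinations to total reads, evaluated on progressively downsampled read sets. The sketch below assumes that definition; it is not necessarily the calculation used in the manuscript, and the reads are toy data.

```python
import random


def sequencing_saturation(reads):
    """Saturation = 1 - unique (barcode, UMI, gene) tuples / total reads.

    `reads` is a list with one (cell_barcode, umi, gene) tuple per read.
    """
    if not reads:
        raise ValueError("reads must be non-empty")
    return 1.0 - len(set(reads)) / len(reads)


def downsample_curve(reads, fractions, seed=0):
    """Saturation at each sampling fraction, using reads drawn without replacement."""
    rng = random.Random(seed)
    curve = []
    for fraction in fractions:
        n = max(1, int(len(reads) * fraction))
        subset = rng.sample(reads, n)
        curve.append((fraction, sequencing_saturation(subset)))
    return curve


# Toy reads: 8 reads covering 3 unique molecules.
toy_reads = (
    [("cell1", "umi1", "geneA")] * 3
    + [("cell1", "umi2", "geneA")]
    + [("cell2", "umi1", "geneB")] * 4
)

full = sequencing_saturation(toy_reads)
curve = downsample_curve(toy_reads, fractions=[0.25, 0.5, 1.0])
print(full, curve)
```

      A curve that flattens as the sampling fraction approaches 1.0 indicates that additional sequencing would mostly re-read molecules already observed.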

      122: Please provide more details about the biofilm setup, including the media used. I did not find them in the methods. 

      We appreciate your attention to detail, and we agree that this information is crucial for the reproducibility of our experiments. We propose to add the following information to our methods section (lines 311-318):

      "For the biofilm setup, bacterial cultures were grown overnight. The next day, we diluted the culture 1:100 in a petri dish. We added 2 mL of LB medium to the dish. If the bacteria contain a plasmid, the appropriate antibiotic needs to be added to LB. The petri dish was then incubated statically in a growth chamber for 24 hours. After incubation, we performed imaging directly under the microscope. The petri dishes used were glass-bottom dishes from Biosharp (catalog number BS-20-GJM), allowing for direct microscopic imaging without the need for cover slips or slides. This setup allowed us to grow and image the biofilms in situ, providing a more accurate representation of their natural structure and composition."

      125: "sequenced 1,563 reads" missing "with" 

      Thank you for correcting our grammar. We have revised the phrase as “sequenced with 1,563 reads”.

      126: "283/239 UMIs per cell" unclear. 283 and 239 UMIs per cell per replicate, respectively? 

      Thank you for pointing out this ambiguity. We have revised the phrase as “283 and 239 UMIs per cell per replicate, respectively” (line 184).

      Figure 1D: Please indicate where the comparison datasets are from. 

      We appreciate your question regarding the source of the comparison datasets in Figure 1D. All data presented in Figure 1D are from our own sequencing experiments. We did not use data from other publications for this comparison. Specifically, we performed sequencing on E. coli cells in the exponential growth phase using three different library preparation methods: RiboD-PETRI, PETRI-seq, and RNA-seq. The data shown in Figure 1D represent a comparison of UMIs and/or reads correlations obtained from these three methods. All sequencing results have been uploaded to the Gene Expression Omnibus (GEO) database. The accession number is GSE260458. We have updated the figure legend for Figure 1D to clearly state that all datasets are from our own experiments, specifying the different methods used.

      Figure 1I, 2D: Unable to interpret the color block in the data. 

      We apologize for any confusion regarding the interpretation of the color blocks in Figures 1I and 2D (which are now Figures 2E and 3E). The color blocks in these figures represent the p-values of the data points. The color scale ranges from red to blue. Red colors indicate smaller p-values, suggesting higher statistical significance and more reliable results. Blue colors indicate larger p-values, suggesting lower statistical significance and less reliable results. We have updated the figure legends for both Figure 2E and Figure 3E to include this explanation of the color scale. Additionally, we have added a color legend to each figure to make the interpretation more intuitive for readers.

      Figure1H and 2C: Gene names should be provided where possible. The locus tags are highly annotation-dependent and hard to interpret. Also, a larger size figure should be helpful. The clusters 2 and 3 in 2C are the most important, yet because they have few cells, very hard to see in this panel. 

      We appreciate your suggestions for improving the clarity and interpretability of Figures 1H and 2C (which are now Figures 2D and 3D). We have replaced the locus tags with gene names where possible in both figures. We have increased the size of both figures to improve visibility and readability. We have also made Clusters 2 and 3 in Figure 3D more prominent in the revised figure. Despite their smaller cell count, we recognize their importance and have adjusted the visualization to ensure they are clearly visible. We believe these modifications will significantly enhance the clarity and informativeness of Figures 2D and 3D.

      (3) Questions to consider further expanding on, by more analyses or experiments and in the discussion: 

      What are the explanations for the apparently contradictory upregulation of c-di-GMP in cells expressing higher PdeI levels? How could a phosphodiesterase lead to increased c-di-GMP levels? 

      We appreciate the reviewer's observation regarding the seemingly contradictory relationship between increased PdeI expression and elevated c-di-GMP levels. This is indeed an intriguing finding that warrants further explanation.

      PdeI was predicted to be a phosphodiesterase responsible for c-di-GMP degradation. This prediction is based on sequence analysis where PdeI contains an intact EAL domain known for degrading c-di-GMP. However, it is noteworthy that PdeI also contains a divergent GGDEF domain, which is typically associated with c-di-GMP synthesis (Fig S8). This dual-domain architecture suggests that PdeI may engage in complex regulatory roles. Previous studies have shown that the knockout of the major phosphodiesterase PdeH in E. coli leads to the accumulation of c-di-GMP. Further, a point mutation on PdeI's divergent GGDEF domain (G412S) in this PdeH knockout strain resulted in decreased c-di-GMP levels (2), implying that the wild-type GGDEF domain in PdeI contributes to the maintenance or increase of c-di-GMP levels in the cell. Importantly, our single-cell experiments showed a positive correlation between PdeI expression levels and c-di-GMP levels (Response Fig. 9B). In this revision, we also constructed a PdeI(G412S)-BFP mutant strain. Notably, our observations of this strain revealed that c-di-GMP levels remained constant despite increasing BFP fluorescence, which serves as a proxy for PdeI(G412S) expression levels (Fig. 4D). This experimental evidence, along with domain analysis, suggests that PdeI could contribute to c-di-GMP synthesis, rebutting the notion that it solely functions as a phosphodiesterase. HPLC LC-MS/MS analysis further confirmed that PdeI overexpression, induced by arabinose, led to an upregulation of c-di-GMP levels (Fig. 4E). These results strongly suggest that PdeI plays a significant role in upregulating c-di-GMP levels. Our further analysis revealed that PdeI contains a CHASE (cyclases/histidine kinase-associated sensory) domain. Combined with our experimental results demonstrating that PdeI is a membrane-associated protein, we hypothesize that PdeI functions as a sensor that integrates environmental signals with c-di-GMP production under complex regulatory mechanisms.

      We have also included this explanation (lines 193-217) and the supporting experimental data (Fig. 4D & 4J) in our manuscript to clarify this important point. Thank you for highlighting this apparent contradiction, as it has allowed us to provide a more comprehensive explanation of our findings.

      What about the rest of the genes in cluster 2 of the biofilm? They should be used to help interpret the association between PdeI and c-di-GMP. 

      We understand your interest in the other genes present in cluster 2 of the biofilm and their potential relationship to PdeI and c-di-GMP. After careful analysis, we have determined that the other marker genes in this cluster do not have a significant impact on biofilm formation. Furthermore, we have not found any direct relationship between these genes and c-di-GMP or PdeI. Our focus on PdeI in this cluster is due to its unique and significant role in c-di-GMP regulation and biofilm formation, as demonstrated by our experimental results. While the other genes in this cluster may be co-expressed, their functions appear to be unrelated to the PdeI and c-di-GMP pathway we are investigating. We chose not to elaborate on these genes in our main discussion as they do not contribute directly to our understanding of the PdeI and c-di-GMP association. Instead, we could include a brief mention of these genes in the manuscript, noting that they were found to be unrelated to the PdeI-c-di-GMP pathway. This would provide a more comprehensive view of the cluster composition while maintaining focus on the key findings related to PdeI and c-di-GMP.

      Author response image 2.

      Protein-protein interactions of marker genes in cluster 2 of the 24-hour static E. coli biofilm data.

      A verification is needed that the protein fusion to PdeI functional/membrane localization is not due to protein interactions with fluorescent protein fusion. 

      We appreciate your concern regarding the potential impact of the fluorescent protein fusion on the functionality and membrane localization of PdeI. It is crucial to verify that the observed effects are attributable to PdeI itself and not an artifact of its fusion with the fluorescent protein. To address this matter, we have incorporated a control group expressing only the fluorescent protein BFP (without the PdeI fusion) under the same promoter. This experimental design allows us to differentiate between effects caused by PdeI and those potentially arising from the fluorescent protein alone.

      Our results revealed the following key observations:

      (1) Cellular Localization: The GFP alone exhibited a uniform distribution in the cytoplasm of bacterial cells, whereas the PdeI-GFP fusion protein was specifically localized to the membrane (Fig. 4C).

      (2) Localization in the Biofilm Matrix: BFP-positive cells were distributed throughout the entire biofilm community. In contrast, PdeI-BFP positive cells localized at the bottom of the biofilm, where cell-surface adhesion occurs (Fig 4F).

      (3) c-di-GMP Levels: Cells with high levels of BFP displayed no increase in c-di-GMP levels. Conversely, cells with high levels of PdeI-BFP exhibited a significant increase in c-di-GMP levels (Fig. 4D).

      (4) Persister Cell Ratio: Cells expressing high levels of BFP showed no increase in persister ratios, while cells with elevated levels of PdeI-BFP demonstrated a marked increase in persister ratios (Fig. 4J).

      These findings from the control experiments have been included in our manuscript (lines 193-244, Fig. 4C, 4D, 4F, 4G and 4J), providing robust validation of our results concerning the PdeI fusion protein. They confirm that the observed effects are indeed due to PdeI and not merely artifacts of the fluorescent protein fusion.

      (1) Vrabioiu, A. M. & Berg, H. C. Signaling events that occur when cells of Escherichia coli encounter a glass surface. Proceedings of the National Academy of Sciences of the United States of America 119 (2022). https://doi.org/10.1073/pnas.2116830119

      (2) Reinders, A. et al. Expression and Genetic Activation of Cyclic Di-GMP-Specific Phosphodiesterases in Escherichia coli. J Bacteriol 198, 448-462 (2016). https://doi.org/10.1128/JB.00604-15

    1. Author Response

      The following is the authors’ response to the original reviews.

      Major comments (Public Reviews)

      Generality of grid cells

      We appreciate the reviewers’ concern regarding the generality of our approach, and in particular for analogies in nonlinear spaces. In that regard, there are at least two potential directions that could be pursued. One is to directly encode nonlinear structures (such as trees, rings, etc.) with grid cells, to which DPP-A could be applied as described in our model. The TEM model [1] suggests that grid cells in the medial entorhinal cortex may form a basis set that captures structural knowledge for such nonlinear spaces, such as social hierarchies and transitive inference when formalized as a connected graph. Another would be to use eigen-decomposition of the successor representation [2], a learnable predictive representation of possible future states that has been shown by Stachenfeld et al. [3] to provide an abstract structured representation of a space that is analogous to the grid cell code. This general-purpose mechanism could be applied to represent analogies in nonlinear spaces [4], for which there may not be a clear factorization in terms of grid cells (i.e., distinct frequencies and multiple phases within each frequency). Since the DPP-A mechanism, as we have described it, requires representations to be factored in this way, it would need to be modified for such a purpose. Either of these approaches, if successful, would allow our model to be extended to domains containing nonlinear forms of structure. To the extent that different coding schemes (i.e., basis sets) are needed for different forms of structure, the question of how these are identified and engaged for use in a given setting is clearly an important one that is not addressed by the current work. We imagine that this is likely subserved by monitoring and selection mechanisms proposed to underlie the capacity for selective attention and cognitive control [5], though the specific computational mechanisms that underlie this function remain an important direction for future research.
We have added a discussion of these issues in Section 6 of the updated manuscript.

      (1) Whittington, J.C., Muller, T.H., Mark, S., Chen, G., Barry, C., Burgess, N. and Behrens, T.E., 2020. The Tolman-Eichenbaum machine: unifying space and relational memory through generalization in the hippocampal formation. Cell, 183(5), pp.1249-1263.

      (2) Dayan, P., 1993. Improving generalization for temporal difference learning: The successor representation. Neural computation, 5(4), pp.613-624.

      (3) Stachenfeld, K.L., Botvinick, M.M. and Gershman, S.J., 2017. The hippocampus as a predictive map. Nature neuroscience, 20(11), pp.1643-1653.

      (4) Frankland, S., Webb, T.W., Petrov, A.A., O'Reilly, R.C. and Cohen, J., 2019. Extracting and Utilizing Abstract, Structured Representations for Analogy. In CogSci (pp. 1766-1772).

      (5) Shenhav, A., Botvinick, M.M. and Cohen, J.D., 2013. The expected value of control: an integrative theory of anterior cingulate cortex function. Neuron, 79(2), pp.217-240.

      Biological plausibility of DPP-A

      We appreciate the reviewers’ interest in the biological plausibility of our model, and in particular the question of whether and how DPP-A might be implemented in a neural network. In that regard, Bozkurt et al. [1] recently proposed a biologically plausible neural network algorithm using a weighted similarity matrix approach to implement a determinant maximization criterion, which is the core idea underlying the objective function we use for DPP-A, suggesting that the DPP-A mechanism we describe may also be biologically plausible. This could be tested experimentally by exposing individuals (e.g., rodents or humans) to a task that requires consistent exposure to a subregion, and evaluating the distribution of activity over the grid cells. Our model predicts that high frequency grid cells should increase their firing rate more than low frequency cells, since the high frequency grid cells maximize the determinant of the covariance matrix of the grid cell embeddings. It is also worth noting that Frankland et al. [2] have suggested that the use of DPPs may also help explain a mutual exclusivity bias observed in human word learning and reasoning. While this is not direct evidence of biological plausibility, it is consistent with the idea that the human brain selects representations for processing that maximize the volume of the representational space, which can be achieved by maximizing the DPP-A objective function defined in Equation 6. We have added a comment to this effect in Section 6 of the updated manuscript.

      (1) Bozkurt, B., Pehlevan, C. and Erdogan, A., 2022. Biologically-plausible determinant maximization neural networks for blind separation of correlated sources. Advances in Neural Information Processing Systems, 35, pp.13704-13717.

      (2) Frankland, S. and Cohen, J., 2020. Determinantal Point Processes for Memory and Structured Inference. In CogSci.

      Simplicity of analogical problem and comparison to other models using this task

      First, we would like to point out that analogical reasoning is a signature feature of human cognition, which supports flexible and efficient adaptation to novel inputs that remains a challenge for most current neural network architectures. While humans can exhibit complex and sophisticated forms of analogical reasoning [1, 2, 3], here we focused on a relatively simple form inspired by Rumelhart’s parallelogram model of analogy [4,5], which has been used to explain traditional human verbal analogies (e.g., “king is to what as man is to woman?”). Our model, like that one, seeks to explain analogical reasoning in terms of the computation of simple Euclidean distances (i.e., A - B = C - D, where A, B, C, D are vectors in 2D space). We have now noted this in Section 2.1.1 of the updated manuscript. It is worth noting that, despite the seeming simplicity of this construction, we show that standard neural network architectures (e.g., LSTMs and transformers) struggle to generalize on such tasks without the use of the DPP-A mechanism.
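
      As an illustration of the parallelogram relation A - B = C - D described above, the following is a minimal sketch, not the authors' implementation. The function name `solve_analogy`, the example points, and the candidate answers are all hypothetical; it simply picks the candidate D whose position is closest (in Euclidean distance) to C - (A - B).

```python
import numpy as np

def solve_analogy(a, b, c, candidates):
    """Return the index of the candidate D that best satisfies A - B = C - D.

    Hypothetical helper for illustration: the ideal answer is C - (A - B),
    and we choose the candidate with the smallest Euclidean distance to it.
    """
    target = c - (a - b)
    distances = np.linalg.norm(candidates - target, axis=1)
    return int(np.argmin(distances))

# Hypothetical points in 2D integer space
a = np.array([3, 7])
b = np.array([1, 2])
c = np.array([10, 9])
candidates = np.array([[8, 4], [9, 5], [12, 14], [0, 0]])

best = solve_analogy(a, b, c, candidates)
print(best, candidates[best])
```

      Here A - B = (2, 5), so the candidate (8, 4) is selected because C - (8, 4) is also (2, 5). A trained network must learn this relation from examples rather than computing it directly as the sketch does.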

      Second, we are not aware of any previous work other than Frankland et al. [6] cited in the first paragraph of Section 2.2.1, that has examined the capacity of neural network architectures to perform even this simple form of analogy. The models in that study were hardcoded to perform analogical reasoning, whereas we trained models to learn to perform analogies. That said, clearly a useful line of future work would be to scale our model further to deal with more complex forms of representation and analogical reasoning tasks [1,2,3]. We have noted this in Section 6 of the updated manuscript.

      (1) Holyoak, K.J., 2012. Analogy and relational reasoning. The Oxford handbook of thinking and reasoning, pp.234-259.

      (2) Webb, T., Fu, S., Bihl, T., Holyoak, K.J. and Lu, H., 2023. Zero-shot visual reasoning through probabilistic analogical mapping. Nature Communications, 14(1), p.5144.

      (3) Lu, H., Ichien, N. and Holyoak, K.J., 2022. Probabilistic analogical mapping with semantic relation networks. Psychological review.

      (4) Rumelhart, D.E. and Abrahamson, A.A., 1973. A model for analogical reasoning. Cognitive Psychology, 5(1), pp.1-28.

      (5) Mikolov, T., Chen, K., Corrado, G. and Dean, J., 2013. Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781.

      (6) Frankland, S., Webb, T.W., Petrov, A.A., O'Reilly, R.C. and Cohen, J., 2019. Extracting and Utilizing Abstract, Structured Representations for Analogy. In CogSci (pp. 1766-1772).

      Clarification of DPP-A attentional modulation

      We would like to clarify several concerns regarding the DPP-A attentional modulation. First, we would like to make it clear that ω is not meant to correspond to synaptic weights, and thank the reviewer for noting the possibility for confusion on this point. It is also distinct from a biasing input, which is often added to the product of the input features and weights. Rather, in our model ω is a vector, and diag(ω) converts it into a matrix with ω on the diagonal and zeros elsewhere. In Equation 6, diag(ω) is matrix multiplied with the covariance matrix V, which results in elementwise multiplication of ω with the column vectors of V, so that ω acts more like a set of gates. We have noted this in Section 2.2.2 and have changed all instances of “weights (ω)” to “gates (ɡ)” in the updated manuscript. We have also rewritten the definition of Equation 6 and uses of it (as in Algorithm 1) to depict the use of a sigmoid nonlinearity (σ) applied to the gates, so that the resulting values are always between 0 and 1.
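
      The gating interpretation can be checked numerically. The sketch below is an illustration under assumptions, not a reproduction of Equation 6: the embeddings `X`, the raw gate parameters `g_raw`, and the array sizes are hypothetical. It shows only that multiplying by diag(g) is the same as multiplying each column of V elementwise by g, and that a sigmoid keeps the gates between 0 and 1.

```python
import numpy as np

rng = np.random.default_rng(0)
n_units = 4

# Hypothetical embeddings: 50 samples x 4 units
X = rng.normal(size=(50, n_units))
V = np.cov(X, rowvar=False)            # 4 x 4 covariance matrix

# Hypothetical unconstrained gate parameters, squashed by a sigmoid
g_raw = rng.normal(size=n_units)
g = 1.0 / (1.0 + np.exp(-g_raw))       # every gate lies strictly in (0, 1)

gated_matmul = np.diag(g) @ V          # matrix product with diag(g)
gated_elementwise = g[:, None] * V     # each column of V multiplied elementwise by g

same = np.allclose(gated_matmul, gated_elementwise)
print(same)
```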

      Second, we would like to clarify that we don’t compute the inner product between the gates ɡ and the grid cell embeddings x anywhere in our model. The gates within each frequency were optimized (independent of the task inputs), according to Equation 6, to compute the approximate maximum log determinant of the covariance matrix over the grid cell embeddings individually for each frequency. We then used the grid cell embeddings belonging to the frequency that had the maximum within-frequency log determinant for training the inference module, which always happened to be grid cells within the top three frequencies. Author response image 1 (also added to the Appendix, Section 7.10 of the updated manuscript) shows the approximate maximum log determinant (on the y-axis) for the different frequencies (on the x-axis).

      Author response image 1.

      Approximate maximum log determinant of the covariance matrix over the grid cell embeddings (y-axis) for each frequency (x-axis), obtained after maximizing Equation 6.

      Third, we would like to clarify our interpretation of why DPP-A identified grid cell embeddings corresponding to the highest spatial frequencies, and why this produced the best OOD generalization (i.e., extrapolation on our analogy tasks). It is because those grid cell embeddings exhibited greater variance over the training data than the lower frequency embeddings, while at the same time the correlations among those grid cell embeddings were lower than the correlations among the lower frequency grid cell embeddings. The determinant of the covariance matrix of the grid cell embeddings is maximized when the variances of the grid cell embeddings are high (they are “expressive”) and the correlation among the grid cell embeddings is low (they “cover the representational space”). As a result, the higher frequency grid cell embeddings more efficiently covered the representational space of the training data, allowing them to efficiently capture the same relational structure across training and test distributions which is required for OOD generalization. We have added some clarification to the second paragraph of Section 2.2.2 in the updated manuscript. Furthermore, to illustrate this graphically, Author response image 2 (added to the Appendix, Section 7.10 of the updated manuscript) shows the results after the summation of the multiplication of the grid cell embeddings over the 2d space of 1000x1000 locations, with their corresponding gates for 3 representative frequencies (left, middle and right panels showing results for the lowest, middle and highest grid cell frequencies, respectively, of the 9 used in the model), obtained after maximizing Equation 6 for each grid cell frequency. The color code indicates the responsiveness of the grid cells to different X and Y locations in the input space (lighter color corresponding to greater responsiveness). 
Note that the dark blue area (denoting regions of least responsiveness to any grid cell) is greatest for the lowest frequency and nearly zero for the highest frequency, illustrating that grid cell embeddings belonging to the highest frequency more efficiently cover the representational space which allows them to capture the same relational structure across training and test distributions as required for OOD generalization.
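
      The relationship between variance, correlation, and the determinant described above follows from the identity det(Σ) = (∏ σ_i²) · det(R), where R is the correlation matrix. The sketch below illustrates this with hypothetical equicorrelated covariance matrices; the helper name `logdet_cov` and the chosen values are assumptions for illustration and are not taken from the model.

```python
import numpy as np

def logdet_cov(variances, rho):
    """Log-determinant of a covariance matrix with the given variances
    and a single shared pairwise correlation rho (hypothetical setup)."""
    sd = np.sqrt(np.asarray(variances, dtype=float))
    n = len(sd)
    corr = np.full((n, n), float(rho))
    np.fill_diagonal(corr, 1.0)
    cov = corr * np.outer(sd, sd)
    _sign, logdet = np.linalg.slogdet(cov)
    return logdet

high_var_low_corr = logdet_cov([1.0, 1.0, 1.0], rho=0.1)
high_var_high_corr = logdet_cov([1.0, 1.0, 1.0], rho=0.9)
low_var_low_corr = logdet_cov([0.1, 0.1, 0.1], rho=0.1)

print(high_var_low_corr, high_var_high_corr, low_var_low_corr)
```

      In this toy setting, the log determinant is largest when variances are high and correlations are low; raising the correlation or shrinking the variances both reduce it.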

      Author response image 2.

      Each panel shows the results after summation of the multiplication of the grid cell embeddings over the 2d space of 1000x1000 locations, with their corresponding gates for a particular frequency, obtained after maximizing Equation 6 for each grid cell frequency. The left, middle, and right panels show results for the lowest, middle, and highest grid cell frequencies, respectively, of the 9 used in the model. Lighter color in each panel corresponds to greater responsiveness of grid cells at that particular location in the 2d space.

      Finally, we would like to clarify how the DPP-A attentional mechanism is different from the attentional mechanism in the transformer module, and why both are needed for strong OOD generalization. Use of the standard self-attention mechanism in transformers over the inputs (i.e., A, B, C, and D for the analogy task) in place of DPP-A would lead to weightings of grid cell embeddings over all frequencies and phases. The objective function for the DPP-A represents an inductive bias, that selectively assigns the greatest weight to all grid cell embeddings (i.e., for all phases) of the frequency for which the determinant of the covariance matrix is greatest computed over the training space. The transformer inference module then attends over the inputs with the selected grid cell embeddings based on the DPP-A objective. We have added a discussion of this point in Section 6 of the updated manuscript.

      We would like to thank the reviewers for their recommendations. We have tried our best to incorporate them into our updated manuscript. Below we provide a detailed response to each of the recommendations grouped for each reviewer.

      Reviewer #1 (Recommendations for the authors)

      (1) It would be helpful to see some equations for R in the main text.

      We thank the reviewer for this suggestion. We have now added some equations explaining the working of R in Section 2.2.3 of the updated manuscript.

      (2) Typo: p 11 'alongwith' -> 'along with'

      We have changed all instances of ‘alongwith’ to ‘along with’ in the updated manuscript.

      (3) Presumably, this is related to equivariant ML - it would be helpful to comment on this.

      Yes, this is related to equivariant ML, since the properties of equivariance hold for our model. Specifically, the probability distribution after applying softmax remains the same when the transformation (translation or scaling) is applied to the scores for each of the answer choices obtained from the output of the inference module, and when the same transformation is applied to the stimuli for the task and all the answer choices before presenting as input to the inference module to obtain the scores. We have commented on this in Section 2.2.3 of the updated manuscript.

      Reviewer #2 (Recommendations for the authors)

      (1) Page 2 - "Webb et al." temporal context - they should also cite and compare this to work by Marc Howard on generalization based on multi-scale temporal context.

      While we appreciate the important contributions that have been made by Marc Howard and his colleagues to temporal coding and its role in episodic memory and hippocampal function, we would like to clarify that his temporal context model is unrelated to the temporal context normalization developed by Webb et al. (2020) and mentioned on Page 2. The former (Temporal Context Model) is a computational model that proposes a role for temporal coding in the functions of the medial temporal lobe in support of episodic recall and spatial navigation. The latter (temporal context normalization) is a normalization procedure proposed for use in training a neural network, similar to batch normalization [1], in which tensor normalization is applied over the temporal instead of the batch dimension, which is shown to help with OOD generalization. We apologize for any confusion engendered by the similarity of these terms and for our failure to clarify the difference between them, which we have now attempted to do in a footnote on Page 2.

      (1) Ioffe, S. and Szegedy, C., 2015, June. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In International conference on machine learning (pp. 448-456). PMLR.
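
      To make the distinction between normalizing over the temporal dimension and over the batch dimension concrete, here is a simplified sketch. It assumes inputs of shape (batch, time, features) and omits the learnable scale and shift parameters that normalization layers usually include; the function names and the example data are hypothetical and do not reproduce the procedure of Webb et al. (2020) in detail.

```python
import numpy as np

def temporal_context_norm(x, eps=1e-8):
    # Assumed layout: (batch, time, features); statistics over the time axis
    mean = x.mean(axis=1, keepdims=True)
    std = x.std(axis=1, keepdims=True)
    return (x - mean) / (std + eps)

def batch_norm(x, eps=1e-8):
    # Same layout; statistics over the batch axis instead
    mean = x.mean(axis=0, keepdims=True)
    std = x.std(axis=0, keepdims=True)
    return (x - mean) / (std + eps)

rng = np.random.default_rng(0)
x = rng.normal(size=(8, 4, 3))

# Give every sequence in the batch a different constant offset
offsets = 100.0 * np.arange(8).reshape(8, 1, 1)
shifted = x + offsets

tcn_gap = np.abs(temporal_context_norm(x) - temporal_context_norm(shifted)).max()
bn_gap = np.abs(batch_norm(x) - batch_norm(shifted)).max()
print(tcn_gap, bn_gap)
```

      In this sketch, normalizing over time removes a per-sequence offset, whereas normalizing over the batch does not. This is one simple property that distinguishes the two choices of axis.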

      (2) page 3 - "known to be implemented in entorhinal" - It's odd that they seem to avoid citing the actual biology papers on grid cells. They should cite more of the grid cell recording papers when they mention the entorhinal cortex (i.e. Hafting et al., 2005; Barry et al., 2007; Stensola et al., 2012; Giocomo et al., 2011; Brandon et al., 2011).

      We have now cited the references mentioned below, on page 3 after the phrase “known to be implemented in entorhinal cortex”.

      (1) Barry, C., Hayman, R., Burgess, N. and Jeffery, K.J., 2007. Experience-dependent rescaling of entorhinal grids. Nature neuroscience, 10(6), pp.682-684.

      (2) Stensola, H., Stensola, T., Solstad, T., Frøland, K., Moser, M.B. and Moser, E.I., 2012. The entorhinal grid map is discretized. Nature, 492(7427), pp.72-78.

      (3) Giocomo, L.M., Hussaini, S.A., Zheng, F., Kandel, E.R., Moser, M.B. and Moser, E.I., 2011. Grid cells use HCN1 channels for spatial scaling. Cell, 147(5), pp.1159-1170.

      (4) Brandon, M.P., Bogaard, A.R., Libby, C.P., Connerney, M.A., Gupta, K. and Hasselmo, M.E., 2011. Reduction of theta rhythm dissociates grid cell spatial periodicity from directional tuning. Science, 332(6029), pp.595-599.

      (3) To enhance the connection to biological systems, they should cite more of the experimental and modeling work on grid cell coding (for example on page 2 where they mention relational coding by grid cells). Currently, they tend to cite studies of grid cell relational representations that are very indirect in their relationship to grid cell recordings (i.e. indirect fMRI measures by Constantinescu et al., 2016 or the very abstract models by Whittington et al., 2020). They should cite more papers on actual neurophysiological recordings of grid cells that suggest relational/metric representations, and they should cite more of the previous modeling papers that have addressed relational representations. This could include work on using grid cell relational coding to guide spatial behavior (e.g. Erdem and Hasselmo, 2014; Bush, Barry, Manson, Burgess, 2015). This could also include other papers on the grid cell code beyond the paper by Wei et al., 2015 - they could also cite work on the efficiency of coding by Sreenivasan and Fiete and by Mathis, Herz, and Stemmler.

      We thank the reviewer for bringing the additional references to our attention. We have cited the references mentioned below on page 2 of the updated manuscript.

      (1) Erdem, U.M. and Hasselmo, M.E., 2014. A biologically inspired hierarchical goal directed navigation model. Journal of Physiology-Paris, 108(1), pp.28-37.

      (2) Sreenivasan, S. and Fiete, I., 2011. Grid cells generate an analog error-correcting code for singularly precise neural computation. Nature neuroscience, 14(10), pp.1330-1337.

      (3) Mathis, A., Herz, A.V. and Stemmler, M., 2012. Optimal population codes for space: grid cells outperform place cells. Neural computation, 24(9), pp.2280-2317.

      (4) Bush, D., Barry, C., Manson, D. and Burgess, N., 2015. Using grid cells for navigation. Neuron, 87(3), pp.507-520.

      (4) Page 3 - "Determinantal Point Processes (DPPs)" - it is rather annoying that DPP is defined after DPP-A is defined. There ought to be a spot where the definition of DPP-A is clearly stated in a single location.

      We agree it makes more sense to define Determinantal Point Process (DPP) before DPP-A. We have now rephrased the sentences accordingly. In the “Abstract”, the sentence now reads “Second, we propose an attentional mechanism that operates over the grid cell code using Determinantal Point Process (DPP), which we call DPP attention (DPP-A) - a transformation that ensures maximum sparseness in the coverage of that space.” We have also modified the second paragraph of the “Introduction”. The modified portion now reads “b) an attentional objective inspired from Determinantal Point Processes (DPPs), which are probabilistic models of repulsion arising in quantum physics [1], to attend to abstract representations that have maximum variance and minimum correlation among them, over the training data. We refer to this as DPP attention or DPP-A.” Due to this change, we removed the last sentence of the fifth paragraph of the “Introduction”.

      (1) Macchi, O., 1975. The coincidence approach to stochastic point processes. Advances in Applied Probability, 7(1), pp.83-122.

      (5) Page 3 - "the inference module R" - there should be some discussion about how this component using LSTM or transformers could relate to the function of actual brain regions interacting with entorhinal cortex. Or if there is no biological connection, they should state that this is not seen as a biological model and that only the grid cell code is considered biological.

      While we agree that the model is not intended to be specific about the implementation of the R module, we assume that, as a standard deep learning component, it is likely to map onto neocortical structures that interact with the entorhinal cortex and, in particular, regions of the prefrontal-posterior parietal network widely believed to be involved in abstract relational processes [1,2,3,4]. In particular, the role of the prefrontal cortex in the encoding and active maintenance of abstract information needed for task performance (such as rules and relations) has often been modeled using gated recurrent networks, such as LSTMs [5,6], and the posterior parietal cortex has long been known to support “maps” that may provide an important substrate for computing complex relations [4]. We have added some discussion about this in Section 2.2.3 of the updated manuscript.

      (1) Waltz, J.A., Knowlton, B.J., Holyoak, K.J., Boone, K.B., Mishkin, F.S., de Menezes Santos, M., Thomas, C.R. and Miller, B.L., 1999. A system for relational reasoning in human prefrontal cortex. Psychological science, 10(2), pp.119-125.

      (2) Christoff, K., Prabhakaran, V., Dorfman, J., Zhao, Z., Kroger, J.K., Holyoak, K.J. and Gabrieli, J.D., 2001. Rostrolateral prefrontal cortex involvement in relational integration during reasoning. Neuroimage, 14(5), pp.1136-1149.

      (3) Knowlton, B.J., Morrison, R.G., Hummel, J.E. and Holyoak, K.J., 2012. A neurocomputational system for relational reasoning. Trends in cognitive sciences, 16(7), pp.373-381.

      (4) Summerfield, C., Luyckx, F. and Sheahan, H., 2020. Structure learning and the posterior parietal cortex. Progress in neurobiology, 184, p.101717.

      (5) Frank, M.J., Loughry, B. and O’Reilly, R.C., 2001. Interactions between frontal cortex and basal ganglia in working memory: a computational model. Cognitive, Affective, & Behavioral Neuroscience, 1, pp.137-160.

      (6) Braver, T.S. and Cohen, J.D., 2000. On the control of control: The role of dopamine in regulating prefrontal function and working memory. Control of cognitive processes: Attention and performance XVIII, (2000).

      (6) Page 4 - "Learned weighting w" - it is somewhat confusing to use "w" as that is commonly used for synaptic weights, whereas I understand this to be an attentional modulation vector with the same dimensionality as the grid cell code. It seems more similar to a neural network bias input than a weight matrix.

      We refer to the first paragraph of our response above to the topic “Clarification of DPP-A attentional modulation” under “Major comments (Public Reviews)”, which contains our response to this issue.

      (7) Page 4 - "parameterization of w... by two loss functions over the training set." - I realize that this has been stated here, but to emphasize the significance to a naïve reader, I think they should emphasize that the learning is entirely focused on the initial training space, and there is NO training done in the test spaces. It's very impressive that the parameterization is allowing generalization to translated or scaled spaces without requiring ANY training on the translated or scaled spaces.

      We have added the sentence “Note that learning of parameter occurs only over the training space and is not further modified during testing (i.e. over the test spaces)” to the updated manuscript.

      (8) Page 4 - "The first," - This should be specific - "The first loss function"

      We have changed it to “The first loss function” in the updated manuscript.

      (9) Page 4 - The analogy task seems rather simplistic when first presented (i.e. just a spatial translation to different parts of a space, which has already been shown to work in simulations of spatial behavior such as Erdem and Hasselmo, 2014 or Bush, Barry, Manson, Burgess, 2015). To make the connection to analogy, they might provide a brief mention of how this relates to the analogy space created by word2vec applied to traditional human verbal analogies (i.e. king-man+woman=queen).

      We agree that the analogy task is simple, and recognize that grid cells can be used to navigate to different parts of space over which the test analogies are defined when those are explicitly specified, as shown by Erdem and Hasselmo (2014) and Bush, Barry, Manson, and Burgess (2015). However, for the analogy task, the appropriate set of grid cell embeddings must be identified that capture the same relational structure between training and test analogies to demonstrate strong OOD generalization, and that is achieved by the attentional mechanism DPP-A. As suggested by the reviewer’s comment, our analogy task is inspired by Rumelhart’s parallelogram model of analogy [1,2] (and therefore similar to traditional human verbal analogies) in as much as it involves differences (i.e A - B = C - D, where A, B, C, D are vectors in 2D space). We have now noted this in Section 2.1.1 of the updated manuscript.

      (1) Rumelhart, D.E. and Abrahamson, A.A., 1973. A model for analogical reasoning. Cognitive Psychology, 5(1), pp.1-28.

      (2) Mikolov, T., Chen, K., Corrado, G. and Dean, J., 2013. Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781.

      (10) Page 5 - The variable "KM" is a bit confusing when it first appears. It would be good to re-iterate that K and M are separate points and KM is the vector between these points.

      We apologize for the confusion on this point. KM is meant to refer to an integer value, obtained by multiplying K and M, which is added to both dimensions of A, B, C and D, which are points in ℤ², to translate them to a different region of the space. K is an integer value ranging from 1 to 9 and M is also an integer value denoting the size of the training region, which in our implementation is 100. We have clarified this in Section 2.1.1 of the updated manuscript.
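
      The translation by KM can be sketched as follows. This is an illustrative example only: it assumes the training region spans coordinates 0 to M - 1 in each dimension, and the function name `translate` and the sampled points are hypothetical. The values M = 100 and K between 1 and 9 are taken from the description above.

```python
import numpy as np

M = 100  # size of the training region, as stated in the response

def translate(points, K, M=M):
    """Add the integer K * M to both dimensions of every point."""
    return points + K * M

rng = np.random.default_rng(0)
# Hypothetical A, B, C, D drawn from an assumed training region [0, M)
train_points = rng.integers(0, M, size=(4, 2))
test_points = translate(train_points, K=3)

# The difference A - B is unchanged by the translation
train_diff = train_points[0] - train_points[1]
test_diff = test_points[0] - test_points[1]
print(train_diff, test_diff)
```

      Because the same offset is added to every point, the relation between the points is preserved while their absolute positions fall entirely outside the training region.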

      (11) Page 5 - "two continuous dimensions (Constantinescu et al._)" - this ought to give credit to the original study showing the abstract six-fold rotational symmetry for spatial coding (Doeller, Barry and Burgess).

      We have now cited the original work by Doeller et al. [1] along with Constantinescu et al. (2016) in the updated manuscript after the phrase “two continuous dimensions” on page 5.

      (1) Doeller, C.F., Barry, C. and Burgess, N., 2010. Evidence for grid cells in a human memory network. Nature, 463(7281), pp.657-661.

      (12) Page 6 - Np=100. This is done later, but it would be clearer if they right away stated that Np*Nf=900 in this first presentation.

      We have now added this sentence after Np=100. “Hence Np*Nf=900, which denotes the number of grid cells.”

      (13) Page 6 - They provide theorem 2.1 on the determinant of the covariance matrix of the grid code, but they ought to cite this the first time this is mentioned.

      We have cited Gillenwater et al. (2012) before mentioning theorem 2.1. The sentence just before that reads “We use the following theorem from Gillenwater et al. (2012) to construct :”

      (14) Page 6 - It would greatly enhance the impact of the paper if they could give neuroscientists some sense of how the maximization of the determinant of the covariance matrix of the grid cell code could be implemented by a biological circuit. OR at least to show an example of the output of this algorithm when it is used as an inner product with the grid cell code. This would require plotting the grid cell code in the spatial domain rather than the 900 element vector.

      We refer to our response above to the topic “Biological plausibility of DPP-A” and second, third, and fourth paragraphs of our response above to the topic “Clarification of DPP-A attentional modulation” under “Major comments (Public Reviews)”, which contain our responses to this issue.

      (15) Page 6 - "That encode higher spatial frequencies..." This seems intuitive, but it would be nice to give a more intuitive description of how this is related to the determinant of the covariance matrix.

      We refer to the third paragraph of our response above to the topic “Clarification of DPP-A attentional modulation” under “Major comments (Public Reviews)”, which contains our response to this issue.

      (16) Page 7 - log of both sides... Nf is number of frequencies... Would be good to mention here that they are referring to equation 6 which is only mentioned later in the paragraph.

      As suggested, we now refer to Equation 6 in the updated manuscript. The sentence now reads “This is achieved by maximizing the determinant of the covariance matrix over the within frequency grid cell embeddings of the training data, and Equation 6 is obtained by applying the log on both sides of Theorem 2.1, and in our case where refers to grid cells of a particular frequency.”

      (17) Page 7 - Equation 6 - They should discuss how this is proposed to be implemented in brain circuits.

      We refer to our response above to the topic “Biological plausibility of DPP-A” under “Major comments (Public Reviews)”, which contains our response to this issue.

      (18) Page 9 - "egeneralize" - presumably this is a typo?

      Yes. We have corrected it to “generalize” in the updated manuscript.

      (19) Page 9 - "biologically plausible encoding scheme" - This is valid for the grid cell code, but they should be clear that this is not valid for other parts of the model, or specify how other parts of the model such as DPP-A could be biologically plausible.

      We refer to our response above to the topic “Biological plausibility of DPP-A” under “Major comments (Public Reviews)”, which contains our response to this issue.

      (20) Page 12 - Figure 7 - comparison to one-hots or smoothed one-hots. The text should indicate whether the smoothed one-hots are similar to place cell coding. This is the most relevant comparison of coding for those knowledgeable about biological coding schemes.

      Yes, smoothed one-hots are similar to place cell coding. We now mention this in Section 5.3 of the updated manuscript.

      (21) Page 12 - They could compare to a broader range of potential biological coding schemes for the overall space. This could include using coding based on the boundary vector cell coding of the space, band cell coding (one dimensional input to grid cells), or egocentric boundary cell coding.

      We appreciate these useful suggestions, which we now mention as potentially valuable directions for future work in the second paragraph of Section 6 of the updated manuscript.

      (22) Page 13 - "transformers are particularly instructive" - They mention this as a useful comparison, but they might discuss further why a much better function is obtained when attention is applied to the system twice (once by DPP-A and then by a transformer in the inference module).

      We refer to the last paragraph of our response above to the topic “Clarification of DPP-A attentional modulation” under “Major comments (Public Reviews)”, which contains our response to this issue.

      (23) Page 13 - "Section 5.1 for analogy and Section 5.2 for arithmetic" - it would be clearer if they perhaps also mentioned the specific figures (Figure 4 and Figure 6) presenting the results for the transformer rather than the LSTM.

      We have now rephrased to also refer to the figures in the updated manuscript. The phrase now reads “a transformer (Figure 4 in Section 5.1 for analogy and Figure 6 in Section 5.2 for arithmetic tasks) failed to achieve the same level of OOD generalization as the network that used DPP-A.”

      (24) Page 14 - "statistics of the training data" - The most exciting feature of this paper is that learning during the training space analogies can so effectively generalize to other spaces based on the right attention DPP-A, but this is not really made intuitive. Again, they should illustrate the result of the xT w inner product to demonstrate why this works so effectively!

      We refer to the second, third, and fourth paragraphs of our response above to the topic “Clarification of DPP-A attentional modulation” under “Major comments (Public Reviews)”, which contain our response to this issue.

      (25) Bibliography - Silver et al., go paper - journal name "nature" should be capitalized. There are other journal titles that should be capitalized. Also, I believe eLife lists family names first.

      We have made the changes to the bibliography of the updated manuscript suggested by the reviewer.

    1. Author response:

      The following is the authors’ response to the original reviews.

      We thank the editors and the reviewers for their time and constructive comments, which helped us to improve our manuscript “The Hungry Lens: Hunger Shifts Attention and Attribute Weighting in Dietary Choice” substantially. In the following we address the comments in depth:

      R1.1: First, in examining some of the model fits in the supplements, e.g. Figures S9, S10, S12, S13, it looks like the "taste weight" parameter is being constrained below 1. Theoretically, I understand why the authors imposed this constraint, but it might be unfairly penalizing these models. In theory, the taste weight could go above 1 if participants had a negative weight on health. This might occur if there is a negative correlation between attractiveness and health and the taste ratings do not completely account for attractiveness. I would recommend eliminating this constraint on the taste weight.

      We appreciate the reviewer’s suggestion to test a multi-attribute attentional drift-diffusion model (maaDDM) that does not constrain the taste and health weights to the range of 0 and 1. We tested two versions of such a model. First, we removed the phi-transformation, allowing the weight to take on any value (see Author response image 1). The results closely matched those found in the original model. Partially consistent with the reviewer’s comment, the health weight became slightly negative in some individuals in the hungry condition. However, this model had convergence issues with a maximal Rhat of 4.302. Therefore, we decided to run a second model in which we constrained the weights to be between -1 and 2. Again, we obtained effects that matched the ones found in the original model (see Author response image 2), but again we had convergence issues. These convergence issues could arise from the fact that the models become almost unidentifiable when both attention parameters (theta and phi) as well as the weight parameters are unconstrained.

      Author response image 1.

      Author response image 2.
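
      For readers unfamiliar with the Rhat diagnostic mentioned above, the following is a minimal sketch of the classic (non-split) Gelman-Rubin statistic, which compares between-chain and within-chain variance. It is an illustration only, not the authors' fitting code: the function name `gelman_rubin_rhat` and the simulated chains are assumptions, and modern samplers typically report a refined (split, rank-normalized) variant.

```python
import random
from statistics import mean, variance


def gelman_rubin_rhat(chains):
    """Classic potential scale reduction factor for equal-length chains.

    Hypothetical helper for illustration; `chains` is a list of lists of draws.
    """
    n = len(chains[0])                               # draws per chain
    chain_means = [mean(c) for c in chains]
    within = mean(variance(c) for c in chains)       # W: mean within-chain variance
    between = n * variance(chain_means)              # B: between-chain variance
    pooled = (n - 1) / n * within + between / n      # pooled variance estimate
    return (pooled / within) ** 0.5


rng = random.Random(1)
# Four chains sampling the same distribution: Rhat should be close to 1.
mixed = [[rng.gauss(0.0, 1.0) for _ in range(500)] for _ in range(4)]
# Four chains stuck around different values: Rhat is far above 1.
stuck = [[rng.gauss(mu, 1.0) for _ in range(500)] for mu in (-3.0, 0.0, 3.0, 6.0)]

print(round(gelman_rubin_rhat(mixed), 3))
print(round(gelman_rubin_rhat(stuck), 3))
```

      Values near 1 indicate that the chains agree; values well above 1, as reported for the unconstrained models, indicate that the chains have not converged to a common distribution.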

      R1.2: Second, I'm not sure about the mediation model. Why should hunger change the dwell time on the chosen item? Shouldn't this model instead focus on the dwell time on the tasty option?

      We thank the reviewer for spotting this inconsistency. In our GLMMs and the mediation model, we indeed used the proportion of dwell time on the tasty option as predictors and mediator, respectively. The naming and description of this variable was inconsistent in our manuscript and the supplements. We have now rephrased both consistently.

      R1.3: Third, while I do appreciate the within-participant design, it does raise a small concern about potential demand effects. I think the authors' results would be more compelling if they replicated when only analyzing the first session from each participant. Along similar lines, it would be useful to know whether there was any effect of order.

      R3.2: On the interpretation side, previous work has shown that beliefs about the nourishing and hunger-killing effectiveness of drinks or substances influence subjective and objective markers of hunger, including value-based dietary decision-making, and attentional mechanisms approximated by computational models and the activation of cognitive control regions in the brain. The present study shows differences between the protein shake and a natural history condition (fasted, state). This experimental design, however, cannot rule between alternative interpretations of observed effects. Notably, effects could be due to (a) the drink's active, nourishing ingredients, (b) consuming a drink versus nothing, or (c) both. […]

      R3 Recommendation 1:

      Therefore, I recommend discussing potential confounds due to expectancy or placebo effects on hunger ratings, dietary decision-making, and attention. […] What were verbatim instructions given to the participants about the protein shake and the fasted, hungry condition? Did participants have full knowledge about the study goals (e.g. testing hunger versus satiation)? Adding the instructions to the supplement is insightful for fully harnessing the experimental design and frame.

      Both Reviewer 1 and Reviewer 3 raise potential demand/expectancy effects, which we addressed in several ways. First, we have translated and added participants’ instructions to the supplements SOM 6, in which we transparently communicate the two conditions to the participants. Second, we have added a paragraph in the discussion section addressing potential expectancy/demand effects in our design:

      “The present results and supplementary analyses clearly support the two-fold effect of hunger state on the cognitive mechanisms underlying choice. However, we acknowledge potential demand effects arising from the within-subject Protein-shake manipulation. A recent study (Khalid et al., 2024) showed that labeling water to decrease or increase hunger affected participants’ subsequent hunger ratings and food valuations. For instance, participants expecting the water to decrease hunger showed less wanting for food items. DDM modeling suggested that this placebo manipulation affected both drift rate and starting point. The absence of a starting point effect in our data speaks against any prior bias in participants due to any demand effects. Yet, we cannot rule out that such effects affected the decision-making process, for example by increasing the taste weight (and thus the drift rate) in the hungry condition.”

      Third, we followed Reviewer 1’s suggestion and tested whether the order of testing affected the results. We did so by adding “order” to the main choice and response time (RT) GLMMs. We found an effect of order neither on choice (β_order=-0.001, SE=0.163, p<.995) nor on RT (β_order=0.106, SE=0.205, p<.603), and the original effects remain stable (see Author response table 1a and Author response table 2a below). Further, we used two ANOVAs to compare models with and without the predictor “order”. The ANOVAs indicated that GLMMs without “order” better explained choice and RT (see Author response table 1b and Author response table 2b). Taken together, these results suggest that demand effects played a negligible role in our study.

      Author response table 1.

      a) GLMM: Results of Tasty vs Healthy Choice Given Condition, Attention and Order

      Note. p-values were calculated using Satterthwaite’s approximation. Model equation: choice ~ condition + scale(rel_taste_DT) + order + (1+condition|subject); rel_taste_DT refers to the relative dwell time on the tasty option; order with hungry/sated as the reference.

      b) Model Comparison

      Author response table 2.

      a) GLMM: Response Time Given Condition, Choice, Attention and Order

      Note. p-values were calculated using Satterthwaite’s approximation. Model equation: RT ~ choice + condition + scale(rel_taste_DT) + order + choice * scale(rel_taste_DT) + (1+condition|subject); rel_taste_DT refers to the relative dwell time on the tasty option; order with hungry/sated as the reference.

      b) Model Comparison

      R1.4: Fourth, the authors report that tasty choices are faster. Is this a systematic effect, or simply due to the fact that tasty options were generally more attractive? To put this in the context of the DDM, was there a constant in the drift rate, and did this constant favor the tasty option?

      We thank the reviewer for their observant remark about faster tasty choices and potential links to the drift rate. While our starting point models show that there might be a small starting point bias towards the taste boundary, which would result in faster tasty decisions, we took a closer look at the simulated value differences as obtained in our posterior predictive checks to see if the drift rate was systematically more extreme for tasty choices (Author response image 3). In line with the reviewer’s suggestion that tasty options were generally more attractive, tasty decisions were associated with higher value differences (i.e., further away from 0) and consequently with faster decisions. This indicates that the main reason for faster tasty choices was a higher drift rate in those trials (as a consequence of the combination of attribute weights and attribute values rather than “a constant in the drift rate”), whereas a strong starting point bias played only a minor role.

      Author response image 3.

      Note. Value Difference as obtained from Posterior Predictive Checks of the maaDDM2𝜙 in hungry and sated condition for healthy (green) and tasty (orange) choices.
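
      The link between more extreme value differences and faster decisions can be illustrated with a deliberately simplified drift-diffusion simulation. This is a sketch under stated assumptions, not the maaDDM fitted in the paper: it omits the attentional discounting parameters (theta and phi), starting point bias and non-decision time, and the function names, weights and parameter values are hypothetical.

```python
import random


def drift_rate(w_taste, taste_diff, w_health, health_diff):
    """Drift as a weighted sum of attribute value differences (simplified)."""
    return w_taste * taste_diff + w_health * health_diff


def simulate_ddm(drift, boundary=1.0, noise=1.0, dt=0.005, max_t=10.0, rng=None):
    """Simulate one trial; returns (decision_time, hit_upper_boundary)."""
    if rng is None:
        rng = random.Random()
    x, t = 0.0, 0.0                      # unbiased starting point
    sd = noise * dt ** 0.5               # noise scales with sqrt(dt)
    while abs(x) < boundary and t < max_t:
        x += drift * dt + rng.gauss(0.0, sd)
        t += dt
    return t, x >= boundary              # timed-out trials count as not upper


def mean_rt(drift, n_trials=300, seed=0):
    """Average decision time over simulated trials with a fixed seed."""
    rng = random.Random(seed)
    return sum(simulate_ddm(drift, rng=rng)[0] for _ in range(n_trials)) / n_trials


weak = drift_rate(0.8, 0.5, 0.2, 0.5)      # small value difference
strong = drift_rate(0.8, 2.0, 0.2, 2.0)    # large value difference
print(mean_rt(weak), mean_rt(strong))
```

      In this toy setting, trials with a larger absolute drift reach a boundary sooner on average, which is the pattern described above for tasty choices, without requiring a constant term in the drift rate.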

      R1.5: Fifth, I wonder about the mtDDM. What are the units on the "starting time" parameters? Seconds? These seem like minuscule effects. Do they align with the eye-tracking data? In other words, which attributes did participants look at first? Was there a correlation between the first fixations and the relative starting times? If not, does that cast doubt on the mtDDM fits? Did the authors do any parameter recovery exercises on the mtDDM?

      We thank Reviewer 1 for their observant remarks about the mtDDM. In line with their suggestion, we have performed a parameter recovery which led to a good recovery of all parameters except relative starting time (rst). In addition, we had convergence issues of rst as revealed by parameter Rhats around 20. Together these results indicate potential limitations of the mtDDM when applied to tasks with substantially different visual representations of attributes leading to differences in dwell time for each attribute (see Figure 3b and Figure S6b). We have therefore decided not to report the mtDDM in the main paper, only leaving a remark about convergence and recovery issues.

      R2: My main criticism, which doesn't affect the underlying results, is that the labeling of food choices as being taste- or health-driven is misleading. Participants were not cued to select health vs taste. Studies in which people were cued to select for taste vs health exist (and are cited here). Also, the label "healthy" is misleading, as here it seems to be strongly related to caloric density. A high-calorie food is not intrinsically unhealthy (even if people rate it as such). The suggestion that hunger impairs making healthy decisions is not quite the correct interpretation of the results here (even though everyone knows it to be true). Another interpretation is that hungry people in negative calorie balance simply prefer more calories.

      First, we agree with the reviewer that it should be tested to what extent participants’ choice behavior can be reduced to contrasting taste vs. health aspects of their dietary decisions (but note that prior to making decisions, they were asked to rate these aspects and thus likely primed to consider them in the choice task). Having this question in mind, we performed several analyses to demonstrate the suitability of framing decisions as contrasting taste vs. health aspects (including the PCA reported in the Supplemental Material).

      Second, we agree with the reviewer in that despite a negative correlation (Author response image 4) between caloric density and health, high-caloric items are not intrinsically unhealthy. This may apply only to two stimuli in our study (nuts and dried fruit), which our participants also recognized as such.

      Finally, Reviewer 2’s alternative explanation, that hungry individuals prefer more calories, is tested in SOM5. In line with the reviewer’s interpretation, we show that hungry individuals indeed are more likely to select higher caloric options. This effect is even stronger than the effect of hunger state on tasty vs healthy choice. However, in this paper we were interested in the effect of hunger state on tasty vs healthy decisions, a contrast that is often used in modeling studies (e.g., Barakchian et al., 2021; Maier et al., 2020; Rramani et al., 2020; Sullivan & Huettel, 2021). In sum, we agree with Reviewer 2 in all aspects and have tested and provided evidence for their interpretation, which we do not see to stand in conflict with ours.

      Author response image 4.

      Note. Strong negative correlation between health ratings and objective caloric content in both hungry (r=-.732, t(64)=-8.589, p<.001) and sated condition (r=-.731, t(64)=-8.569, p<.001).
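
      The reported t statistics follow from the correlation coefficients through the standard relation t = r * sqrt(df / (1 - r^2)) with df = n - 2. The sketch below reproduces them approximately, assuming the correlations were computed across the 66 food items (consistent with df = 64); small discrepancies are expected because the published r values are rounded. The helper name `t_from_r` is illustrative.

```python
from math import sqrt


def t_from_r(r, n):
    """t statistic and degrees of freedom for a Pearson correlation r over n pairs."""
    df = n - 2
    return r * sqrt(df / (1.0 - r * r)), df


# n = 66 is an assumption (one data point per food image).
t_hungry, df = t_from_r(-0.732, 66)
t_sated, _ = t_from_r(-0.731, 66)
print(df, round(t_hungry, 2), round(t_sated, 2))
```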

      R3.1: On the positioning side, it does not seem like a 'bad' decision to replenish energy states when hungry by preferring tastier, more often caloric options. In this sense, it is unclear whether the observed behavior in the fasted state is a fallacy or a response to signals from the body. The introduction does mention these two aspects of preferring more caloric food when hungry. However, some ambiguity remains about whether the study results indeed reflect suboptimal choice behavior or a healthy adaptive behavior to restore energy stores.

      We thank Reviewer 3 for this remark, which encouraged us to interpret the results also from a slightly different perspective. We agree that choosing tasty over healthy options under hunger may be evolutionarily adaptive. We have now extended a paragraph in our discussion linking the cognitive mechanisms to neurobiological mechanisms:

      “From a neurobiological perspective, both homeostatic and hedonic mechanisms drive eating behaviour. While homeostatic mechanisms regulate eating behaviour based on energy needs, hedonic mechanisms operate independent of caloric deficit (Alonso-Alonso et al., 2015; Lowe & Butryn, 2007; Saper et al., 2002). Participants’ preference for tasty high caloric food options in the hungry condition aligns with a drive for energy restoration and could thus be taken as an adaptive response to signals from the body. On the other hand, our data shows that participants preferred less healthy options also in the sated condition. Here, hedonic drivers could predominate indicating potentially maladaptive decision-making that could lead to adverse health outcomes if sustained. Notably, our modeling analyses indicated that participants in the sated condition showed reduced attentional discounting of health information, which poses potential for attention-based intervention strategies to counter hedonic hunger. This has been investigated for example in behavioral (Barakchian et al., 2021; Bucher et al., 2016; Cheung et al., 2017; Sullivan & Huettel, 2021), eye-tracking (Schomaker et al., 2022; Vriens et al., 2020) and neuroimaging studies (Hare et al., 2011; Hutcherson & Tusche, 2022) showing that focusing attention on health aspects increased healthy choice. For example, Hutcherson and Tusche (2022) compellingly demonstrated that the mechanism through which health cues enhance healthy choice is shaped by increased value computations in the dorsolateral prefrontal cortex (dlPFC) when cue and choice are conflicting (i.e., health cue, tasty choice). In the context of hunger, these findings together with our analyses suggest that drawing people’s attention towards health information will promote healthy choice by mitigating the increased attentional discounting of such information in the presence of tempting food stimuli.”

      Recommendations for the authors:

      R1: The Results section needs to start with a brief description of the task. Otherwise, the subsequent text is difficult to understand.

      We included a paragraph at the beginning of the results section briefly describing the experimental design.

      R1/R2: In Figure 1a it might help the reader to have a translation of the rating scales in the figure legend.

      We have implemented an English rating scale in Figure 1a.

      R2: Were the ratings redone at each session? E.g. were all tastiness ratings for the sated session made while sated? This is relevant as one would expect the ratings of tastiness and wanting to be affected by the current fed state.

      The ratings were done at the respective sessions. As shown in S3a there is a high correlation of taste ratings across conditions. We decided to take the ratings of the respective sessions (rather than mean ratings across sessions) to define choice and taste/health value in the modeling analyses, for several reasons. First, by using mean ratings we might underestimate the impact of particularly high or low ratings that drove choice in the specific session (regression to the mean). Second, for the modeling analysis in particular, we want to model a decision-making process at a particular moment in time. Consequently, the subjective preferences in that moment are more accurate than mean preferences.

      R2: It would be helpful to have a diagram of the DDM showing the drifting information to the boundary, and the key parameters of the model (i.e. showing the nDT, drift rate, boundary, and other parameters). (Although it might be tricky to depict all 9 models).

      We thank the reviewer for their recommendation and have created Figure 6, which illustrates the decision-making process as depicted by the maaDDM2phi.

      R3.1: Past work has shown that prior preferences can bias/determine choices. This effect might have played a role during the choice task, which followed wanting, taste, health, and calorie ratings during which participants might have already formed their preferences. What are the authors' positions on such potential confound? How were the food images paired for the choice task in more detail?

      The data reported here were part of a larger experiment. Next to the food rating and choice task, participants also completed a social preference rating and choice task, as well as rating and choice tasks for intertemporal discounting. These tasks were counterbalanced such that first the three rating tasks were completed in counterbalanced order and second the three choice tasks were completed in the same order (e.g. food rating, social rating, intertemporal rating; food choice, social choice, intertemporal choice). This means that there were always two other tasks between the food rating and food choice task. In addition to the temporal delay between rating and choice tasks, our modeling analyses revealed that models including a starting point bias performed worse than those without the bias. Although we cannot rule out that participants might occasionally have tried to make their decision before the actual task (e.g., by keeping their most/least preferred option in mind and then automatically choosing/rejecting it in the choice task), we think that both our design as well as our modeling analyses speak against any systematic bias of preference in our choice task. The options were paired such that approximately half of the trials were random, while for the other half one option was rated healthier and the other option was rated tastier (e.g., Sullivan & Huettel, 2021).

      R3.2: In line with this thought, theoretically, the DDMs could also be fitted to reaction times and wanting ratings (binarized). This could be an excellent addition to corroborate the findings for choice behavior.

      We have implemented several alternative modeling analyses, including taste vs health as defined by Nutri-Score (Table S12 and Figures S22-S30) and higher wanted choice vs healthy choice (Table S13; Figure S30-34). Indeed, these models corroborate those reported in the main text demonstrating the robustness of our findings.

      R3.3: The principal component analysis was a good strategy for reducing the attribute space (taste, health, wanting, calories, Nutriscore, objective calories) into two components. Still, somehow, this part of the results added confusion to harnessing in which of the analyses the health attribute corresponded only to the healthiness ratings and taste to the tastiness ratings and if and when the components were used as attributes. This source of confusion could be mitigated by more clearly stating what health and taste corresponded to in each of the analyses.

      We thank the reviewer for this recommendation and have now reported the PCA before reporting the behavioural results to clarify that choices are binarized based on participants’ taste and health ratings, rather than the composite scores. We have chosen this approach, as it is closer to our hypotheses and improves interpretability.

      R3.4: From the methods, it seems that 66 food images were used, and 39 fell into A, B, C, and D Nutriscores. How were the remaining 27 images selected, and how healthy and tasty were the food stimuli overall?

      The selection of food stimuli was done in three steps: First, from Charbonnier and colleagues’ (2016) standardized food image database (available at osf.io/cx7tp/) we excluded food items that were not familiar in Germany/unavailable in regular German supermarkets. Second, we excluded products that we would not be able to incentivize easily (i.e., fastfood, pastries and items that required cooking/baking/other types of preparation). Third, we added the Nutri-Scores to the remaining products aiming to have an equal number of items for each Nutri-Score, of which approximately half of the items were sweet and the other half savory. This resulted in a final stimuli-set of 66 food images (13 items = A; 13 items = B; 12 items = C; 14 items = D; 14 items = E). The experiment, including the set of food stimuli used in our study, is also uploaded here: osf.io/pef9t/. With respect to the second question, we would like to point out that preference of food stimuli is very individual, therefore we obtained the ratings (taste, health, wanting and estimated caloric density) of each participant individually. However, we also added the objective total calories, which is positively correlated with subjective caloric density and negatively correlated with Nutri-Score (coded as A=5; B=4; C=3; D=2; E=1) and health ratings (see Figure S7).
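
      The numeric coding of the Nutri-Score and the composition of the stimulus set described above can be written down compactly. The snippet is only an illustration of that coding; the variable and function names are hypothetical and do not come from the authors' analysis scripts.

```python
# Numeric coding stated in the response: higher values mean a better Nutri-Score.
NUTRI_SCORE_CODE = {"A": 5, "B": 4, "C": 3, "D": 2, "E": 1}

# Number of food images per Nutri-Score grade in the final stimulus set.
ITEMS_PER_GRADE = {"A": 13, "B": 13, "C": 12, "D": 14, "E": 14}


def encode(grades):
    """Map a sequence of Nutri-Score letters to their numeric codes."""
    return [NUTRI_SCORE_CODE[g] for g in grades]


total_items = sum(ITEMS_PER_GRADE.values())
print(total_items, encode(["A", "E", "C"]))
```

      With this coding, a negative correlation between objective calories and the numeric Nutri-Score means that higher-calorie items tend to receive worse grades.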

      R3.5: It seems that the degrees of freedom for the paired t-test comparing the effects of the condition hungry versus satiated on hunger ratings were 63, although the participant sample counted 70. Please verify.

      This is correct and explained in the methods section under data analysis: “Due to missing values for one timepoint in six participants (these participants did not fill in the VAS and PANAS before the administration of the Protein Shake in the sated condition) the analyses of the hunger state manipulation had a sample size of 64.”
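
      The arithmetic behind the reported degrees of freedom can be made explicit: a paired t-test on n complete pairs has n - 1 degrees of freedom. The helper below is a hypothetical illustration using the numbers given in the response.

```python
def paired_t_df(n_participants, n_incomplete):
    """Sample size and degrees of freedom after dropping incomplete pairs."""
    n_complete = n_participants - n_incomplete
    return n_complete, n_complete - 1


# 70 participants, 6 of whom lack one timepoint in the sated condition.
n_complete, df = paired_t_df(70, 6)
print(n_complete, df)
```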

      R3.5: Please add the range of BMI and age of participants. Did all participants fall within a healthy BMI range?

      The BMI ranged from 17.306 to 48.684 (see Author response image 5), with the majority of participants falling within a normal BMI (i.e., between 18.5 and 24.9). In our sample, 3 participants had a BMI larger than 30. By using subject as a random intercept in our GLMMs we accounted for potential deviations in their response.

      Author response image 5.

      R3.5: Defining the inference criterion used for the significance of the posterior parameter chains in more detail can be pedagogical for those new to or unfamiliar with inferences drawn from hierarchical Bayesian model estimations and Bayesian statistics.

      We have added an explanation of the highest density intervals and what they mean with respect to our data in the respective result section.

    1. Author response:

      The following is the authors’ response to the original reviews.

      eLife Assessment

      This manuscript makes valuable contributions to our understanding of cell polarisation dynamics and its underlying mechanisms. Through the development of a computational pipeline, the authors provide solid evidence that compensatory actions, whether regulatory or spatial, are essential for the robustness of the polarisation pattern. However, a more comprehensive validation against experimental data and a proper estimation of model parameters are required for further characterization and predictions in natural systems, such as the C. elegans embryo.

      We sincerely thank the editor(s) for their pertinent assessment. We have carefully considered the constructive recommendations and made the necessary revisions in the manuscript, which are also detailed in this response letter. We have implemented most of the revisions requested by the reviewers. For the few requests we did not fully accept, we have provided justifications. The corresponding revisions in both the Manuscript and Supplementary Information are highlighted with a yellow background. To provide a more comprehensive validation against experimental data and model parameters used for characterizing and predicting natural systems, we reproduced the qualitative and semi-quantitative phenomenon in three more experimental groups previously published (Section 2.5; Fig. S8) [Gotta et al., Curr. Biol., 2001; Aceto et al., Dev. Biol., 2006]. Combined with the original experiments (Section 2.5; Fig. S7) [Hoege et al., Curr. Biol., 2010; Beatty et al., Development, 2010; Beatty et al., Development, 2013], now we have reproduced five experimental groups in total (two acting on LGL-1 and three on CDC-42), comprising eight perturbed conditions and using wild-type as the reference. These results effectively demonstrate how comprehensively the network structure and parameters capture the characteristics of the C. elegans embryo. We have also acknowledged the limitations of the current cell polarization model and provided, in 2. Results and 3. Discussion and conclusion, a detailed outline of potential model improvements.

      Joint Public Review:

      The polarisation phenomenon describes how proteins within a signalling network segregate into different spatial domains. This phenomenon holds fundamental importance in biology, contributing to various cellular processes such as cell migration, cell division, and symmetry breaking in embryonic morphogenesis. In this manuscript, the authors assess the robustness of stable asymmetric patterns using both a previously proposed minimal model of a 2-node network and a more realistic 5-node network based on the C. elegans cell polarisation network, which exhibits anterior-posterior asymmetry. They introduce a computational pipeline for numerically exploring the dynamics of a given reaction-diffusion network and evaluate the stability of a polarisation pattern. Typically, the establishment of polarisation requires the mutual inhibition of two groups of proteins, forming a 2-node antagonistic network. Through a reaction-diffusion formulation, the authors initially demonstrate that the widely-used 2-node antagonistic network for creating polarised patterns fails to maintain the polarised pattern in the face of simple modifications. However, the collapsed polarisation can be restored by combining two or more opposing regulations. The position of the interface can be adjusted with spatially varied kinetic parameters. Furthermore, the authors show that the 5-node network utilised by C. elegans is the most stable for maintaining polarisation against parameter changes, identifying key parameters that impact the position of the interface.

      We sincerely thank the editor(s) for the pertinent summary!

      While the results offer novel and insightful perspectives on the network's robustness for cell polarisation, the manuscript lacks comprehensive validation against experimental data, justified node-node network interactions, and proper estimation of model parameters (based on quantitative measurements or molecular intensity distributions). These limitations significantly restrict the utility of the model in making meaningful predictions or advancing our understanding of cell polarisation and pattern formation in natural systems, such as the C. elegans embryo.

      We sincerely thank the editor(s) for the comment!

      To provide a more comprehensive validation against experimental data and model parameters, we reproduced the qualitative and semi-quantitative phenomenon in three more experimental groups previously published (Section 2.5; Fig. S8) [Gotta et al., Curr. Biol., 2001; Aceto et al., Dev. Biol., 2006]. Combined with the original experiments (Section 2.5; Fig. S7) [Hoege et al., Curr. Biol., 2010; Beatty et al., Development, 2010; Beatty et al., Development, 2013], now we have reproduced five experimental groups in total (two acting on LGL-1 and three on CDC-42), comprising eight perturbed conditions and using wild-type as the reference. These meaningful predictions effectively demonstrate the utility of our model’s network structure and parameters in advancing our understanding of cell polarisation and pattern formation in natural systems, exemplified by the C. elegans embryo.

      We have also acknowledged the limitations of the current cell polarization model and provided, in 2. Results and 3. Discussion and conclusion, a detailed outline of potential model improvements. The limitations include, but are not limited to, issues involving “node-node network interactions” and the “proper estimation of model parameters (based on quantitative measurements or molecular intensity distributions)”, both of which rely on experimental measurements of biological information. However, comprehensive experimental measurement data on every molecular species, their interactions, and each species’ intensity distribution in space and time were not fully available from prior research. Refinement is lacking for some of these interactions, potentially requiring years of additional experimentation. Moreover, for certain species at specific developmental stages, only relative (rather than absolute) intensity measurements are available. We agreed that such information is essential for establishing a more utilizable model and discussed it thoroughly in 3. Discussion and conclusion. From a theoretical perspective, we adopted assumptions from the previous literature and constructed a minimal model for a specific cell polarization phase to investigate the network's robustness, supported by five experimental groups and eight perturbed conditions in the C. elegans embryo.

      The study extends its significance by examining how cells maintain pattern stability amid spatial parameter variations, which are common in natural systems due to extracellular and intracellular fluctuations. The authors found that in the 2-node network, varying individual parameters spatially disrupt the pattern, but stability is restored with compensatory variations. Additionally, the polarisation interface stabilises around the step transition between parameter values, making its localisation tunable. This suggests a potential biological mechanism where localisation might be regulated through signalling perception.

      We sincerely thank the editor(s) for the pertinent review!

      Focusing on the C. elegans cell polarisation network, the authors propose a 5-node network based on an exhaustive literature review, summarised in a supplementary table. Using their computational pipeline, they identify several parameter sets capable of achieving stable polarisation and claim that their model replicates experimental behaviour, even when simulating mutants. They also found that among 34 possible network structures, the wild-type network with mutual inhibition is the only one that proves viable in the computational pipeline. Compared with previous studies, which typically considered only 2- or 3-node networks, this analysis provides a more complete and realistic picture of the signalling network behind polarisation in the C. elegans embryo. In particular, the model for C. elegans cell polarisation paves the way for further in silico experiments to investigate the role of the network structure over the polarisation dynamics. The authors suggest that the natural 5-node network of C. elegans is optimised for maintaining cell polarisation, demonstrating the elegance of evolution in finding the optimal network structure to achieve certain functions.

      We sincerely thank the editor(s) for the pertinent review!

      Noteworthy limitations are also found in this work. To simplify the model for numerical exploration, the authors assume several reactions have equivalent dynamics, reducing the parameter space to three independent dimensions. While the authors briefly acknowledge this limitation in the "Discussion and Conclusion" section, further analysis might be required to understand the implications. For instance, it is not clear how the results depend on the particular choice of parameters. The authors showed that adding additional regulation might disrupt the polarised pattern, with the conclusion apparently depending on the strength of the regulation. Even for the 5-node wild-type network, which is the most robust, adding a strong enough self-activation of [A], as done in the 2-node network, will probably cause the polarised pattern to collapse as well.

      We sincerely thank the editor(s) for the comment!

      We have now thoroughly expanded our acknowledgment of the model’s limitations in 2. Results and 3. Discussion and conclusion. To rule out the possibility that the equivalent-dynamics assumption undermines our conclusions, we have added simulations showing that the stability of the cell polarization pattern does not depend on the exact strength of each regulation, provided the regulations on both sides are initially balanced as a whole (Fig. S5). Specifically, we used a Monte Carlo method to sample a wide range of parameter values (i.e., γ, α, k<sub>1</sub>, k<sub>2</sub>, q<sub>1</sub>, q<sub>2</sub> and [X<sub>c</sub>]) for all nodes and regulations in the simple 2-node network and the C. elegans 5-node network, to achieve pattern stability. Under these conditions (i.e., without any reduction in the parameter space), single-sided self-regulation, single-sided additional regulation, and unequal system parameters still cause the stable polarized pattern to collapse, consistent with our conclusions in the simplified conditions with the parameter space reduced to three independent dimensions.
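
      The Monte Carlo sampling described above can be sketched as follows. This is only a minimal illustration, not the pipeline used in the paper: the parameter names are taken from the response, while the sampling ranges, the log-uniform distribution, and the `is_viable` callback (which would wrap a simulation plus a pattern-stability check) are assumptions made for this example.

```python
import math
import random

# Parameter names follow the response text; every range below is hypothetical.
PARAM_RANGES = {
    "gamma": (0.01, 1.0),
    "alpha": (0.01, 1.0),
    "k1": (0.1, 10.0),
    "k2": (0.1, 10.0),
    "q1": (0.001, 1.0),
    "q2": (0.001, 1.0),
    "Xc": (0.1, 10.0),
}


def sample_log_uniform(low, high, rng):
    """Draw one value whose logarithm is uniform on [log(low), log(high)]."""
    return math.exp(rng.uniform(math.log(low), math.log(high)))


def sample_parameter_set(rng, ranges=PARAM_RANGES):
    """Draw one independent value per parameter (no reduction of the space)."""
    return {name: sample_log_uniform(lo, hi, rng) for name, (lo, hi) in ranges.items()}


def monte_carlo_search(n_samples, is_viable, seed=0):
    """Keep the sampled parameter sets accepted by the user-supplied predicate."""
    rng = random.Random(seed)  # fixed seed so that a search can be repeated
    viable = []
    for _ in range(n_samples):
        params = sample_parameter_set(rng)
        if is_viable(params):
            viable.append(params)
    return viable
```

      In such a sketch, every node and regulation would receive its own draw, and the perturbations of interest (single-sided self-regulation, single-sided additional regulation, unequal parameters) would then be applied to each accepted set.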

      Additionally, the authors utilise parameter values that are unrealistic, fail to provide units for some of them, and assume unknown parameter values without justification. The model appears to have non-dimensionalised length but not time, resulting in a mix of dimensional and non-dimensional variables that can be confusing. Furthermore, they assume equal values for Hill coefficients and many parameters associated with activation and inhibition pathways, while setting inhibition intensity parameters to 1. These arbitrary choices raise concerns about the fidelity of the proposed model in representing the real system, as their selected values could potentially differ by many orders of magnitude from the actual parameters.

      We sincerely thank the editor(s) for the comment!

      We apologize for the confusion. The non-dimensionalised parameter values are adopted from previous theoretical research [Seirin-Lee et al., Cells, 2020], which in turn draws on the experimental measurements in [Goehring et al., J. Cell Biol., 2011; Goehring et al., Science, 2011]. With the in silico time set as 2 sec per step, we have now added the Supplemental Text justifying how the units are removed during non-dimensionalization. This demonstrates that the derived non-dimensionalized parameters in this paper take realistic values on the same order of magnitude as those observed in reality, confirming the fidelity of the proposed model in representing the real system.
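
      As a worked illustration of how units drop out, the standard rescalings D* = D·T/L² and k* = k·T can be written as a short sketch. Only the 2 s time scale comes from the response; the length scale and the physical values of the diffusion coefficient and off-rate below are hypothetical placeholders, and treating the 2 s step as the time scale T is itself an assumption.

```python
def nondimensional_diffusion(d_physical, length_scale, time_scale):
    """D* = D * T / L**2, with D in um^2/s, L in um and T in s."""
    return d_physical * time_scale / length_scale ** 2


def nondimensional_rate(k_physical, time_scale):
    """k* = k * T, with k in 1/s and T in s."""
    return k_physical * time_scale


TIME_SCALE_S = 2.0       # from the response: 2 s of real time per in silico step
LENGTH_SCALE_UM = 100.0  # hypothetical length scale, for illustration only
D_MEMBRANE = 0.1         # hypothetical membrane diffusion coefficient, um^2/s
K_OFF = 0.005            # hypothetical basal off-rate, 1/s

d_star = nondimensional_diffusion(D_MEMBRANE, LENGTH_SCALE_UM, TIME_SCALE_S)
k_star = nondimensional_rate(K_OFF, TIME_SCALE_S)
print(d_star, k_star)
```

      Comparing `d_star` and `k_star` with the numerical values used in a model is one way to check order-of-magnitude consistency between simulated and measured parameters.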

      The assumption of “equal values for Hill coefficients and many parameters associated with activation and inhibition pathways” serves to reduce the parameter space to an affordable computational cost. Fixing Hill coefficients [Seirin-Lee et al., J. Theor. Biol., 2015; Seirin-Lee, Bull. Math. Biol., 2021] and unifying parameter values across pathways is a widely used strategy in network research, both on cell polarization [Marée et al., Bull. Math. Biol., 2006; Goehring et al., Science, 2011; Trong et al., New J. Phys., 2014] and on other biological topics (e.g., plasmid transfer in microbial communities [Wang et al., Nat. Commun., 2020]). Nevertheless, to rule out the possibility that the equivalent-dynamics assumption undermines our conclusions, we have added simulations showing that the stability of the cell polarization pattern does not depend on the exact parameter values associated with activation and inhibition pathways, provided the regulations on both sides are initially balanced as a whole (Fig. S5). Specifically, we used a Monte Carlo method to sample a wide range of parameter values (i.e., γ, α, k<sub>1</sub>, k<sub>2</sub>, q<sub>1</sub>, q<sub>2</sub> and [X<sub>c</sub>]) for all nodes and regulations in the simple 2-node network and the C. elegans 5-node network, to achieve pattern stability. Under these conditions (i.e., without any reduction in the parameter space), single-sided self-regulation, single-sided additional regulation, and unequal system parameters still cause the stable polarized pattern to collapse, consistent with our conclusions in the simplified conditions with the parameter space reduced to three independent dimensions.

      To confirm the fidelity of the proposed model in representing the real system, we reproduced the qualitative and semi-quantitative phenomena in three more previously published experimental groups (Section 2.5; Fig. S8) [Gotta et al., Curr. Biol., 2001; Aceto et al., Dev. Biol., 2006]. Combined with the original experiments (Section 2.5; Fig. 5; Fig. S7) [Hoege et al., Curr. Biol., 2010; Beatty et al., Development, 2010; Beatty et al., Development, 2013], we have now reproduced five experimental groups in total (two acting on LGL-1 and three on CDC-42), comprising eight perturbed conditions and using wild-type as the reference. These results demonstrate how comprehensively the network structure and parameters capture the characteristics of the C. elegans embryo. We have also acknowledged the limitations of the current cell polarization model and provided, in 2. Results and 3. Discussion and conclusion, a detailed outline of potential model improvements.

      It is worth noting that, although a strict match between numerical and realistic parameter values with consistent units is always helpful, a lot of notable pure numerical studies successfully unveil principles that help interpret [Ma et al., Cell, 2009] and synthesize real biological systems [Chau et al., Cell, 2012]. These studies suggest that numerical analysis in biological systems remains powerful, even when comprehensive experimental data from prior research are not fully available.

      The definition of stability and its evaluation in the proposed pipeline might also be too narrow. Throughout the paper, the authors discuss the stability of the polarised pattern, checked by an exhaustive search of the parameter space where the system reaches a steady state with a polarised pattern instead of a homogeneous pattern. It is not clear if the stability is related to the linear stability analysis of the reaction terms, as conducted in Goehring et al. (Science, 2011), which could indicate if a homogeneous state exists and whether it is stable or unstable. The stability test is performed through a pipeline procedure where they always start from a polarised pattern described by their model and observe how it evolves over time. It is unclear if the conclusions depend on the chosen initial conditions. Particularly, it is unclear what would happen if the initial distribution of posterior molecules is not exactly symmetric with respect to the anterior molecules, or if the initial polarisation is not strong.

      We sincerely thank the editor(s) for the comment!

      The definition of stability and its evaluation in the proposed pipeline consider two criteria: 1. The pattern is polarized; 2. The pattern is stable. The simulations, figures, and videos (Fig. 1-6; Fig. S1-S5; Fig. S7-S9; Movie S1-S5) demonstrate that the parameters and networks we set up capture the cell polarization dynamics well in both the stable and unstable states.
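
      The two criteria can be made concrete with a small sketch. The specific tests used here (which species dominates at each end of the domain, and how far the interface moves across late snapshots) and the `max_shift` tolerance are hypothetical choices for illustration; the response does not state how the pipeline quantifies either criterion.

```python
def interface_position(anterior, posterior):
    """Index of the first grid point where the posterior species exceeds
    the anterior one, or None if no such point exists."""
    for i, (a, p) in enumerate(zip(anterior, posterior)):
        if p > a:
            return i
    return None


def is_polarized(anterior, posterior):
    """Criterion 1: anterior species dominates at the first grid point and
    posterior species dominates at the last one."""
    return anterior[0] > posterior[0] and posterior[-1] > anterior[-1]


def is_stable(snapshots, max_shift=1):
    """Criterion 2: the interface moves by at most max_shift grid points
    across the given late-time (anterior, posterior) snapshots."""
    positions = [interface_position(a, p) for a, p in snapshots]
    if any(pos is None for pos in positions):
        return False
    return max(positions) - min(positions) <= max_shift


def is_viable(snapshots, max_shift=1):
    """Both criteria together: polarized in every snapshot and stable overall."""
    return all(is_polarized(a, p) for a, p in snapshots) and is_stable(snapshots, max_shift)
```

      A homogeneous profile fails the first criterion, whereas a polarized profile whose interface keeps drifting fails the second.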

      We have now added new simulations on alternative initial conditions. They demonstrate the necessity of a polarized initial pattern set up independently of the reaction-diffusion network during the establishment phase, probably through additional mechanisms such as active actomyosin contractility and flow [Cuenca et al., Development, 2003; Gross et al., Nat. Phys., 2019]. Our conclusions (i.e., single-sided self-regulation, single-sided additional regulation, and unequal system parameters cause the stable polarized pattern to collapse) have little dependence on the chosen initial conditions as long as the asymmetric initial patterns can set up a stable polarized pattern. Some of the simulations show intuitively that our conclusions still hold if the initial distribution of posterior molecules is not exactly symmetric with respect to the anterior molecules, or if the initial polarisation is not strong (Fig. S4 and Fig. S9).

      Regarding the biological interpretation and relevance of the model, it overlooks some important aspects of the C. elegans polarisation system. The authors focus solely on a reaction-diffusion formulation to reproduce the polarisation pattern. However, the polarisation of the C. elegans zygote consists of two distinct phases: establishment and maintenance, with actomyosin dynamics playing a crucial role in both phases (see Munro et al., Dev Cell 2004; Shivas & Skop, MBoC 2012; Liu et al., Dev Biol 2010; Wang et al., Nat Cell Biol 2017). Both myosin and actin are crucial to maintaining the localisation of PAR proteins during cell polarisation, yet the authors neglect cortical flows during the establishment phase and any effects driven by myosin and actin in their model, failing to capture the system's complexity. How this affects the proposed model and conclusions about the establishment of the polarisation pattern needs careful discussion. Additionally, they assume that diffusion in the cytoplasm is infinitely fast and that cytoplasmic flows do not play any role in cell polarity. Finite cytoplasmic diffusion combined with cytoplasmic flows could compromise the stability of the anterior-posterior molecular distributions. The authors claim that cytoplasmic diffusion coefficients are two orders of magnitude higher than membrane diffusion coefficients, but they seem to differ by only one order of magnitude (Petrášek et al., Biophys. J. 2008). The strength of cytoplasmic flows has been quantified by a few studies, including Cheeks et al., Curr Biol 2004.

      We sincerely thank the editor(s) for the comment!

      Indeed, previous research highlighted the importance of convective cortical flow in orchestrating the localisation of PAR proteins during the establishment phase of polarisation formation [Goehring et al., J. Cell Biol., 2011; Rose et al., WormBook, 2014; Beatty et al., Development, 2013]. However, during the maintenance phase, the non-muscle myosin II (NMY-2) is regulated downstream by the PAR protein network rather than serving as the primary upstream factor controlling PAR protein localization [Goehring et al., J. Cell Biol., 2011; Rose et al., WormBook, 2014; Beatty et al., Development, 2013]. While some theoretical studies integrated both reaction-diffusion dynamics and the effects of myosin and actin [Tostevin, 2008; Goehring, Science, 2011], others focused exclusively on reaction-diffusion dynamics [Dawes et al., Biophys. J., 2011; Seirin-Lee et al., Cells, 2020]. We have now clarified the distinction between the establishment and maintenance phases in 1. Introduction, emphasized our research focus on the reaction-diffusion dynamics during the maintenance phase in 2. Results, and provided a discussion of the omitted actomyosin dynamics to foster a more comprehensive understanding in the future in 3. Discussion and conclusion. The effect of the establishment phase is studied as the initial condition for the cell polarization simulation solely governed by reaction-diffusion dynamics, with new simulations demonstrating the necessity of a polarized initial pattern set up independently of the reaction-diffusion network during the establishment phase, probably through additional mechanisms such as the active actomyosin contractility and flow [Cuenca et al., Development, 2003; Gross et al., Nat. Phys., 2019].

      Cytoplasmic and membrane diffusion coefficients differ by two orders of magnitude according to previous experimental measurements on PAR-2 and PAR-6 [Goehring et al., J. Cell Biol., 2011; Lim et al., Cell Rep., 2021]. Many previous C. elegans cell polarization models have incorporated a mass-conservation model combined with finite cytoplasmic diffusion, but this model description can lead to a reversed spatial concentration distribution between the cell membrane and cytosol [Fig. 3 of Seirin-Lee et al., J. Theor. Biol., 2016; Fig. 2ab of Seirin-Lee et al., J. Math. Biol., 2020], contradicting experimental observations [Fig. 4A of Sailer et al., Dev. Cell, 2015; Fig. 1A of Lim et al., Cell Rep., 2021]. This implies that the infinite cytoplasmic diffusion, without precise experiment-based parameter assignment or accounting for other hidden biological processes (e.g., protein production and degradation), may be inappropriate in modeling the real spatial concentration distributions distinguished between the cell membrane and cytosol. To address this issue, some theoretical research incorporated protein production and degradation into the model, to obtain a consistent spatial concentration distribution between the cell membrane and cytosol [Tostevin et al., Biophys. J., 2008]. More definitive experimental data on the spatiotemporal changes in protein diffusion, production, and degradation are essential for providing a more realistic representation of cellular dynamics and enhancing the model's predictive power.
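
      To make the well-mixed cytoplasm assumption concrete, the sketch below integrates a generic two-species mutual-inhibition system on a one-dimensional membrane, with the cytoplasmic concentration held at a uniform constant `x_c`. It is a toy illustration, not the equations of the paper: the reaction terms, the explicit Euler scheme, the polarized initial condition, and all parameter values are assumptions chosen only so that the example runs stably.

```python
def laplacian_no_flux(u, dx):
    """Second difference with reflecting (no-flux) boundaries."""
    n = len(u)
    out = []
    for i in range(n):
        left = u[i - 1] if i > 0 else u[1]
        right = u[i + 1] if i < n - 1 else u[n - 2]
        out.append((left - 2.0 * u[i] + right) / dx ** 2)
    return out


def euler_step(a, p, dt, dx, d_mem, k_on, x_c, k_off, k_inh, hill):
    """One explicit Euler step: membrane diffusion, binding from a uniform
    cytoplasmic pool x_c, basal unbinding, and mutual inhibition."""
    lap_a = laplacian_no_flux(a, dx)
    lap_p = laplacian_no_flux(p, dx)
    new_a = [ai + dt * (d_mem * la + k_on * x_c - k_off * ai - k_inh * ai * pi ** hill)
             for ai, pi, la in zip(a, p, lap_a)]
    new_p = [pi + dt * (d_mem * lp + k_on * x_c - k_off * pi - k_inh * pi * ai ** hill)
             for ai, pi, lp in zip(a, p, lap_p)]
    return new_a, new_p


def simulate(n=50, steps=200, dt=0.1, d_mem=1e-4, k_on=0.1, x_c=0.5,
             k_off=0.05, k_inh=1.0, hill=2):
    """Start from a sharply polarized pattern on the unit interval and
    return the anterior and posterior membrane profiles after `steps` steps."""
    dx = 1.0 / n
    a = [1.0 if i < n // 2 else 0.0 for i in range(n)]
    p = [0.0 if i < n // 2 else 1.0 for i in range(n)]
    for _ in range(steps):
        a, p = euler_step(a, p, dt, dx, d_mem, k_on, x_c, k_off, k_inh, hill)
    return a, p


anterior, posterior = simulate()
```

      Because `x_c` is uniform, the cytoplasm enters only through the constant binding term `k_on * x_c`; a finite-diffusion variant would instead need a second spatial field for each cytoplasmic species.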

      We have now acknowledged the possibly overlooked aspects of the C. elegans polarisation system and provided, in 3. Discussion and conclusion, a detailed outline of potential model improvements. Those aspects include, but are not limited to, issues involving “neglect cortical flows” and “diffusion in the cytoplasm is infinitely fast”. From a theoretical perspective, we adopted assumptions from the previous literature and constructed a minimal model for a specific cell polarization phase to investigate the network's robustness. The meaningful predictions for five experimental groups and eight perturbed conditions in the C. elegans embryo support the biological interpretation and relevance of the model.

      Although the authors compare their model predictions to experimental observations, particularly in reproducing mutant behaviours, they do not explicitly show or discuss these comparisons in detail. Diffusion coefficients and off-rates for some PAR proteins have been measured (Goehring et al., JCB 2011), but the authors seem to use parameter values that differ by many orders of magnitude, perhaps due to applied scaling. To ensure meaningful predictions, whether their proposed model captures the extensive published data should be evaluated. Various cellular/genetic perturbations have been studied to understand their effects on anterior-posterior boundary positioning. Testing these perturbations' responses in the model would be important. For example, comparing the intensity distribution of PAR-6 and PAR-2 with measurements during the maintenance phase by Goehring et al., JCB 2011, or comparing the normalised intensity of PAR-3 and PKC-3 from the model with those measured by Wang et al., Nat Cell Biol 2017, during establishment and maintenance phases (in both wild-type and cdc-42 (RNAi) zygotes) could provide insightful validation. Additionally, in the presence of active CDC-42, it has been observed that PAR-6 extends further into the posterior side (Aceto et al., Dev Biol 2006). Conducting such validation tests is essential to convince readers that the model accurately represents the actual system and provides insights into pattern formation during cell polarisation.

      We sincerely thank the editor(s) for the comment!

      To provide more comprehensive validations and refinements to ensure the model accurately represents biological systems, we reproduced the qualitative and semi-quantitative phenomena in three more previously published experimental groups (Section 2.5; Fig. S8) [Gotta et al., Curr. Biol., 2001; Aceto et al., Dev. Biol., 2006]. Combined with the original experiments (Section 2.5; Fig. 5; Fig. S7) [Hoege et al., Curr. Biol., 2010; Beatty et al., Development, 2010; Beatty et al., Development, 2013], we have now reproduced five experimental groups in total from published data, comprising eight perturbed conditions and using wild-type as the reference. We have also explicitly shown the comparison between model predictions and experimental observations (including the reproduction of mutant behaviors) in detail, by describing how the “cell polarization pattern characteristics in simulation” respond to various cellular/genetic perturbations (Section 2.5; Fig. 5; Fig. S7 and S8). We believe the original and new validation tests show that the model accurately represents the actual system and provides insights into pattern formation during cell polarisation.

      The diffusion coefficients for anterior and posterior molecular species were assigned according to previous experimental and theoretical research [Goehring et al., J. Cell Biol., 2011; Goehring et al., Science, 2011; Seirin-Lee et al., Cells, 2020]. The off-rates are assigned uniformly by searching for viable parameter sets that can set up a network with cell polarization pattern stability. We have now added simulations showing that the cell polarization pattern stability and the response to network structure and parameter perturbation do not depend on the exact parameter values (including diffusion coefficients and off-rates), provided the parameter values on both sides are initially balanced as a whole (Fig. S5). Specifically, we used a Monte Carlo method to sample a wide range of parameter values (i.e., γ, α, k<sub>1</sub>, k<sub>2</sub>, q<sub>1</sub>, q<sub>2</sub> and [X<sub>c</sub>]) for all nodes and regulations in the simple 2-node network and the C. elegans 5-node network, to achieve pattern stability. Under these conditions (i.e., without any reduction in the parameter space), single-sided self-regulation, single-sided additional regulation, and unequal system parameters still cause the stable polarized pattern to collapse, consistent with our conclusions in the simplified conditions with the parameter space reduced to three independent dimensions.

      With the in silico time set as 2 sec per step, we have now added the Supplemental Text justifying how the units are removed during non-dimensionalization. This demonstrates that the derived non-dimensionalized parameters in this paper take realistic values on the same order of magnitude as those observed in reality, confirming the fidelity of the proposed model in representing the real system. We agree that full experimental measurements of biological information are essential for establishing a more utilizable model and have discussed this thoroughly in 3. Discussion and conclusion.

      A clear justification, with references, for each network interaction between nodes in the five-node model is needed. Some of the activatory/inhibitory signals proposed by the authors have not been demonstrated ( e.g. CDC-42 directly inhibiting CHIN-1). Table S2 provided by the authors is insufficient to justify each node-node interaction, requiring additional explanations. (See the review by Gubieda et al., Phil. Trans. R. Soc. B 2020, for a similar node network that differs from the authors' model.) Additionally, the intensity distributions of cortical PAR-3 and PKC-3 seem to vary significantly during both establishment and maintenance phases (Wang et al., Nat Cell Biol 2017), yet the authors consider the PAR-3/PAR-6/PKC-3 as a single complex. The choices in the model should be justified, as the presence or absence of clustering of these PAR proteins can be crucial during cell polarisation (Wang et al., Nat Cell Biol 2017; Dawes & Munro, Biophys J 2011).

      We sincerely thank the editor(s) for the comment!

      We have now acknowledged the limitations of the current cell polarization model and provided, in 2. Results and 3. Discussion and conclusion, a detailed outline of potential model improvements. The limitations include, but are not limited to, issues involving “each network interaction between nodes” and “consider the PAR-3/PAR-6/PKC-3 as a single complex”, the former of which relies on experimental measurements of biological information. However, comprehensive experimental measurement data on every molecular species, their interactions, and each species’ intensity distribution in space and time were not fully available from prior research. Refinement is lacking for some of these interactions, potentially requiring years of additional experimentation. Moreover, for certain species at specific developmental stages, only relative (rather than absolute) intensity measurements are available. We agree that such information is essential for establishing a more utilizable model and have discussed it thoroughly in 3. Discussion and conclusion.

      Consistent with previous modeling efforts [Goehring et al., Science, 2011; Gross et al., Nat. Phys., 2019; Lim et al., Cell Rep., 2021], our model treats the PAR-3/PAR-6/PKC-3 complex as a single entity for simplification, thus neglecting the potentially distinct spatial distributions of each individual molecular species. We agree that a more comprehensive model, capable of resolving the individual localization patterns of these anterior PAR proteins, would be a valuable future direction. From a theoretical perspective, we adopted assumptions from the previous literature and constructed a minimal model for a specific cell polarization phase to investigate the network's robustness, supported by five experimental groups and eight perturbed conditions in the C. elegans embryo.

      In summary, the authors successfully demonstrate the importance of compensatory actions in maintaining polarisation robustness. Their computational pipeline offers valuable insights into the dynamics of reaction-diffusion networks. However, the lack of detailed experimental validation and realistic parameter estimation limits the model's applicability to real biological systems. While the study provides a solid foundation, further work is needed to fully characterise and validate the model in natural contexts. This work has the potential to significantly impact the field by providing a new perspective on the robustness of cell polarisation networks.

      We sincerely thank the editor(s) for the pertinent summary!

      To provide a more comprehensive validation against experimental data and model parameters, we reproduced three more groups of qualitative and semi-quantitative phenomena regarding CDC-42 based on previously published experiments (Section 2.5; Fig. S8) [Gotta et al., Curr. Biol., 2001; Aceto et al., Dev. Biol., 2006]. Combined with the original experiments (Section 2.5; Fig. 5; Fig. S7) [Hoege et al., Curr. Biol., 2010; Beatty et al., Development, 2010; Beatty et al., Development, 2013], we have now reproduced five experimental groups in total, comprising eight perturbed conditions and using wild-type as the reference.

      With the in silico time set as 2 sec per step, we have now added the Supplemental Text justifying how the units are removed during non-dimensionalization. This demonstrates that the derived non-dimensionalized parameters in this paper take realistic values on the same order of magnitude as those observed in reality, confirming the fidelity of the proposed model in representing the real system. Together with the reproduction of five experimental groups (eight perturbed conditions with wild-type as the reference), the model’s applicability to real biological systems in natural contexts is fully characterised and validated.

      The computational pipeline developed could be a valuable tool for further in silico experiments, allowing researchers to explore the dynamics of more complex networks. To maximise its utility, the model needs comprehensive validation and refinement to ensure it accurately represents biological systems. Addressing these limitations, particularly the need for more detailed experimental validation and realistic parameter choices, will enhance the model's predictive power and its applicability to understanding cell polarisation in natural systems.

      We sincerely thank the editor(s) for the comment!

      To provide more comprehensive validations and refinements to ensure the model accurately represents biological systems, we reproduced the qualitative and semi-quantitative phenomena in three more previously published experimental groups (Section 2.5; Fig. S8) [Gotta et al., Curr. Biol., 2001; Aceto et al., Dev. Biol., 2006]. Combined with the original experiments (Section 2.5; Fig. 5; Fig. S7) [Hoege et al., Curr. Biol., 2010; Beatty et al., Development, 2010; Beatty et al., Development, 2013], we have now reproduced five experimental groups in total from published data, comprising eight perturbed conditions and using wild-type as the reference. We have also explicitly shown the comparison between model predictions and experimental observations (including the reproduction of mutant behaviors) in detail, by describing how the “cell polarization pattern characteristics in simulation” respond to various cellular/genetic perturbations (Section 2.5; Fig. 5; Fig. S7 and S8).

      With the in silico time set as 2 sec per step, now we have added the Supplemental Text justifying how the units are removed during non-dimensionalization. This demonstrates that the derived non-dimensionalized parameter in this paper achieves realistic values on the same order of magnitude as those observed in reality, confirming the fidelity of the proposed model in representing the real system. Together with the reproduction of five experimental groups (eight perturbed conditions with wild-type as the reference), the model's predictive power and its applicability to understanding cell polarisation in natural systems are enhanced.

      We have now added simulations showing that the cell polarization pattern stability and the response to network structure and parameter perturbation do not depend on the exact parameter values (including diffusion coefficients, basal off-rates and inhibition intensity), provided the parameter values on both sides are initially balanced as a whole (Fig. S5). Specifically, we used a Monte Carlo method to sample a wide range of parameter values (i.e., γ, α, k<sub>1</sub>, k<sub>2</sub>, q<sub>1</sub>, q<sub>2</sub> and [X<sub>c</sub>]) for all nodes and regulations in the simple 2-node network and the C. elegans 5-node network, to achieve pattern stability. Under these conditions (i.e., without any reduction in the parameter space), single-sided self-regulation, single-sided additional regulation, and unequal system parameters still cause the stable polarized pattern to collapse, consistent with our conclusions in the simplified conditions with the parameter space reduced to three independent dimensions.

      Recommendations for the Authors:

      (1) Parameterisation and Model Validation: The authors utilise parameter values that lack realism and fail to provide units for some of them, which can lead to confusion. For instance, the length of the cell is set to 0.5 without clear justification, raising questions about the scale used. Additionally, there's a mix of dimensional and non-dimensional variables, potentially complicating interpretation. Furthermore, arbitrary choices such as equal Hill coefficients and setting inhibition intensity parameters to 1 raise concerns about model fidelity. To ensure meaningful predictions, the authors should validate their model against extensive published data, including cellular/genetic perturbations. For example, comparing intensity distributions of PAR proteins measured during maintenance phases by Goehring et al., JCB 2011, and those obtained from the model could provide valuable validation. Similarly, comparisons with data from Wang et al., Nat Cell Biol 2017, on wild-type and cdc-42 (RNAi) zygotes, as well as observations from Aceto et al., Dev Biol 2006, on PAR-6 extension in the presence of active CDC-42, would strengthen the model's validity. Such validation tests are essential for convincing readers that the model accurately represents the actual system and can provide insights into pattern formation during cell polarisation.

      We sincerely thank the editor(s) and referee(s) for the helpful suggestion!

      We have now added a new section, Parameter Nondimensionalization and Order of Magnitude Consistency, to the Supplemental Text. In this section, we explain how we adopted the parameter nondimensionalization and value assignments from previous works [Goehring et al., J. Cell Biol., 2011; Goehring et al., Science, 2011; Seirin-Lee et al., Cells, 2020]. We list four examples (i.e., evolution time, membrane diffusion coefficient, basal off-rate, and inhibition intensity) to show the consistency in order of magnitude between numerical and realistic values.

      The assumption of “equal Hill coefficients” serves to reduce the parameter space to an affordable computational cost. Fixing Hill coefficients is a widely used strategy in network research [Seirin-Lee et al., J. Theor. Biol., 2015; Seirin-Lee, Bull. Math. Biol., 2021]. In addition, setting the inhibition intensity parameters to 1 fixes the numerical scale. We have now added simulations showing that the cell polarization pattern stability does not depend on the exact parameter values associated with activation and inhibition pathways, provided the regulations on both sides are initially balanced as a whole (Fig. S5). Specifically, we used a Monte Carlo method to sample a wide range of parameter values (i.e., γ, α, k<sub>1</sub>, k<sub>2</sub>, q<sub>1</sub>, q<sub>2</sub> and [X<sub>c</sub>]) for all nodes and regulations in the simple 2-node network and the C. elegans 5-node network, to achieve pattern stability. Under these conditions (i.e., without any reduction in the parameter space), single-sided self-regulation, single-sided additional regulation, and unequal system parameters still cause the stable polarized pattern to collapse, consistent with our conclusions in the simplified conditions with the parameter space reduced to three independent dimensions.

      To confirm the fidelity of the proposed model in representing the real system, we reproduced the qualitative and semi-quantitative phenomena in three more previously published experimental groups (Section 2.5; Fig. S8) [Gotta et al., Curr. Biol., 2001; Aceto et al., Dev. Biol., 2006]. Combined with the original experiments (Section 2.5; Fig. 5; Fig. S7) [Hoege et al., Curr. Biol., 2010; Beatty et al., Development, 2010; Beatty et al., Development, 2013], now we have reproduced five experimental groups in total (two acting on LGL-1 and three on CDC-42), comprising eight perturbed conditions and using wild-type as the reference. These results demonstrate how comprehensively the network structure and parameters capture the characteristics of the C. elegans embryo. We have also acknowledged the limitations of the current cell polarization model and provided, in 2. Results and 3. Discussion and conclusion, a detailed outline of potential model improvements.

      It is worth noting that, although a strict match between numerical and realistic parameter values with consistent units is always helpful, many notable purely numerical studies have successfully unveiled principles that help interpret [Ma et al., Cell, 2009] and synthesize [Chau et al., Cell, 2012] real biological systems. These studies suggest that numerical analysis of biological systems remains powerful, even when comprehensive experimental data from prior research are not fully available.

      (2) Parameter Changes: It is not clear how the parameters change as more complicated networks are explored, and how this affects the comparison between the simple and complete model. Clarification on this point would be beneficial.

      We sincerely thank the editor(s) and referee(s) for the helpful suggestion!

      The computational pipeline in Section 2.1 is generalized for all reaction-diffusion networks, including the simple and complete ones studied in this paper. The parameter changes included two parts: 1. The mutual activation in the anterior (none for the simple 2-node network and q<sub>2</sub> for the complete 5-node network); 2. The viable parameter sets (122 sets for the simple 2-node network and 602 sets for the complete 5-node network). Now we have explicitly clarified those differences:

      Those differences do not affect the comparison between the simple and complete models. Now we have added comprehensive comparisons between the simple and complete models regarding: 1. How they respond consistently to alternative initial conditions (Fig. S2). 2. How they respond consistently to alternative single modifications (Fig. S4 and S9), even when the parameters (i.e., γ, α, k<sub>1</sub>, k<sub>2</sub>, q<sub>1</sub>, q<sub>2</sub>, and [X<sub>c</sub>]) are assigned various values for all nodes and regulations (Fig. S5).

      (3) Exploration of Model Parameter Space: In the two-node dual antagonistic model, the authors observe that the cell polarisation pattern is unstable for different systems (Fig. 1). However, it remains uncertain whether this instability holds true for the entire model parameter space. Have the authors thoroughly screened the full model parameter space to support their statements? It would be beneficial for the authors to provide clarification on the extent of their exploration of the model parameter space to ensure the robustness of their conclusions.

      We sincerely thank the editor(s) and referee(s) for the helpful suggestion!

      The trade-off between the parameter space considered and the computational cost is a long-standing challenge in network studies, as there are always numerous combinations of network nodes, edges, and parameters [Ma et al., Cell, 2009; Chau et al., Cell, 2012]. The computational pipeline in Section 2.1, generalized for all reaction-diffusion networks, uses two strategies to limit the computational cost and set up a basic network reference: 1. Dimension Reduction (Strategy 1): Unifying the parameter values for different nodes and different edges within the same regulatory type to reduce the number of distinct parameters to three; 2. Parameter Space Confinement (Strategy 2): Enumerating the dimensionless parameter set on a three-dimensional (3D) grid confined by γ∈ [0,0.05] in steps ∆γ = 0.001, k<sub>1</sub>∈[0,5] in steps ∆k<sub>1</sub> = 0.05,  and  in steps .
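
      The grid enumeration in Strategy 2 can be sketched as follows. This is a minimal illustration, not the authors' pipeline: only the two axes whose ranges are stated above (γ and k<sub>1</sub>) are built, the helper name enumerate_grid is hypothetical, and a third axis would simply be passed as one more argument.

```python
import numpy as np

# Axes quoted in the response: gamma in [0, 0.05] with step 0.001 and
# k1 in [0, 5] with step 0.05. linspace is used instead of arange to
# avoid floating-point drift with non-integer step sizes.
gamma_axis = np.linspace(0.0, 0.05, 51)
k1_axis = np.linspace(0.0, 5.0, 101)


def enumerate_grid(*axes):
    """Return an array with one row per point of the Cartesian grid.

    Hypothetical helper: each row is one candidate parameter set that
    would then be passed to a reaction-diffusion simulation.
    """
    mesh = np.meshgrid(*axes, indexing="ij")
    return np.stack([m.ravel() for m in mesh], axis=1)


grid = enumerate_grid(gamma_axis, k1_axis)
n_points = grid.shape[0]  # number of simulations needed for these two axes
```

      Because the number of grid points is the product of the axis lengths, each additional axis multiplies the simulation count, which is the cost that the dimension reduction in Strategy 1 is meant to contain.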

      In the early stage of our project, we tried to explore “the entire model parameter space” as indicated by the reviewer. We first tried to use the Monte Carlo method to find parameter solutions in an open parameter space with all parameter values allowed to differ. However, such a process is full of randomness and is computationally expensive: it took months to search for viable parameter sets, and the probability of finding one was no higher than 0.02%, making it very hard to profile a continuous viable parameter space. We can now clearly see that the viable parameter space is a thin curved surface where all parameters have to satisfy a critical balance (Fig. 3a, b, Fig. 5e, f). This is why we adopt a typical strategy for dimension reduction in network research, used both in cell polarization [Marée et al., Bull. Math. Biol., 2006; Goehring et al., Science, 2011; Trong et al., New J. Phys., 2014] and in other biological topics (e.g., plasmid transfer in microbial communities [Wang et al., Nat. Commun., 2020]), i.e., unifying the parameter values for different nodes and different edges within the same regulatory type.
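
      To give a sense of what a hit probability of at most 0.02% means for a random search, the short calculation below estimates how many draws are needed. It is a back-of-the-envelope sketch that assumes independent draws with a fixed success probability; the function names are hypothetical, and because 0.02% is an upper bound, the resulting sample counts are lower bounds.

```python
import math

HIT_PROBABILITY = 0.0002  # 0.02%, the upper bound quoted in the response


def expected_samples_per_hit(p):
    """Mean number of independent draws per viable parameter set."""
    return 1.0 / p


def prob_at_least_one_hit(p, n):
    """Probability that n independent draws contain at least one hit."""
    return 1.0 - (1.0 - p) ** n


def samples_for_confidence(p, confidence):
    """Smallest n such that prob_at_least_one_hit(p, n) >= confidence."""
    return math.ceil(math.log(1.0 - confidence) / math.log(1.0 - p))


mean_draws = expected_samples_per_hit(HIT_PROBABILITY)
draws_for_95 = samples_for_confidence(HIT_PROBABILITY, 0.95)
```

      Under these assumptions, roughly five thousand simulations are needed per viable set on average, and each one is a full reaction-diffusion run, which is consistent with the months-long search described above.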

      Additionally, because the curved surface of the viable parameter space can be extended indefinitely as long as the parameter balance is achieved (Fig. 3a, b, Fig. 5e, f), it is both impossible and unnecessary to explore “the entire model parameter space”. Setting up a confined parameter region near the origin for parameter enumeration helps profile the continuous viable parameter space, which is sufficient for presenting the central conclusion of this paper: the network structure and parameters need to satisfy a balance for stable cell polarization.

      To support a comprehensive study considering all kinds of reference and perturbed networks, we have maximized the parameter domain size by exhausting all the computational resources we can access, including 400-500 Intel(R) Core(TM) E5-2670v2 and Gold 6132 CPUs on the server (High-Performance Computing Platform at Peking University) and 5 Intel(R) Core(TM) i9-14900HX CPUs on personal computers.

      To confirm that the instability holds true when the model parameter space is extended, we add a comprehensive comparison between the simple and complete models, showing that their instability occurs consistently even when the parameters (i.e., γ, α, k<sub>1</sub>, k<sub>2</sub>, q<sub>1</sub>, q<sub>2</sub>, and [X<sub>c</sub>]) are assigned various values for all nodes and regulations, searched by the Monte Carlo method (Fig. S5).

      (4) Sensitivity of Numerical Solutions to Initial Conditions: Are the numerical solutions in both models sensitive to the chosen initial condition? What results do the models provide if uniform initial distributions were utilised instead?

      We sincerely thank the editor(s) and referee(s) for the comments!

      To investigate both the simple network and the realistic network consisting of various node numbers and regulatory pathways [Goehring et al., Science, 2011; Lang et al., Development, 2017], we propose a computational pipeline for numerical exploration of the dynamics of a given reaction-diffusion network, specifically targeting the maintenance phase of stable cell polarization after its initial establishment [Motegi et al., Nat. Cell Biol., 2011; Goehring et al., Science, 2011; Seirin-Lee et al., Cells, 2020].

      Now we have added new simulations and explanations for the sensitivity of numerical solutions to initial conditions. For both models, a uniform initial distribution leads to a homogeneous pattern while a Gaussian noise distribution leads to a multipolar pattern. In contrast, an initial polarized distribution (even with shifts in transition planes, weak polarization, or asymmetric curve shapes between the two molecular species) can maintain cell polarization reliably.

      (5) Initial Conditions and Stability Tests: In Figure 1, the authors discuss the stability of the basic two-node network (a) upon modifications in (b-d). The stability test is performed through a pipeline procedure in which they always start from a polarised pattern described by Equation (4) and observe how the pattern evolves over time. It would be beneficial to explore whether the stability test depends on this specific initial condition. For instance, what would happen if the posterior molecules have an initial distribution of 1/(1+e^(-10x)), which is not exactly symmetric with respect to the anterior molecules' distribution of 1-1/(1+e^(-20x))? Additionally, if the initial polarisation is not as strong, for example, with the anterior molecules having a distribution of 10-1/(1+e^(-20x)) and the posterior molecules having a distribution of 9+1/(1+e^(-20x)), how would this affect the results?

      We sincerely thank the editor(s) and referee(s) for the constructive advice!

      Now we have added comprehensive comparisons between the simple and complete models showing that they respond consistently to alternative initial conditions (Fig. S4, Fig. S9). A successful cell polarization pattern requires an initial polarized pattern, but its subsequent stability and response to perturbation depend very little on the specific form of the initial polarized pattern. All the conditions mentioned by the reviewer have been included.
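
      The alternative initial conditions raised by the reviewer can be written down directly. The sketch below only constructs the profiles and is not the authors' simulation code: the spatial domain x ∈ [-0.5, 0.5], the grid resolution, the symmetric reference case (taken here as the posterior counterpart of the anterior profile quoted by the reviewer), and the helper names are all assumptions.

```python
import numpy as np


def sigmoid(x, steepness):
    """Logistic profile 1 / (1 + exp(-steepness * x))."""
    return 1.0 / (1.0 + np.exp(-steepness * x))


# Assumed domain and resolution; the interface sits at x = 0.
x = np.linspace(-0.5, 0.5, 201)

# Each entry is (anterior profile, posterior profile).
initial_conditions = {
    # Assumed symmetric reference case.
    "symmetric": (1.0 - sigmoid(x, 20), sigmoid(x, 20)),
    # Reviewer's asymmetric case: shallower posterior profile.
    "asymmetric": (1.0 - sigmoid(x, 20), sigmoid(x, 10)),
    # Reviewer's weakly polarized case: small step on a large baseline.
    "weak": (10.0 - sigmoid(x, 20), 9.0 + sigmoid(x, 20)),
}


def relative_contrast(profile):
    """(max - min) / (max + min): about 1 if strongly polarized, about 0 if flat."""
    return (profile.max() - profile.min()) / (profile.max() + profile.min())


contrasts = {
    name: relative_contrast(anterior)
    for name, (anterior, _) in initial_conditions.items()
}
```

      The relative contrast is one simple way to quantify how weakly polarized the third case is compared with the first two; each pair of profiles would serve as the starting state of a simulation.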

      (6) Stability Analysis: Throughout the paper, the authors discuss the stability of the polarised pattern. The stability is checked by an exhaustive search of the parameter space, ensuring the system reaches a steady state with a polarised pattern instead of a homogeneous pattern. It would be beneficial to explore if this stability is related to a linear stability analysis of the model parameters, similar to what was conducted in Reference [18], which can determine if a homogeneous state exists and whether it is stable or unstable. Including such an analysis could provide deeper insights into the system's stability and validate its robustness.

      We sincerely thank the editor(s) and referee(s) for the comments!

      We agree that linear stability analysis can potentially offer additional insights into polarized pattern behavior. However, this approach often requires the aid of numerical solutions and is therefore not entirely independent [Goehring et al., Science, 2011]. Over the past decade, numerical simulations have consistently proven to be a reliable and sufficient approach for studying network dynamics, spanning from C. elegans cell polarization [Tostevin et al., Biophys. J., 2008; Blanchoud et al., Biophys. J., 2015; Seirin-Lee, Dev. Growth Differ., 2020] to topics in metazoans [Chau et al., Cell, 2012; Qiao et al., eLife, 2022; Sokolowski et al., arXiv, 2023]. Numerous purely numerical studies have successfully unveiled principles that help interpret [Ma et al., Cell, 2009] and synthesize [Chau et al., Cell, 2012] real biological systems, independent of additional mathematical analysis. Thus, we leverage our numerical framework to address the cell polarization problems in this paper.
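
      For readers unfamiliar with the analysis the reviewer refers to, the sketch below shows the generic linear stability calculation for a two-species reaction-diffusion system: a perturbation with wavenumber k around a homogeneous steady state grows at a rate given by the eigenvalues of J - k²D. The Jacobian and diffusion coefficients are hypothetical textbook-style values chosen for illustration; they are not derived from the model in this paper.

```python
import numpy as np


def growth_rate(jacobian, diffusion, wavenumber):
    """Largest real part of the eigenvalues of J - k^2 * diag(D).

    A positive value means a perturbation with this wavenumber grows,
    i.e., the homogeneous steady state is linearly unstable to it.
    """
    matrix = jacobian - wavenumber**2 * np.diag(diffusion)
    return np.max(np.linalg.eigvals(matrix).real)


# Hypothetical Jacobian of the reaction terms at a homogeneous steady
# state (activator-inhibitor type) and hypothetical diffusion coefficients.
J = np.array([[1.0, -1.0],
              [2.0, -1.5]])
D = np.array([0.01, 1.0])

wavenumbers = np.linspace(0.0, 15.0, 301)
rates = np.array([growth_rate(J, D, k) for k in wavenumbers])
unstable_band = wavenumbers[rates > 0]  # wavenumbers that would grow
```

      With these illustrative values the homogeneous state is stable to uniform perturbations (k = 0) but unstable over a band of finite wavenumbers. Applying the same calculation to a specific network requires the Jacobian at that network's homogeneous steady state, which generally has to be found numerically.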

      To confirm the reliability of the stability checked by an exhaustive search of the parameter space, now we reproduce the qualitative and semi-quantitative phenomena in three more previously published experimental groups (Section 2.5; Fig. S8) [Gotta et al., Curr. Biol., 2001; Aceto et al., Dev. Biol., 2006]. Combined with the original experiments (Section 2.5; Fig. 5; Fig. S7) [Hoege et al., Curr. Biol., 2010; Beatty et al., Development, 2010; Beatty et al., Development, 2013], we reproduce five experimental groups in total (two acting on LGL-1 and three acting on CDC-42), comprising eight perturbed conditions and using wild-type as the reference.

      To confirm the robustness of our conclusions regarding the system's stability, now we add comprehensive comparisons between the simple and complete models regarding: 1. How they respond consistently to alternative initial conditions (Fig. S4; Fig. S9). 2. How they respond consistently to alternative single modifications, even when the parameters (i.e., γ, α, k<sub>1</sub>, k<sub>2</sub>, q<sub>1</sub>, q<sub>2</sub>, and [X<sub>c</sub>]) are assigned various values for all nodes and regulations (Fig. S5).

      (7) Interface Position Determination: In Figure 4, the authors demonstrate that by using a spatially varied parameter, the position of the interface can be tuned. Particularly, the interface is almost located at the step where the parameter has a sharp jump. However, in the case of a homogeneous parameter (e.g., Figure 4(a)), the system also reaches a stable polarised pattern with the interface located in the middle (x = 0), similar to Figure 4(b), even though the homogeneous parameter does not contain any positional information of the interface. It would be helpful to clarify the difference between Figure 4(a) and Figure 4(b) in terms of the interface position determination.

      We sincerely thank the editor(s) and referee(s) for the comments!

      The case of a homogeneous parameter (e.g., Fig. 4a), in which the system also reaches a stable polarised pattern with the interface located in the middle (x = 0), is just a reference adopted from Fig. 1a to show that the inhomogeneous positional information in Fig. 4b can achieve a similar stable polarised pattern.

      Now we clarify the interface position determination in Section 2.4 to improve readability. Moreover, the interface is marked with a grey dashed line in all the patterns in Fig. 4 and Fig. 6 to highlight the importance of inhomogeneous parameters for interface localization.

      (8) Presented Comparison with Experimental Observations: The comparison with experimental observations lacks clarity. It isn't clear that the model "faithfully recapitulates" the experimental observations (lines 369-370). We recommend discussing and showing these comparisons more carefully, highlighting the expectations and similarities.

      We sincerely thank the editor(s) and referee(s) for the constructive suggestion!

      Now we remove the word “faithfully” and highlight the expectations and similarities of each experimental group by describing “cell polarization pattern characteristics in simulation: …”.

      (9) Validation of Model with Experimental Data: Given the extensive number of model parameters and the uncertainty of their values, it is essential for the authors to validate their model by comparing their results with experimental data. While C. elegans polarisation has been extensively studied, the authors have yet to utilise existing data for parameter estimation and model validation. Doing so would considerably strengthen their study.

      We sincerely thank the editor(s) and referee(s) for the constructive suggestion!

      To utilise existing data for parameter estimation, now we add a new section, Parameter Nondimensionalization and Order of Magnitude Consistency, to the Supplemental Text. In this section, we describe how we adopted the parameter nondimensionalization and value assignments from previous works [Goehring et al., J. Cell Biol., 2011; Goehring et al., Science, 2011; Seirin-Lee et al., Cells, 2020]. We list four examples (i.e., evolution time, membrane diffusion coefficient, basal off-rate, and inhibition intensity) to show the consistency in order of magnitude between numerical and realistic values.

      To utilise existing data for model validation, now we reproduce the qualitative and semi-quantitative phenomena in three more previously published experimental groups (Section 2.5; Fig. S8) [Gotta et al., Curr. Biol., 2001; Aceto et al., Dev. Biol., 2006]. Combined with the original experiments (Section 2.5; Fig. 5; Fig. S7) [Hoege et al., Curr. Biol., 2010; Beatty et al., Development, 2010; Beatty et al., Development, 2013], we reproduce five experimental groups in total (two acting on LGL-1 and three acting on CDC-42), comprising eight perturbed conditions and using wild-type as the reference.

      Also, we acknowledge the limitations of the current cell polarization model and provide, in 3. Discussion and conclusion, a detailed outline of potential model improvements. The limitations include, but are not limited to, issues involving the “extensive number of model parameters” and the “uncertainty of their values”, both of which rely on experimental measurements of biological information. However, comprehensive experimental measurement data on every molecular species, their interactions, and each species’ intensity distribution in space and time were not fully available from prior research. Refinement is lacking for some of these interactions, potentially requiring years of additional experimentation. Moreover, for certain species at specific developmental stages, only relative (rather than absolute) intensity measurements are available. We agree that such information is essential for establishing a more usable model and discuss it thoroughly in 3. Discussion and conclusion. From a theoretical perspective, we adopted assumptions from the previous literature and constructed a minimal model for a specific cell polarization phase to investigate the network's robustness, supported by five experimental groups and eight perturbed conditions with wild-type as a reference in the C. elegans embryo.

      (10) Enhancing Model Accuracy by Considering Cortical Flows: The authors are encouraged to include cortical flows in their cell polarisation model, as these flows are known to be pivotal in the process. Although the current model successfully predicts cell polarisation without accounting for cortical flows, research has demonstrated their significant role in polarisation formation. By incorporating cortical flows, the model would provide a more thorough and precise representation of the biological process. Furthermore, previous studies, such as those by Goehring et al. (References 17 and 18), highlight the importance of convective actin flow in initiating polarisation. It would be valuable for the authors to address the contribution of convection with actin flow to the establishment of the polarisation pattern. The polarisation of the C. elegans zygote progresses through two distinct phases: establishment and maintenance, both heavily influenced by actomyosin dynamics. Works by Munro et al. (Dev Cell 2004), Shivas & Skop (MBoC 2012), Liu et al. (Dev. Biol. 2010), and Wang et al. (Nat Cell Biol 2017) underscore the critical roles of myosin and actin in orchestrating the localisation of PAR proteins during cell polarisation. To enhance the fidelity of their model, we recommend that the authors either integrate cortical flows and consider the effects driven by myosin and actin, or provide a discussion on the repercussions of omitting these dynamics.

      We sincerely thank the editor(s) and referee(s) for the comment!

      Indeed, previous research highlighted the importance of convective cortical flow in orchestrating the localisation of PAR proteins during the establishment phase of polarisation formation [Goehring et al., J. Cell Biol., 2011; Rose et al., WormBook, 2014; Beatty et al., Development, 2013]. However, during the maintenance phase, the non-muscle myosin II (NMY-2) is regulated downstream by the PAR protein network rather than serving as the primary upstream factor controlling PAR protein localization. While some theoretical studies integrated both reaction-diffusion dynamics and the effects of myosin and actin [Tostevin et al., Biophys. J., 2008; Goehring et al., Science, 2011], others focused exclusively on reaction-diffusion dynamics [Dawes et al., Biophys. J., 2011; Seirin-Lee et al., Cells, 2020]. Now we clarify the distinction between the establishment and maintenance phases, emphasize our research focus on the reaction-diffusion dynamics during the maintenance phase, and provide a discussion of these omitted dynamics to foster a more comprehensive understanding in the future, as suggested.

      (11) Further Justification of Network Interactions: The authors should provide additional explanations, supported by empirical evidence, for the network interactions assumed in their model. This includes both node-node interactions and the rationale behind protein complex formations. Some of the proposed interactions lack empirical validation, as noted in studies such as Gubieda et al., Phil. Trans. R. Soc. B 2020. Additionally, discrepancies in protein intensity distributions, as observed in Wang et al., Nat Cell Biol 2017, should be addressed, particularly concerning the consideration of the PAR-3/PAR-6/PKC-3 complex as a single entity. Justifying these choices is crucial for ensuring the model's credibility and alignment with experimental findings.

      We sincerely thank the editor(s) and referee(s) for the helpful advice!

      Consistent with previous modeling efforts [Goehring et al., Science, 2011; Gross et al., Nat. Phys., 2019; Lim et al., Cell Rep., 2021], our model treats the PAR-3/PAR-6/PKC-3 complex as a single entity for simplification, thus neglecting the potentially distinct spatial distributions of each individual molecular species.

      Now we acknowledge the limitations of the current cell polarization model and provide, in 3. Discussion and conclusion, a detailed outline of potential model improvements. The limitations include, but are not limited to, issues involving “node-node interactions” and “discrepancies in protein intensity distributions”, both of which rely on experimental measurements of biological information. However, comprehensive experimental measurement data on every molecular species, their interactions, and each species’ intensity distribution in space and time were not fully available from prior research. Refinement is lacking for some of these interactions, potentially requiring years of additional experimentation. Moreover, for certain species at specific developmental stages, only relative (rather than absolute) intensity measurements are available. We agree that such information is essential for establishing a more usable model and discuss it thoroughly in 3. Discussion and conclusion.

      To ensure the model's credibility and alignment with experimental findings, now we reproduce the qualitative and semi-quantitative phenomena in three more previously published experimental groups (Section 2.5; Fig. S8) [Gotta et al., Curr. Biol., 2001; Aceto et al., Dev. Biol., 2006]. Combined with the original experiments (Section 2.5; Fig. 5; Fig. S7) [Hoege et al., Curr. Biol., 2010; Beatty et al., Development, 2010; Beatty et al., Development, 2013], now we have reproduced five experimental groups in total (two acting on LGL-1 and three on CDC-42), comprising eight perturbed conditions and using wild-type as the reference.

      (12) Further Justification of Node-Node Network Interactions: The authors should provide further justification for the node-node network interactions assumed in their study. To the best of our knowledge, some of the node-node interactions proposed have not yet been empirically demonstrated. Providing additional explanations for these interactions would enhance the credibility of the model and ensure its alignment with empirical evidence.

      We sincerely thank the editor(s) and referee(s) for the helpful advice!

      Now we acknowledge the limitations of the current cell polarization model and provide, in 3. Discussion and conclusion, a detailed outline of potential model improvements. The limitations include, but are not limited to, issues involving “node-node network interactions”, which rely on experimental measurements of biological information. However, comprehensive experimental measurement data on every molecular species, their interactions, and each species’ intensity distribution in space and time were not fully available from prior research. Refinement is lacking for some of these interactions, potentially requiring years of additional experimentation. Moreover, for certain species at specific developmental stages, only relative (rather than absolute) intensity measurements are available. We agree that such information is essential for establishing a more usable model and discuss it thoroughly in 3. Discussion and conclusion.

      To enhance the credibility of the model and ensure its alignment with empirical evidence, we reproduced the qualitative and semi-quantitative phenomena in three more previously published experimental groups (Section 2.5; Fig. S8) [Gotta et al., Curr. Biol., 2001; Aceto et al., Dev. Biol., 2006]. Combined with the original experiments (Section 2.5; Fig. 5; Fig. S7) [Hoege et al., Curr. Biol., 2010; Beatty et al., Development, 2010; Beatty et al., Development, 2013], now we have reproduced five experimental groups in total (two acting on LGL-1 and three on CDC-42), comprising eight perturbed conditions and using wild-type as the reference.

      (13) Justification for Network Interactions and Protein Complexes: The authors must provide clear justifications, supported by references, for each network interaction between nodes in the five-node model. Some of the activatory/inhibitory signals proposed lack empirical validation, such as CDC-42 directly inhibiting CHIN-1. The provided Table S2 is insufficient to justify these interactions, necessitating additional explanations. Reviewing relevant literature, such as the work by Gubieda et al., Phil. Trans. R. Soc. B 2020, may offer insights into similar node networks. Furthermore, the authors should address discrepancies in protein intensity distributions, as observed in studies like Wang et al., Nat Cell Biol 2017. Specifically, the authors consider the PAR-3/PAR-6/PKC-3 complex as a single entity despite potential differences in their distributions. Justification for this choice is essential, particularly considering the importance of clustering dynamics during cell polarisation, as demonstrated by Wang et al., Nat Cell Biol 2017, and Dawes & Munro, Biophys J 2011.

      We sincerely thank the editor(s) and referee(s) for the helpful advice!

      Consistent with previous modeling efforts [Goehring et al., Science, 2011; Gross et al., Nat. Phys., 2019; Lim et al., Cell Rep., 2021], our model treats the PAR-3/PAR-6/PKC-3 complex as a single entity for simplification, thus neglecting the potentially distinct spatial distributions of each individual molecular species. Besides, the inhibition of CHIN-1 by CDC-42, which recruits cytoplasmic PAR-6/PKC-3 to form a complex, may act indirectly to restrict CHIN-1 localization through phosphorylation [Sailer et al., Dev. Cell, 2015; Lang et al., Development, 2017].

      Now we acknowledge the limitations of the current cell polarization model and provide, in 3. Discussion and conclusion, a detailed outline of potential model improvements. The limitations include, but are not limited to, issues involving “each network interaction between nodes in the five-node model” and “discrepancies in protein intensity distributions”, both of which rely on experimental measurements of biological information. However, comprehensive experimental measurement data on every molecular species, their interactions, and each species’ intensity distribution in space and time were not fully available from prior research. Refinement is lacking for some of these interactions, potentially requiring years of additional experimentation. Moreover, for certain species at specific developmental stages, only relative (rather than absolute) intensity measurements are available. We agree that such information is essential for establishing a more usable model and discuss it thoroughly in 3. Discussion and conclusion. From a theoretical perspective, we adopted assumptions from the previous literature and constructed a minimal model for a specific cell polarization phase to investigate the network's robustness, supported by five experimental groups and eight perturbed conditions with wild-type as a reference in the C. elegans embryo.

      (14) Incorporating Cytoplasmic Dynamics into the Model: The authors assume infinite cytoplasmic diffusion and neglect the role of cytoplasmic flows in cell polarity, which may oversimplify the model. Finite cytoplasmic diffusion combined with flows could potentially compromise the stability of anterior-posterior molecular distributions, affecting the accuracy of the model's predictions. The authors claim a significant difference between cytoplasmic and membrane diffusion coefficients, but the actual disparity seems smaller based on data from Petrášek et al., Biophys. J. 2008. For example, cytosolic diffusion coefficients for NMY-2 and PAR-2 differ by less than one order of magnitude. Additionally, the strength of cytoplasmic flows, as quantified by studies such as Cheeks et al., Curr Biol 2004, should be considered when assessing the impact of cytoplasmic dynamics on polarity stability. Incorporating finite cytoplasmic diffusion and cytoplasmic flows into the model could provide a more realistic representation of cellular dynamics and enhance the model's predictive power.

      We sincerely thank the editor(s) and referee(s) for the comment!

      Cytoplasmic and membrane diffusion coefficients differ by two orders of magnitude according to previous experimental measurements on PAR-2 and PAR-6 [Goehring et al., J. Cell Biol., 2011; Lim et al., Cell Rep., 2021]. Many previous C. elegans cell polarization models have incorporated a mass-conservation model combined with finite cytoplasmic diffusion, but this model description can lead to a reversed spatial concentration distribution between the cell membrane and cytosol [Fig. 3 of Seirin-Lee et al., J. Theor. Biol., 2016; Fig. 2ab of Seirin-Lee et al., J. Math. Biol., 2020], contradicting experimental observations [Fig. 4A of Sailer et al., Dev. Cell, 2015; Fig. 1A of Lim et al., Cell Rep., 2021]. This implies that the infinite cytoplasmic diffusion, without precise experiment-based parameter assignment or accounting for other hidden biological processes (e.g., protein production and degradation), may be inappropriate in modeling the real spatial concentration distributions distinguished between the cell membrane and cytosol. To address this issue, some theoretical research incorporated protein production and degradation into the model to obtain a consistent spatial concentration distribution between the cell membrane and cytosol [Tostevin et al., Biophys. J., 2008]. More definitive experimental data on the spatiotemporal changes in protein diffusion, production, and degradation are essential for providing a more realistic representation of cellular dynamics and enhancing the model's predictive power.

      Cytoplasmic flows indeed play a non-negligible role in cell polarity during the establishment phase [Kravtsova et al., Bull. Math. Biol., 2014], in which a spatial gradient of actomyosin contractility is created and PAR-3/PKC-3/PAR-6 is directed to the anterior membrane by cortical flow [Rose et al., WormBook, 2014; Lang et al., Development, 2017]. However, during the maintenance phase, the non-muscle myosin II (NMY-2) is regulated downstream by the PAR protein network rather than serving as the primary upstream factor controlling PAR protein localization [Goehring et al., J. Cell Biol., 2011; Rose et al., WormBook, 2014; Geßele et al., Nat. Commun., 2020]. While some theoretical studies integrated both reaction-diffusion dynamics and the effects of myosin and actin [Tostevin et al., Biophys. J., 2008; Goehring et al., Science, 2011], others focused exclusively on reaction-diffusion dynamics [Dawes et al., Biophys. J., 2011; Seirin-Lee et al., Cells, 2020]. We now emphasize our research focus on the reaction-diffusion dynamics during the maintenance phase, so the dynamics between NMY-2 and PAR-2 are not included. We have also provided a discussion of the simplified cytoplasmic diffusion and omitted cytoplasmic flows to foster a more comprehensive understanding in the future.

      (15) Explanation of Lethality References: On page 13, the authors mention lethality without adequately explaining why they are drawing connections with lethality experimental data.

      We sincerely thank the editor(s) and referee(s) for the comment!

      It is well known that loss of cell polarity in the C. elegans zygote leads to symmetric cell division, which results in a more symmetric allocation of molecular-to-cellular contents between the daughter cells; this causes abnormal cell size, cell cycle length, and cell fate in the daughter cells, followed by embryo lethality [Beatty et al., Development, 2010; Beatty et al., Development, 2013; Rodriguez et al., Dev. Cell, 2017; Jankele et al., eLife, 2021]. Now we explain why we are drawing connections with lethality experimental data in Section 2.5.

      (16) Improved Abstract: "...However, polarity can be restored through a combination of two modifications that have opposing effects..." This sentence could be revised for better clarity. For example, the authors could consider rephrasing it as follows: "...However, polarity restoration can be achieved by combining two modifications with opposing effects...".

      We sincerely thank the editor(s) and referee(s) for helpful advice!

      Now we revise the abstract as follows:

      “Abstract – However, polarity restoration can be achieved by combining two modifications with opposing effects.”

      (17) Conservation of Mass in Network Models: Is conservation of mass satisfied in their network models?

      We sincerely thank the editor(s) and referee(s) for the comment!

      While previous experiments provide evidence for near-constant protein mass during the establishment phase [Goehring et al., Science, 2011], whether this is consistent until the end of maintenance is unclear.

      Many previous C. elegans cell polarization models have assumed mass conservation on the cell membrane and in the cell cytosol; this model description can lead to a reversed spatial concentration distribution between the cell membrane and cytosol [Fig. 3 of Seirin-Lee et al., J. Theor. Biol., 2016; Fig. 2ab of Seirin-Lee et al., J. Math. Biol., 2020], contradicting experimental observations [Fig. 4A of Sailer et al., Dev. Cell, 2015; Fig. 1A of Lim et al., Cell Rep., 2021]. This implies that mass conservation may be inappropriate in modeling the real spatial concentration distributions distinguished between the cell membrane and cytosol. To address this issue, some theoretical research incorporated protein production and degradation into their model, instead of assuming mass conservation [Tostevin et al., Biophys. J., 2008]. More definitive experimental data on the spatiotemporal changes in protein mass are essential for constructing a more accurate model.

      Given the absence of a universally accepted model in agreement with experimental observation, we adopted the assumption that the concentration of molecules in the cytosol (not the total mass on the cell membrane and in the cell cytosol) is spatially homogeneous and temporally constant, which was also used before [Kravtsova et al., Bull. Math. Biol., 2014]. In the context of this well-mixed constant cytoplasmic concentration, our model successfully reproduced the cell polarization phenotype in wild-type and eight perturbed conditions (Section 2.5; Fig. S7; Fig. S8), supporting the validity of this simplified, yet effective, model. Now we have provided a discussion of the protein mass assumption to foster a more comprehensive understanding in the future.

      (18) Comparison of Network Structures: In Figure 1c, the authors demonstrate that the symmetric two-node network is susceptible to single-sided additional regulation. They considered four subtypes of modifications, depending on whether [L] is in the anterior or posterior and whether [A] and [L] are mutually activating or inhibiting. What is the difference between the structure where [L] is in the anterior and in the posterior? Upon comparing the time evolution of the left panel ([L] is sided with [P]) and the right panel ([L] is sided with [A]), the difference is so tiny that they are almost indistinguishable. It might be beneficial for the authors to provide a clearer explanation of the differences between these network structures to aid in understanding their implications.

      We sincerely thank the editor(s) and referee(s) for the constructive suggestion!

      The difference between the structures where [L] is in the anterior and posterior is the initial spatial concentration distribution of [L], which is polarized to have a higher concentration in the anterior and posterior respectively. The time evolution of the left panel ([L] is sided with [P]) and the right panel ([L] is sided with [A]) is almost indistinguishable because the perturbation from [L] is slight (smaller by more than one order of magnitude) compared to the predominant [A]~[P] interaction ( for [A]~[P] mutual inhibition while for [A]~[L] mutual inhibition and for [A]~[L] mutual activation), highlighting the response of the cell polarization pattern. To aid the readers in understanding their implications, we have added [L] and plotted the spatial concentration distribution of all three molecular species at t=0, 100, 200, 300, 400 and 500 in Fig. S3, where the difference between the [L] distributions in the left and right panels is clearly visible.

      (19) Figure Reference: In line 308, Fig. 4a is referenced when explaining the loss of pattern stability by modifying an individual parameter, but this is not shown in that panel. Please update the panel or adjust the reference in the main text.

      We sincerely thank the editor(s) and referee(s) for pointing out this problem!

      Fig. 4 focuses on the regulatable shift of the zero-velocity interface by modifying a pair of individual parameters, not on the loss (or recovery) of pattern stability, which has been analyzed as a focus in Fig. 1, Fig. 2, and Fig. 3. Fig. 4a is actually from the same simulation as the one in Fig. 1a, which has spatially uniform parameters used as a reference in Fig. 4. The individual parameter modification in other subfigures of Fig. 4 shows how the zero-velocity interface is shifted in a regulatable manner always in the context of pattern stability. Now we update the panel, adjust the reference, add one more paragraph, and improve the wording to clarify how the analyses in Fig. 4 are carried out on top of the pattern stability already studied.

      (20) Viable Parameter Sets: In line 355, the number of viable parameter sets (602) is not very informative by itself. We suggest reporting the fraction or percentage of sets tested that resulted in viable results instead. This applies similarly to lines 411 and 468.

      We sincerely thank the editor(s) and referee(s) for the constructive comment!

      Now the fraction/percentage of parameter sets tested that resulted in viable results are added everywhere the number appears.

      (21) Perturbation Experiments: In lines 358-359, "the perturbation experiments" implies that those considered are the only possible ones. Please rephrase to clarify.

      We sincerely thank the editor(s) and referee(s) for the helpful advice!

      Now we rephrase three paragraphs to clarify why the perturbation experiments involved with [L] and [C] are considered instead of other possible ones.

      (22) Figure 2S: This figure is unclear. The caption states that panel (a) shows the "final concentration distribution," but only a line is shown. If "distribution" refers to spatial distribution, please clarify which parameters are shown.

      We sincerely thank the editor(s) and referee(s) for pointing out this problem!

      Now we clarify the “spatial concentration distribution” and which parameters are shown in the figure caption.

      (23) Figure 5 and 6 Captions: The captions for Figures 5 and 6 could benefit from clarification for better understanding.

      We sincerely thank the editor(s) and referee(s) for the constructive suggestion!

      Now we clarify the details in the captions of Fig. 5 and Fig. 6 for better understanding.

      (24) Figure 5 Legend: The legend on the bottom right corner of Figure 5 is unclear. Please specify to which panel it refers.

      We sincerely thank the editor(s) and referee(s) for the constructive suggestion!

      Now we clarify to which the legend on the bottom right corner of Fig. 5 refers.

      (25) L and A~C Interactions: In paragraphs 405-418, please explain why the L and A~C interactions are removed for the comparison instead of others.

      We sincerely thank the editor(s) and referee(s) for the constructive suggestion!

      Now we add a separate paragraph and a supplemental figure to explain why the L and A~C interactions are removed for the comparison instead of others.

      (26) Network Structures in Figure S3: From the "34 possible network structures" considered in Figure S3 (lines 440-441), why are the "null cases" (L disconnected from the network) relevant? Shouldn't only 32 networks be considered?

      We sincerely thank the editor(s) and referee(s) for pointing out this problem!

      Now the two “null cases” are removed.

      (27) Figure S3 Caption: The caption must state that the position of the nodes (left or right) implies the polarisation pattern. Additionally, with the current size of the figure, the dashed lines are extremely hard to differentiate from the continuous lines.

      We sincerely thank the editor(s) and referee(s) for the constructive suggestion!

      Now we state that the position of the nodes (left or right) implies the polarization pattern. Additionally, we have modified the figure size and dashed lines so that the dashed lines are adequately distinguishable from the continuous lines.

      (28) Equation #7: It is confusing to use P as the number of independent simulations when P is also one of the variables/species in the network. Please consider using different notation.

      We sincerely thank the editor(s) and referee(s) for the helpful advice!

      Now we replace the P in current Equation #8 with Q and the P in current Equation #10 with W.

      (29) Use of "Detailed Balance": The authors used the term "detailed balance" to describe the intricate balance between the two groups of proteins when forming a polarised pattern. However, "detailed balance" is a term with a specific meaning in thermodynamics. Breaking detailed balance is a feature of nonequilibrium systems, and the polarisation phenomenon is evidently a nonequilibrium process. Using the term "detailed balance" may cause confusion, especially for readers with a physics background. It might be advisable to reconsider the terminology to avoid potential confusion and ensure clarity for readers.

      We sincerely thank the editor(s) and referee(s) for the constructive suggestion!

      To avoid potential confusion and ensure clarity for readers, now we replace “detailed balance” with “balance”, “required balance”, or “interplay” regarding different contexts.

      (30) Terminology: The word "molecule" is used where "molecular species" would be more appropriate, e.g., lines 456 and 551. Please revise these instances.

      We sincerely thank the editor(s) and referee(s) for the constructive suggestion!

      Now we replace “molecule” with “molecular species” throughout, as suggested.

      (31) Section 2.5: This section is confusing. It isn't clear where the "method outlined" (line 464) is nor what "span an iso-velocity surface at vanishing speed" means in line 470. The sentence in lines 486-488, "An expression similar to Eq. 8 enables quantitative prediction...", is too vague. Please clarify these points and specify what the "similar expression" is and where it can be found.

      We sincerely thank the editor(s) and referee(s) for the constructive suggestion!

      Now we clarify these points and specify the terms as suggested.

      (32) Software Mention: The software is only mentioned in the abstract and conclusions. It should also be mentioned where the computational pipeline is described, and the instructions available in the supplementary information need to be referenced in the main text.

      We sincerely thank the editor(s) and referee(s) for pointing out this problem!

      Now we mention the software where the computational pipeline is described and reference the instructions available in the Supplemental Text.

      (33) Supplementary Material References: Several parts of the supplementary material are never referenced in the main text, including Figure S1, Movies S3-S4, and the Instructions for PolarSim. Please reference these in the main text to clarify their relevance and how they fit with the manuscript's narrative.

      We sincerely thank the editor(s) and referee(s) for pointing out this problem!

      Now we add all the missing references for supplementary materials to the main text properly.

    1. Author response:

      The following is the authors’ response to the previous reviews.

      eLife assessment

      In this study, Ger and colleagues present a valuable new technique that uses recurrent neural networks to distinguish between model misspecification and behavioral stochasticity when interpreting cognitive-behavioral model fits. Evidence for the usefulness of this technique, which is currently based primarily on a relatively simple toy problem, is considered incomplete but could be improved via comparisons to existing approaches and/or applications to other problems. This technique addresses a long-standing problem that is likely to be of interest to researchers pushing the limits of cognitive computational modeling.

      Public Reviews:

      Reviewer #1 (Public Review):

      Summary:

      Ger and colleagues address an issue that often impedes computational modeling: the inherent ambiguity between stochasticity in behavior and structural mismatch between the assumed and true model. They propose a solution to use RNNs to estimate the ceiling on explainable variation within a behavioral dataset. With this information in hand, it is possible to determine the extent to which "worse fits" result from behavioral stochasticity versus failures of the cognitive model to capture nuances in behavior (model misspecification). The authors demonstrate the efficacy of the approach in a synthetic toy problem and then use the method to show that poorer model fits to 2-step data in participants with low IQ are actually due to an increase in inherent stochasticity, rather than systemic mismatch between model and behavior.

      Strengths:

      Overall I found the ideas conveyed in the paper interesting and the paper to be extremely clear and well-written. The method itself is clever and intuitive and I believe it could be useful in certain circumstances, particularly ones where the sources of structure in behavioral data are unknown. In general, the support for the method is clear and compelling. The flexibility of the method also means that it can be applied to different types of behavioral data - without any hypotheses about the exact behavioral features that might be present in a given task.

      Thank you for taking the time to review our work and for the positive remarks regarding the manuscript. Below is a point-by-point response to the concerns raised.

      Weaknesses:

      That said, I have some concerns with the manuscript in its current form, largely related to the applicability of the proposed methods for problems of importance in computational cognitive neuroscience. This concern stems from the fact that the toy problem explored in the manuscript is somewhat simple, and the theoretical problem addressed in it could have been identified through other means (for example through the use of posterior predictive checking for model validation), and the actual behavioral data analyzed were interpreted as a null result (failure to reject the behavioral stochasticity hypothesis), rather than actual identification of model misspecification. I expand on these primary concerns and raise several smaller points below.

      A primary question I have about this work is whether the method described would actually provide any advantage for real cognitive modeling problems beyond what is typically done to minimize the chance of model misspecification (in particular, post-predictive checking). The toy problem examined in the manuscript is pretty extreme (two of the three synthetic agents are very far from what a human would do on the task, and the models deviate from one another to a degree that detecting the difference should not be difficult for any method). The issue posed in the toy data would easily be identified by following good modeling practices, which include using posterior predictive checking over summary measures to identify model insufficiencies, which in turn would call for the need for a broader set of models (See Wilson & Collins 2019). Thus, I am left wondering whether this method could actually identify model misspecification in real world data, particularly in situations where standard posterior predictive checking would fall short. The conclusions from the main empirical data set rest largely on a null result, and the utility of a method for detecting model misspecification seems like it should depend on its ability to detect its presence, not just its absence, in real data.

      Beyond the question of its advantage above and beyond data- and hypothesis-informed methods for identifying model misspecification, I am also concerned that if the method does identify a model insufficiency, then you still would need to use these other methods in order to understand what aspect of behavior deviated from model predictions in order to design a better model. In general, it seems that the authors should be clear that this is a tool that might be helpful in some situations, but that it will need to be used in combination with other well-described modeling techniques (posterior predictive checking for model validation and guiding cognitive model extensions to capture unexplained features of the data). A general stylistic concern I have with this manuscript is that it presents and characterizes a new tool to help with cognitive computational modeling, but it does not really adhere to best modeling practices (see Collins & Wilson, eLife), which involve looking at data to identify core behavioral features and simulating data from best-fitting models to confirm that these features are reproduced. One could take away from this paper that you would be better off fitting a neural network to your behavioral data rather than carefully comparing the predictions of your cognitive model to your actual data, but I think that would be a highly misleading takeaway since summary measures of behavior would just as easily have diagnosed the model misspecification in the toy problem, and have the added advantage that they provide information about which cognitive processes are missing in such cases.

      As a more minor point, it is also worth noting that this method could not distinguish behavioral stochasticity from the deterministic structure that is not repeated across training/test sets (for example, because a specific sequence is present in the training set but not the test set). This should be included in the discussion of method limitations. It was also not entirely clear to me whether the method could be applied to real behavioral data without extensive pretraining (on >500 participants) which would certainly limit its applicability for standard cases.

      The authors focus on model misspecification, but in reality, all of our models are misspecified to some degree since the true process-generating behavior almost certainly deviates from our simple models (i.e., as George Box is frequently quoted, "all models are wrong, but some of them are useful"). It would be useful to have some more nuanced discussion of situations in which misspecification is and is not problematic.

      We thank the reviewer for these comments and have made changes to the manuscript to better describe these limitations. We agree with the reviewer and accept that fitting a neural network is by no means a substitute for careful and dedicated cognitive modeling. Cognitive modeling is aimed at describing the latent processes that are assumed to generate the observed data, and we agree that careful description of the data-generating mechanisms, including posterior predictive checks, is always required. However, even a well-defined cognitive model might still have little predictive accuracy, and it is difficult to know how many resources should be put into trying to test and develop new cognitive models to describe the data. We argue that RNNs can lead to some insights regarding this question, and highlight the following limitations that were mentioned by the reviewer:

      First, we accept that it is important to provide positive evidence for the existence of model misspecification. In that sense, a result where the network shows dramatic improvement over the best-fitting theoretical model is easier to interpret compared to when the network shows no (or very little) improvement in predictive accuracy. This is because there is always an option that the network, for some reason, was not flexible enough to learn the data-generating model, or because the data-generating mechanism has changed from training to test. We have now added this more clearly in the limitation section. However, when it comes to our empirical results, we would like to emphasize that the network did in fact improve the predictive accuracy for all participants. The result shows support in favor of a "null" hypothesis in the sense that we seem to find evidence that the change in predictive accuracy between the theoretical model and RNN is not systematic across levels of IQ. This allows us to quantify evidence (use Bayesian statistics) for no systematic model misspecification as a function of IQ. While it is always possible that a different model might systematically improve the predictive accuracy of low vs high IQ individuals' data, this seems less likely given the flexibility of the current results.  

      Second, we agree that our current study only applies to the RL models that we tested. In the context of RL, we have used a well-established and frequently applied paradigm and models. We emphasize in the discussion that simulations are required to further validate other uses for this method with other paradigms.  

      Third, we also accept that posterior predictive checks should always be capitalized on when possible, which is now emphasized in the discussion. However, we note that these are not always easy to interpret in a meaningful way and may not always provide details regarding model insufficiencies as described by the reviewer. It is very hard to determine what should be considered as a good prediction and since the generative model is always unknown, sometimes very low predictive accuracy can still be at the peak of possible model performance. This is because the data might be generated from a very noisy process, capping the possible predictive accuracy at a very low point. However, when strictly using theoretical modeling, it is very hard to determine what predictive accuracy to expect. Also, predictive checks are not always easy to interpret visually or otherwise. For example, in two-armed bandit tasks where there are only two actions, the prediction of choices is easier to understand in our opinion when described using a confusion matrix that summarizes the model's ability to predict the empirical behavior (which becomes similar to the predictive estimation we describe in eq 22).
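
      As an illustration only (this is not the analysis code from the manuscript, and it is not claimed to reproduce eq 22), a confusion matrix for a two-action task can be sketched as follows. The function names and the toy choice sequences are hypothetical; choices are assumed to be coded 0/1.

```python
import numpy as np

def confusion_matrix_2x2(observed, predicted):
    """Count (observed, predicted) pairs for binary choices coded 0/1.

    Rows index the observed choice, columns the model-predicted choice.
    """
    observed = np.asarray(observed, dtype=int)
    predicted = np.asarray(predicted, dtype=int)
    matrix = np.zeros((2, 2), dtype=int)
    # np.add.at accumulates correctly even when index pairs repeat
    np.add.at(matrix, (observed, predicted), 1)
    return matrix

def predictive_accuracy(matrix):
    """Fraction of trials on which the predicted choice matched the observed one."""
    return np.trace(matrix) / matrix.sum()

# Hypothetical toy sequences, for illustration only
observed = [0, 1, 1, 0, 1, 0, 1, 1]
predicted = [0, 1, 0, 0, 1, 1, 1, 1]
cm = confusion_matrix_2x2(observed, predicted)
accuracy = predictive_accuracy(cm)
```

      The diagonal counts the trials that were predicted correctly, so the trace divided by the total gives a single accuracy figure, while the off-diagonal cells show which action the model tends to mispredict.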

      Finally, this approach indeed requires a large dataset, with at least three sessions for each participant (training, validation, and test). Further studies might shed more light on the use of optimal epochs as a proxy for noise/complexity that can be used with less data (i.e., training and validation, without a test set).

      Please see our changes at the end of this document.  

      Reviewer #2 (Public Review):

      SUMMARY:

      In this manuscript, Ger and colleagues propose two complementary analytical methods aimed at quantifying the model misspecification and irreducible stochasticity in human choice behavior. The first method involves fitting recurrent neural networks (RNNs) and theoretical models to human choices and interpreting the better performance of RNNs as providing evidence of the misspecifications of theoretical models. The second method involves estimating the number of training iterations for which the fitted RNN achieves the best prediction of human choice behavior in a separate, validation data set, following an approach known as "early stopping". This number is then interpreted as a proxy for the amount of explainable variability in behavior, such that fewer iterations (earlier stopping) correspond to a higher amount of irreducible stochasticity in the data. The authors validate the two methods using simulations of choice behavior in a two-stage task, where the simulated behavior is generated by different known models. Finally, the authors use their approach in a real data set of human choices in the two-stage task, concluding that low-IQ subjects exhibit greater levels of stochasticity than high-IQ subjects.
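
      The "optimal epoch" idea summarized above can be sketched in a few lines. This is a minimal illustration that assumes a per-epoch validation loss is already available; the function name, the patience rule, and the toy loss values are hypothetical and are not taken from the manuscript.

```python
def optimal_epoch(validation_losses, patience=3):
    """Return (epoch, loss) of the best validation loss seen before stopping.

    Epochs are 1-indexed. Scanning stops once `patience` consecutive epochs
    have passed without improving on the best validation loss so far.
    """
    best_epoch, best_loss = 0, float("inf")
    for epoch, loss in enumerate(validation_losses, start=1):
        if loss < best_loss:
            best_epoch, best_loss = epoch, loss
        elif epoch - best_epoch >= patience:
            break  # no improvement for `patience` epochs: stop early
    return best_epoch, best_loss

# Hypothetical validation-loss curve, for illustration only
losses = [0.70, 0.66, 0.64, 0.65, 0.66, 0.67, 0.60]
stopped = optimal_epoch(losses, patience=3)
full_scan = optimal_epoch(losses, patience=10)
```

      Note that the answer depends on the patience setting: with a short patience the scan stops before the late improvement in the toy curve, which is one concrete way the optimal epoch can be sensitive to choices other than noise in the data.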

      STRENGTHS:

      The manuscript explores an extremely important topic to scientists interested in characterizing human decision-making. While it is generally acknowledged that any computational model of behavior will be limited in its ability to describe a particular data set, one should hope to understand whether these limitations arise due to model misspecification or due to irreducible stochasticity in the data. Evidence for the former suggests that better models ought to exist; evidence for the latter suggests they might not.

      To address this important topic, the authors elaborate carefully on the rationale of their proposed approach. They describe a variety of simulations - for which the ground truth models and the amount of behavioral stochasticity are known - to validate their approaches. This enables the reader to understand the benefits (and limitations) of these approaches when applied to the two-stage task, a task paradigm commonly used in the field. Through a set of convincing analyses, the authors demonstrate that their approach is capable of identifying situations where an alternative, untested computational model can outperform the set of tested models, before applying these techniques to a realistic data set.

      Thank you for reviewing our work and for the positive tone. Please find below a point-by-point response to the concerns you have raised.

      WEAKNESSES:

      The most significant weakness is that the paper rests on the implicit assumption that the fitted RNNs explain as much variance as possible, an assumption that is likely incorrect and which can result in incorrect conclusions. While in low-dimensional tasks RNNs can predict behavior as well as the data-generating models, this is not *always* the case, and the paper itself illustrates (in Figure 3) several cases where the fitted RNNs fall short of the ground-truth model. In such cases, we cannot conclude that a subject exhibiting a relatively poor RNN fit necessarily has a relatively high degree of behavioral stochasticity. Instead, it is at least conceivable that this subject's behavior is generated precisely (i.e., with low noise) by an alternative model that is poorly fit by an RNN - e.g., a model with long-term sequential dependencies, which RNNs are known to have difficulties in capturing.

      These situations could lead to incorrect conclusions for both of the proposed methods. First, the model misspecification analysis might show equal predictive performance for a particular theoretical model and for the RNN. While a scientist might be inclined to conclude that the theoretical model explains the maximum amount of explainable variance and therefore that no better model should exist, the scenario in the previous paragraph suggests that a superior model might nonetheless exist. Second, in the early-stopping analysis, a particular subject may achieve optimal validation performance with fewer epochs than another, leading the scientist to conclude that this subject exhibits higher behavioral noise. However, as before, this could again result from the fact that this subject's behavior is produced with little noise by a different model. Admittedly, the existence of such scenarios *in principle* does not mean that such scenarios are common, and the conclusions drawn in the paper are likely appropriate for the particular examples analyzed. However, it is much less obvious that the RNNs will provide optimal fits in other types of tasks, particularly those with more complex rules and long-term sequential dependencies, and in such scenarios, an ill-advised scientist might end up drawing incorrect conclusions from the application of the proposed approaches.

      Yes, we understand and agree. A negative result where RNN is unable to overcome the best fitting theoretical model would always leave room for doubt regarding the fact that a different approach might yield better results. In contrast, a dramatic improvement in predictive accuracy for RNN is easier to interpret since it implies that the theoretical model can be improved. We have made an effort to make this issue clear and more articulated in the discussion. We specifically and directly mention in the discussion that “Equating RNN performance with the generative model should be avoided”.   

      However, we would like to note that our empirical results provided a somewhat more nuanced scenario where we found that the RNN generally improved the predictive accuracy of most participants. Importantly, this improvement was found to be equal across participants with no systematic benefits for low vs high IQ participants. We understand that there is always the possibility that another model would show a systematic benefit for low vs. high IQ participants, however, we suggest that this is less likely given the current evidence. We have made an effort to clearly note these issues in the discussion.  

      In addition to this general limitation, the paper also makes a few additional claims that are not fully supported by the provided evidence. For example, Figure 4 highlights the relationship between the optimal epochs and agent noise. Yet, it is nonetheless possible that the optimal epoch is influenced by model parameters other than inverse temperature (e.g., learning rate). This could again lead to invalid conclusions, such as concluding that low-IQ is associated with optimal epoch when an alternative account might be that low-IQ is associated with low learning rate, which in turn is associated with optimal epoch. Yet additional factors such as the deep double-descent (Nakkiran et al., ICLR 2020) can also influence the optimal epoch value as computed by the authors.

      An additional issue is that Figure 4 reports an association between optimal epoch and noise, but noise is normalized by the true minimal/maximal inverse-temperature of hybrid agents (Eq. 23). It is thus possible that the relationship does not hold for more extreme values of inverse-temperature such as beta=0 (extremely noisy behavior) or beta=inf (deterministic behavior), two important special cases that should be incorporated in the current study. Finally, even taking the association in Figure 4 at face value, there are potential issues with inferring noise from the optimal epoch when their correlation is only r~=0.7. As shown in the figures, upon finding a very low optimal epoch for a particular subject, one might be compelled to infer high amounts of noise, even though several agents may exhibit a low optimal epoch despite having very little noise.
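
      The two special cases mentioned above can be made concrete with a small sketch. It assumes a standard softmax choice rule over action values with inverse temperature beta; whether this matches the exact choice rule of the hybrid agents, or the normalization in Eq. 23, is an assumption, and the function name and action values are hypothetical.

```python
import numpy as np

def softmax_policy(q_values, beta):
    """Choice probabilities under a softmax rule with inverse temperature beta."""
    q = np.asarray(q_values, dtype=float)
    if np.isinf(beta):
        # Limit beta -> inf: all probability mass on the maximal action(s)
        best = (q == q.max()).astype(float)
        return best / best.sum()
    # Subtracting the maximum leaves the probabilities unchanged and avoids overflow
    z = beta * (q - q.max())
    p = np.exp(z)
    return p / p.sum()

q = [1.0, 0.0]  # hypothetical action values
p_random = softmax_policy(q, 0.0)            # beta = 0: values are ignored
p_graded = softmax_policy(q, 1.0)            # intermediate beta: graded preference
p_deterministic = softmax_policy(q, np.inf)  # beta = inf: always the best action
```

      Under this rule, beta=0 yields uniform random choice regardless of the values, and beta=inf yields a deterministic choice of the higher-valued action, which is why these two endpoints bracket the range of behavioral noise.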

      Thank you for these comments. Indeed, there is much we do not yet fully understand about the factors that influence optimal epochs. Currently, it is clear to us that the number of optimal epochs is influenced by a variety of factors, including network size, the data size, and other cognitive parameters, such as the learning rate. We hope that our work serves as a proof-of-concept, suggesting that, in certain scenarios, the number of epochs can be utilized as an empirical estimate. Moreover, we maintain that, at least within the context of the current paradigm, the number of optimal epochs is primarily sensitive to the amount of true underlying noise, assuming the number of trials and network size are constant. We are therefore hopeful that this proof-of-concept will encourage research that will further examine the factors that influence the optimal epochs in different behavioral paradigms.

      To address the reviewer's justified concerns, we have made several amendments to the manuscript. First, we added an additional version of Figure 4 in the Supplementary Information material, where the noise parameter values are not scaled. We hope this adjustment clarifies that the parameters were tested across a broad spectrum of values (e.g., 0 to 10 for the hybrid model), spanning the two extremes of complete randomness and high determinism. Second, we included a linear regression analysis showing the association of all model parameters (including noise) with the optimal number of epochs. As anticipated by the reviewer, the learning rate was also found to be associated with the number of optimal epochs. Nonetheless, the noise parameter appears to maintain the most substantial association with the number of optimal epochs. We have also added a specific mention of these associations in the discussion, to inform readers that the association between the number of optimal epochs and model parameters should be examined using simulation for other paradigms/models. Lastly, we acknowledge in the discussion that the findings regarding the association between the number of optimal epochs and noise warrant further investigation, considering other factors that might influence the determination of the optimal epoch point and the fact that the correlation with noise is strong, but not perfect (in the range of 0.7).

      The discussion now includes the following:

      “Several limitations should be considered in our proposed approach. First, fitting a data-driven neural network is evidently not enough to produce a comprehensive theoretical description of the data generation mechanisms. Currently, best practices for cognitive modeling \citep{wilson2019ten} require identifying under what conditions the model struggles to predict the data (e.g., using posterior predictive checks), and describing a different theoretical model that could account for these disadvantages in prediction. However, identifying conditions where the model shortcomings in predictive accuracy are due to model misspecifications rather than noisier behavior is a challenging task. We propose leveraging data-driven RNNs as a supplementary tool, particularly when they significantly outperform existing theoretical models, followed by refined theoretical modeling to provide insights into what processes were mis-specified in the initial modeling effort.

      Second, although we observed a robust association between the optimal number of epochs and true noise across varying network sizes and dataset sizes (see Fig.~\ref{figS2}), additional factors such as network architecture and other model parameters (e.g., learning rate, see Fig.~\ref{figS7}) might influence this estimation. Further research is required to allow us to better understand how and why different factors change the number of optimal epochs for a given dataset before it can be applied with confidence to empirical investigations.

      Third, the empirical dataset used in our study consisted of data collected from human participants at a single time point, serving as the training set for our RNN. The test set data, collected with a time interval of approximately $\sim6$ and $\sim18$ months, introduced the possibility of changes in participants' decision-making strategies over time. In our analysis, we neglected any possible changes in participants' decision-making strategies during that time, changes that may lead to poorer generalization performance of our approach. Thus, further studies are needed to eliminate such possible explanations.

      Fourth, our simulations, albeit illustrative, were confined to known models, necessitating in-silico validation before extrapolating the efficacy of our approach to other model classes and tasks. Our aim was to showcase the potential benefits of using a data-driven approach, particularly when faced with unknown models. However, whether RNNs will provide optimal fits for tasks with more complex rules and long-term sequential dependencies remains uncertain.

      Finally, while positive outcomes where RNNs surpass theoretical models can prompt insightful model refinement, caution is warranted in directly equating RNN performance with that of the generative model, as seen in our simulations (e.g., Figure 3). We highlight that our empirical findings depict a more complex scenario, wherein the RNN enhanced the predictive accuracy for all participants uniformly. Notably, we also provide evidence supporting a null effect among individuals, with no consistent difference in RNN improvement over the theoretical model based on IQ. Although it remains conceivable that a different data-driven model could systematically heighten the predictive accuracy for individuals with lower IQs in this task, such a possibility seems less probable in light of the current findings.”

      Reviewer #1 (Recommendations For The Authors):

      Minor comments:

      Is the t that gets fed as input to RNN just timestep?

      No. Here t denotes the last transition type (rare/common), not the timestep.

      Line 378: what does "optimal epochs" mean here?

      The number of training epochs that minimizes both underfitting and overfitting (defined around line 300).
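      As a minimal sketch of this definition, assuming that a held-out (validation) loss is recorded after every training epoch, the optimal epoch can be read off as the epoch at which that loss is lowest. The function name `optimal_epoch` and the example losses are hypothetical and are not taken from the manuscript.

```python
def optimal_epoch(val_losses):
    """Return the 1-based epoch with the lowest held-out loss.

    Assumption: val_losses[i] is the validation loss after epoch i + 1.
    Earlier epochs underfit (loss still falling); later epochs overfit
    (loss rising again). Ties resolve to the earliest epoch.
    """
    if not val_losses:
        raise ValueError("val_losses must be non-empty")
    best_index = min(range(len(val_losses)), key=val_losses.__getitem__)
    return best_index + 1


# Hypothetical validation losses for one participant's dataset.
example_losses = [0.69, 0.61, 0.58, 0.57, 0.59, 0.63]
print(optimal_epoch(example_losses))
```

      Under this reading, a noisier dataset would be expected to reach its minimum earlier, which is the association examined in Figure 4.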

      Line 443: I don't think "identical" is the right word here - surely the authors just mean that there is not an obvious systematic difference in the distributions.

      Fixed

      I was expecting to see ~500 points in Figure 7a, but there seem to be only 50... why weren't all datasets with at least 2 sessions used for this analysis?

      We used the ~500 subjects with only 2 datasets to pre-train the RNN, and then fine-tuned the pre-trained RNN on the other 54 subjects who have 3 datasets. The correlation between IQ and optimal epoch also holds for the ~500 subjects, as shown below.

      Author response image 1.

      Reviewer #2 (Recommendations For The Authors):

      Figure 3b: despite spending a long time trying to understand the meaning of each cell of the confusion matrix, I'm still unsure what they represent. Would be great if you could spell out the meaning of each cell individually, at least for the first matrix in the paper.

      We added a clarification to the Figure caption. 

      Figure 5: Why didn't the authors show this exact scenario using simulated data? It would be much easier to understand the predictions of this figure if they had been demonstrated in simulated data, such as individuals with different amounts of behavioral noise or different levels of model misspecifications.

      In Figure 5 the x-axis represents IQ. Replacing the x-axis with true noise would make what we present now as Figure 4. We have made an effort to emphasize the meaning of the axes in the caption. 

      Line 195 ("...in the action selection. Where"). Typo? No period is needed before "where".

      Fixed

      Line 213 ("K dominated-hand model"). I was intrigued by this model, but wasn't sure whether it has been used previously in the literature, or whether this is the first time it has been proposed.

      To our knowledge, this is the first time this model has been used.

      Line 345 ("This suggests that RNN is flexible enough to approximate a wide range of different behavioral models"): Worth explaining why (i.e., because the GRUs are able to capture dependencies across longer delays than a k-order Logistic Regression model).

      Line 356 ("We were interested to test"): Suggestion: "We were interested in testing".

      Fixed

      Line 389 ("However, as long as the number of observations and the size of the network is the same between two datasets, the number of optimal epochs can be used to estimate whether the dataset of one participant is noisier compared with a second dataset."): This is an important claim that should ideally be demonstrated directly. The paper only illustrates this effect through a correlation and a scatter plot, where higher noise tends to predict a lower optimal epoch. However, is the claim here that, in some circumstances, optimal epoch can be used to *deterministically* estimate noise? If so, this would be a strong result and should ideally be included in the paper.

      We have now omitted this sentence and toned down our claims, suggesting that while we did find a strong association between noise and optimal epochs, future research is required to establish to what extent this could be differentiated from other factors (i.e., network size, number of observations).

    1. Author response:

      The following is the authors’ response to the original reviews.

      Preliminary note from the Reviewing Editor:

      The evaluations of the two Reviewers are provided for your information. As you can see, their opinions are very different.

      Reviewer #1 is very harsh in his/her evaluation. Clearly, we don't expect you to be able to affect one type of actin network without affecting the other, but rather to change the balance between the two. However, he/she also raises some valid points, in particular that more rationale should be added for the perturbations (also mentioned by Reviewer #2). Both Reviewers have also excellent suggestions for improving the presentation of the data.

      We sincerely appreciate your and the reviewers’ suggestions. The manuscript has been amended accordingly.

      On another point, I was surprised when reading your manuscript that a molecular description of chirality change in cells is presented as a completely new one. Alexander Bershadsky's group has identified several factors (including alpha-actinin) as important regulators of the direction of chirality. The articles are cited, but these important results are not specifically mentioned. Highlighting them would not call into question the importance of your work, but might even provide additional arguments for your model.

      We appreciate the editor’s comment. Alexander Bershadsky's group has done marvelous work in cell chirality. They introduced the stair-stepping and screw theory, which suggested how radial fiber polymerization generates ACW force and drives the actin cytoskeleton into the ACW pattern. Moreover, they have identified chiral regulators like alpha-actinin 1, mDia1, capZB, and profilin 1, which can reverse or neutralize the chiral expression.

      It is worth noting that Bershadsky's group primarily focuses on radial fibers. In our manuscript, instead, we primarily focused on the contractile unit in the transverse arcs and CW chirality in our investigation. Our manuscript incorporates our findings in the transverse arcs and the radial fibers theory by Bershadsky's group into the chirality balance hypothesis, providing a more comprehensive understanding of the chirality expression.

      We have included relevant articles from Alexander Bershadsky's group, and we agree that highlighting these important results on chiral regulators would further strengthen our manuscript. The manuscript was revised as follows:

      “ACW chirality can be explained by the right-handed axial spinning of radial fibers during polymerization, i.e. ‘stair-stepping’ mode proposed by Tee et al. (Tee et al. 2015) (Figure 8A; Video 4). As actin filament is formed in a right-handed double helix, it possesses an intrinsic chiral nature. During the polymerization of radial fiber, the barbed end capped by formin at focal adhesion was found to recruit new actin monomers to the filament. The tethering by formin during the recruitment of actin monomers contributes to the right-handed tilting of radial fibers, leading to ACW rotation. Supporting this model, Jalal et al. (Jalal et al. 2019) showed that the silencing of mDia1, capZB, and profilin 1 would abolish the ACW chiral expression or reverse the chirality into CW direction. Specifically, the silencing of mDia1, capZB or profilin-1 would attenuate the recruitment of actin monomer into the radial fiber, with mDia1 acting as the nucleator of actin filament (Tsuji et al. 2002), CapZB promoting actin polymerization as capping protein (Mukherjee et al. 2016), and profilin-1 facilitating ATP-bound G-actin to the barbed ends (Haarer and Brown 1990; Witke 2004). The silencing resulted in a decrease in the elongation velocity of radial fiber, driving the cell into neutral or CW chirality. These results support our finding that reduction of radial fiber elongation can invert the balance of chirality expression, changing the ACW-expressing cell into a neutral or CW-expressing cell.”

      By incorporating their findings into our revision and discussion, we provide additional support for our radial fiber-transverse arc balance model for chirality expression. The revision is made on pages 8 to 9, 13, lines 253 to 256, 284, 312 to 313, 443, 449 to 459.

      Public Reviews:

      Reviewer #1 (Public Review):

      Summary:

      Kwong et al. present evidence that two actin-filament based cytoskeletal structures regulate the clockwise and anticlockwise rotation of the cytoplasm. These claims are based on experiments using cells plated on micropatterned substrates (circles). Previous reports have shown that the actomyosin network that forms on the dorsal surface of a cell plated on a circle drives a rotational or swirling pattern of movement in the cytoplasm. This actin network is composed of a combination of non-contractile radial stress fibers (AKA dorsal stress fibers) which are mechanically coupled to contractile transverse actin arcs (AKA actin arcs). The authors claim that directionality of the rotation of the cytoplasm (i.e., clockwise or anticlockwise) depends on either the actin arcs or radial fibers, respectively. While this would be interesting, the authors are not able to remove either actin-based network without affecting the other. This is not surprising, as it is likely that the radial fibers require the arcs to elongate them, and the arcs require the radial fibers to stop them from collapsing. As such, it is difficult to make simple interpretations such as the clockwise bias is driven by the arcs and anticlockwise bias is driven by the radial fibers.

      Weaknesses:

      (1) There are also multiple problems with how the data is displayed and interpreted. First, it is difficult to compare the experimental data with the controls as the authors do not include control images in several of the figures. For example, Figure 6 has images showing myosin IIA distribution, but Figure 5 has the control image. Each figure needs to show controls. Otherwise, it will be difficult for the reader to understand the differences in localization of the proteins shown. This could be accomplished by either adding different control examples or by combining figures.

      We appreciate the reviewer’s comment. We agree with the reviewer that it is difficult to compare our results in the current arrangement. The controls are included in the new Figure 6.

      (2) It is important that the authors label the range of gray values of the heat maps shown. It is difficult to know how these maps were created. I could not find a description in the methods, nor have previous papers laid out a standardized way of doing it. As such, the reader needs some indication as to whether the maps showing different cells were created the same and show the same range of gray levels. In general, heat maps showing the same protein should have identical gray levels. The authors already show color bars next to the heat maps indicating the range of colors used. It should be a simple fix to label the minimum (blue on the color bar) and the maximum (red on the color bar) gray levels on these color bars. The profiles of actin shown in Figure 3 and Figure 3- figure supplement 3 were useful for interpreting the distribution of actin filaments. Why did the authors not show the same for the myosin IIa distributions?

      We appreciate the reviewer’s comment. For generating the distribution heatmap, the images were taken under the same settings (e.g., fluorescent staining procedure, excitation intensity, or exposure time). The prerequisite of cells for image stacking was that they had to be fully spread on either 2500 µm² or 750 µm² circular patterns. Then, the location for image stacking was determined by identifying the center of each cell spread in a perfect circle. Finally, the images were aligned at the cell center to calculate the averaged intensity to show the distribution heatmap on the circular pattern. Revision is made on pages 19 to 20, lines 668 to 677.

      It is important to note that the individual heatmaps represent the normalized distribution generated using unique color intensity ranges. This approach was chosen to emphasize the proportional distribution of protein within cells and its variations among samples, especially for samples with generally lower expression levels. Additionally, a differential heatmap with its own range was employed to demonstrate the normalized differences compared to the control sample. Furthermore, to provide additional insight, we plotted the intensity profile of the same protein with the same size for comparative analysis. Revision is made on pages 20, lines 679 to 682.
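      The stacking and normalization steps described above can be sketched as follows. This is an illustrative sketch only, not the analysis code used for the manuscript: it assumes that every image has already been cropped to the same size and aligned at the cell center, represents images as nested lists of intensities, and takes the differential heatmap as the difference between two individually normalized maps. All function names and pixel values are hypothetical.

```python
def average_stack(images):
    """Pixel-wise mean of equally sized 2D images aligned at the cell center."""
    n = len(images)
    rows, cols = len(images[0]), len(images[0][0])
    return [[sum(img[r][c] for img in images) / n for c in range(cols)]
            for r in range(rows)]


def normalize(image):
    """Rescale one heatmap to [0, 1] using its own minimum and maximum."""
    flat = [v for row in image for v in row]
    lo, hi = min(flat), max(flat)
    span = hi - lo
    if span == 0:
        # A flat image carries no distribution information.
        return [[0.0 for _ in row] for row in image]
    return [[(v - lo) / span for v in row] for row in image]


def difference(treated, control):
    """Differential heatmap: treated minus control, pixel by pixel."""
    return [[t - c for t, c in zip(t_row, c_row)]
            for t_row, c_row in zip(treated, control)]


# Hypothetical 2x2 "images" standing in for aligned micrographs.
control_cells = [[[0, 2], [4, 6]], [[2, 4], [6, 8]]]
treated_cells = [[[4, 4], [0, 8]]]

control_map = normalize(average_stack(control_cells))
treated_map = normalize(average_stack(treated_cells))
print(difference(treated_map, control_map))
```

      Because each map is rescaled to its own range, the result shows where a protein is concentrated within the cell rather than how much of it is expressed, which is why the color-bar labels matter when comparing samples.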

      The labels of the heatmap are included to show the intensity in the revised Figure 3, Figure 5, Figure 6, and Figure 3 —figure supplement 4.

      To better illustrate the myosin IIa distribution, the myosin intensity profiles were plotted for Y27 treatment and gene silencing. The figures are included as Figure 5—figure supplement 2 and Figure 6—figure supplement 2. Revisions are made on pages 10, lines 332 to 334 and pages 11, lines 377 to 379.

      (3) Line 189 "This absence of radial fibers is unexpected". The authors should clarify what they mean by this statement. The claim that the cell in Figure 3B has reduced radial stress fiber is not supported by the data shown. Every actin structure in this cell is reduced compared to the cell on the larger micropattern in Figure 3A. It is unclear if the radial stress fibers are reduced more than the arcs. Are the authors referring to radial fiber elongation?

      We appreciate the reviewer’s comment. To better illustrate the reduction of radial fibers and transverse arcs, we calculated each structure's pixel number and its percentage in the image. As radial fibers emerge from the cell boundary and point towards the cell center, whereas transverse arcs are parallel to the cell edge, the actin filaments can be identified by their angle with respect to the cell center. We found that the pixel number of radial fibers is reduced by 91.98 % on the 750 µm² pattern compared to the 2500 µm² pattern, while the pixel number of transverse arcs is reduced by 70.58 % (Figure 3- figure supplement 3A). Additionally, we compared the percentage of actin structures on different pattern sizes (Figure 3- figure supplement 3B). On the 2500 µm² pattern, radial fibers account for 61.76 ± 2.77 % of the actin structure, but only 31.13 ± 2.76 % on the 750 µm² pattern. These results provide evidence of the structural reduction on a smaller pattern.
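      The angle-based identification described above can be sketched as follows, under stated assumptions: each fiber pixel comes with a position and a local fiber orientation (in radians), the cell center is at the origin unless given, and a 45° cutoff separates radial from transverse orientations. The cutoff, the function names, and the example pixels are hypothetical and are not taken from the manuscript.

```python
import math


def classify_pixel(x, y, fiber_angle, cx=0.0, cy=0.0, threshold_deg=45.0):
    """Label a fiber pixel as 'radial' or 'transverse'.

    The fiber orientation is compared with the direction from the cell
    center (cx, cy) to the pixel. Orientations are axial (a line has no
    head or tail), so the difference is folded into [0, 90] degrees.
    """
    radial_angle = math.atan2(y - cy, x - cx)
    diff = abs(fiber_angle - radial_angle) % math.pi
    diff = min(diff, math.pi - diff)
    return "radial" if math.degrees(diff) < threshold_deg else "transverse"


def radial_percentage(pixels):
    """Percentage of fiber pixels classified as radial."""
    labels = [classify_pixel(x, y, angle) for x, y, angle in pixels]
    return 100.0 * labels.count("radial") / len(labels)


# Hypothetical pixels: (x, y, local fiber orientation in radians).
pixels = [
    (10, 0, 0.0),           # points along the radius
    (10, 0, math.pi / 2),   # runs parallel to the cell edge
    (0, 10, math.pi / 2),   # points along the radius
    (-10, 0, 0.0),          # points along the radius
]
print(radial_percentage(pixels))
```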

      Regarding the radial fiber elongation, we only discussed the reduction of radial fibers on the 750 µm² pattern compared to the 2500 µm² pattern in this part. To better understand the contribution of radial fibers to chirality, we compared the radial fiber elongation rate in the LatA treatment and control on the 2500 µm² pattern (Figure 4). This result suggests the potential role of radial fibers in cell chirality. Revisions are made on page 6, lines 186 to 194; pages 17 to 18, lines 601 to 606; and the new Figure 3- figure supplement 3.

      (4) The choice of the small molecule inhibitors used in this study is difficult to understand, and their results are also confusing. For example, sequestering G actin with Latrunculin A is a complicated experiment. The authors use a relatively low concentration (50 nM) and show that actin filament-based structures are reduced and there are more in the center of the cell than in controls (Figure 3E). What was the logic of choosing this concentration?

      We appreciate the reviewer’s comment. The concentrations of the drugs were selected based on the literature and their known effects on actin arrangement or chiral expression.

      For example, Latrunculin A was used at 50 nM because it has been proven effective in reversing chirality at or below this concentration (Bao et al., 2020; Chin et al., 2018; Kwong et al., 2019; Wan et al., 2011). Similarly, the 2 µM A23187 treatment concentration was selected to initiate actin remodeling (Shao et al., 2015). Furthermore, NSC23766 at 100 µM was found to efficiently inhibit Rac1 activation and resulted in a distinct change in actin structure (Chen et al., 2011; Gao et al., 2004), enhancing ACW chiral expression. The revision is made on pages 6 to 7, lines 202 to 211.

      (5) Using a small molecule that binds the barbed end (e.g., cytochalasin) could conceivably be used to selectively remove longer actin filaments, which the radial fibers have compared to the lamellipodia and the transverse arcs. The authors should articulate how the actin cytoskeleton is being changed by latrunculin treatment and the impact on chirality. Is it just that the radial stress fibers are not elongating? There seem to be more radial stress fibers than in controls, rather than an absence of radial stress fibers.

      We appreciate the reviewer’s comment. Our results showed that Latrunculin A treatment reversed the cell chirality. To compare the amount of radial fibers and transverse arcs, we calculated the structures' pixel percentages. We found that the percentage of radial fiber pixels with LatA treatment was reduced compared to that of the control, while the percentage of transverse arc pixels increased (Figure 3— figure supplement 5). This result suggests that radial fibers are inhibited under Latrunculin A treatment.

      Furthermore, the elongation rate of radial fibers is reduced by Latrunculin A treatment (Figure 4). This result, along with the reduction of the radial fiber percentage under Latrunculin A treatment, suggests a significant impact of radial fibers on the ACW chirality. Revisions are made on pages 7 to 8, lines 244 to 250 and the new Figure 3— figure supplement 5 and Figure 3— figure supplement 6.

      (6) Similar problems arise from the other small molecules as well. LPA has more effects than simply activating RhoA. Additionally, many of the quantifiable effects of LPA treatment are apparent only after the cells are serum starved, which does not seem to be the case here.

      We appreciate the reviewer’s comment. The reviewer mentioned that the quantifiable effects of LPA treatments were seen after the cells were serum-starved. LPA is known to be a serum component and has an affinity to albumin in serum (Moolenaar, 1995). Serum starvation is often employed to better observe the effects of LPA by comparing conditions with and without LPA. We agree with the reviewer that the effect of LPA cannot be fully seen under the current setting. Based on the reviewer’s comment and after careful consideration, we have decided to remove the data related to LPA from our manuscript. Revisions are made on pages 6 to 7, 17 and Figure 3— figure supplement 4.

      (7) Furthermore, inhibiting ROCK with Y-27632 affects myosin light chain phosphorylation and is not specific to myosin IIA. Are the two other myosin II paralogs expressed in these cells (myosin IIB and myosin IIC)? If so, the authors’ statements about this experiment should refer to myosin II, not myosin IIa.

      We appreciate the reviewer’s comment. We agree that ensuring accuracy and clarity in our statements is important. The terminology regarding the Y27632 experiment has been revised to myosin II for a more accurate description. Revision is made on pages 9 to 10 and 29, lines 317 to 341, 845 and 848.

      (8) None of the uses of the small molecules above have supporting data using a different experimental method. For example, backing up the LPA experiment by perturbing RhoA tho.

      We appreciate the reviewer’s comment. After careful consideration, we have decided to remove the data related to LPA from our manuscript. Revisions are made on pages 6 to 7, 17 and Figure 3— figure supplement 4.

      (9) The use of SMIFH2 as a "formin inhibitor" is also problematic. SMIFH2 also inhibits myosin II contractility, making interpreting its effects on cells difficult to impossible. The authors present data of mDia2 knockdown, which would be a good control for this SMIFH2.

      We appreciate the reviewer’s comment. We agree that there is potential interference of SMIFH2 with myosin II contractility, which could introduce confounding factors to the results. Based on your comment and further consideration, we have decided to remove the data related to SMIFH2 from our manuscript. Revisions are made on pages 6 to 7, 10, 17 and Figure 3— figure supplement 4.

      (10) However, the authors claim that mDia2 "typically nucleates tropomyosin-decorated actin filaments, which recruit myosin II and anneal endwise with α-actinin- crosslinked actin filaments."

      There is no reference to this statement and the authors' own data shows that both arcs and radial fibers are reduced by mDia2 knockdown. Overall, the formin data does not support the conclusions the authors report.

      We appreciate the reviewer’s comment. We apologize for the lack of citation for this claim. To address this, we have added a reference to support this claim in the revised manuscript (Tojkander et al., 2011). Revision is made on page 10, lines 345 to 347.

      Regarding the actin structure under mDia2 gene silencing, our results showed that myosin II was dissociated from the actin filaments compared to the control. At the same time, there are no considerable differences in the actin structure of radial fibers and transverse arcs between mDia2 gene silencing and the control.

      (11) The data in Figure 7 does not support the conclusion that myosin IIa is exclusively on top of the cell. There are clear ventral stress fibers in A (actin) that have myosin IIa localization. The authors simply chose to not draw a line over them to create a height profile.

      We appreciate the reviewer’s comment. To better illustrate myosin IIa distribution in a cell, we have included a video showing the myosin IIa staining from the base to the top of the cell (Video 7). At the cell base, the intensity of myosin IIa is relatively low at the center. However, when the focal plane elevates, we can clearly see the myosin II localizes near the top of the cell (Figure 7B and Video 7). Revision is made on page 12, lines 421 to 424, and the new Video 7. 

      Reviewer #2 (Public Review):

      Summary:

      Chirality of cells, organs, and organisms can stem from the chiral asymmetry of proteins and polymers at a much smaller lengthscale. The intrinsic chirality of actin filaments (F-actin) is implicated in the chiral arrangement and movement of cellular structures including F-actin-based bundles and the nucleus. It is unknown how opposite chiralities can be observed when the chirality of F-actin is invariant. Kwong, Chen, and co-authors explored this problem by studying chiral cell-scale structures in adherent mammalian cultured cells. They controlled the size of adhesive patches, and examined chirality at different timepoints. They made various molecular perturbations and used several quantitative assays. They showed that forces exerted by antiparallel actomyosin bundles on parallel radial bundles are responsible for the chirality of the actomyosin network at the cell scale.

      Strengths:

      Whereas previously, most effort has been put into understanding radial bundles, this study makes an important distinction that transverse or circumferential bundles are made of antiparallel actomyosin arrays. A minor point that was nice for the paper to make is that between the co-existing chirality of nuclear rotation and radial bundle tilt, it is the F-actin driving nuclear rotation and not the other way around. The paper is clearly written.

      Weaknesses:

      The paper could benefit from grammatical editing. Once the following Major and Minor points are addressed, which may not require any further experimentation and does not entail additional conditions, this manuscript would be appropriate for publication in eLife.

      Recommendations for the authors:

      Reviewer #2 (Recommendations For The Authors):

      Major:

      (1) The binary classification of cells as exhibiting clockwise or anticlockwise F-actin structures does not capture the instances where there is very little chirality, such as in the mDia2-depleted cells on small patches (Figure 6B). Such reports of cell chirality throughout the cell population need to be reported as the average angle of F-actin structures on a per cell basis as a rose plot or scatter plot of angle. These changes to cell-scoring and data display will be important to discern between conditions where chirality is random (50% CW, 50% ACW) from conditions where chirality is low (radial bundles are radial and transverse arcs are circumferential).

      We appreciate the reviewer’s comment. We apologize if we did not convey our analysis method clearly enough. Throughout the manuscript, unless mentioned otherwise, the chirality analysis was based on the chiral nucleus rotation within a period of observation. The only exception is the F-actin structure chirality in Figure 3—figure supplement 1, in which we analyzed the angle of radial fibers of the control cell on the 2500 µm² pattern. This is described on pages 5 to 6, lines 169-172, and in the method section “Analysis of fiber orientation and actin structure on circular pattern” on page 17.

      Based on the feedback, we attempted to use a scatter plot to present the mDia2 overexpression and silencing to show the randomness of the result. However, because scatter plots primarily focus on visualizing the distribution, they become cluttered and visually overwhelming, as shown below.

      Author response image 1.

      (A) Percentage of ACW nucleus rotational bias on the 2500 µm² pattern with untreated control (reused data from Figure 3D, n = 57), mDia2 silencing (n = 48), and overexpression (n = 25). (B) Probability of ACW/CW rotation on the 750 µm² pattern with untreated control (reused data from Figure 3E, n = 34), mDia2 silencing (n = 53), and overexpression (n = 22). Mean ± SEM. Two-sample equal variance two-tailed t-test.

      Therefore, our manuscript primarily presents the data as column bar charts with statistical analysis by Student's t-test. The column bar chart makes it easier to understand and compare values. In brief, Student's t-test is commonly used to evaluate whether the means of two groups are significantly different, assuming equal variance. As such, Student's t-test is able to discern the randomness of the chirality.
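      For reference, the statistic behind a two-sample equal-variance t-test can be sketched as follows. This is a generic illustration using only the Python standard library, with hypothetical values standing in for per-experiment percentages of ACW rotation; it returns the t statistic and degrees of freedom only. Obtaining the two-tailed p-value requires the t distribution, for example via `scipy.stats.ttest_ind` with `equal_var=True`.

```python
import math
from statistics import mean, variance


def pooled_t(a, b):
    """Two-sample t statistic assuming equal variances.

    Returns (t, degrees_of_freedom). The pooled variance weights each
    group's sample variance by its own degrees of freedom.
    """
    na, nb = len(a), len(b)
    df = na + nb - 2
    pooled_var = ((na - 1) * variance(a) + (nb - 1) * variance(b)) / df
    standard_error = math.sqrt(pooled_var * (1 / na + 1 / nb))
    return (mean(a) - mean(b)) / standard_error, df


# Hypothetical per-experiment values for two conditions.
control = [1, 2, 3, 4, 5]
treated = [2, 3, 4, 5, 6]
t_stat, dof = pooled_t(control, treated)
print(t_stat, dof)
```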

      (2) The authors need to discuss the likely nucleator of F-actin in the radial bundles, since it is apparently not mDia2 in these cells.

      We appreciate the reviewer’s comment. In our manuscript, we originally focused on mDia2 and Tpm4 as they are the transverse arc nucleator and the mediator of myosin II motion. However, we agree with the reviewer that discussing the radial fiber nucleator would provide more insight into radial fiber polymerization in ACW chirality and improve the completeness of the story.

      Radial fibers polymerize at focal adhesions. Several proteins are involved in actin nucleation or stress fiber formation at focal adhesions, such as the Arp2/3 complex (Serrels et al., 2007), Ena/VASP (Applewhite et al., 2007; Gateva et al., 2014), and formins (Dettenhofer et al., 2008; Sahasrabudhe et al., 2016; Tsuji et al., 2002). Within the formin family, mDia1 is the likely nucleator of F-actin in the radial bundles. The presence of mDia1 facilitates the elongation of actin bundles at focal adhesions (Hotulainen and Lappalainen, 2006). Studies by Jalal et al. (2019) and Tee et al. (2023) have demonstrated that silencing of mDia1 abolishes the ACW actin expression, whereas silencing of other nucleation proteins such as the Arp2/3 complex or Ena/VASP only reduces the ACW actin expression without abolishing it.

      Based on these findings, the attenuation of radial fiber elongation would abolish the ACW chiral expression, providing more support for our model in explaining chirality expression.

      This part is incorporated into the Discussion. The revision is made on page 13, lines 443, 449 to 459.

      Minor:

      (1) In the introduction, additional observations of handedness reversal need to be referenced (line 79), including Schonegg, Hyman, and Wood 2014 and Zaatri, Perry, and Maddox 2021.

      We appreciate the reviewer’s comment. The references reporting handedness reversal are now cited on page 3, lines 78 to 79.

      (2) For clarity of logic, the authors should share the rationale for choosing, and results from administering, the collection of compounds as presented in Figure 3 one at a time instead of as a list.

      We appreciate the reviewer’s comment. The concentration of drugs was determined based on existing literature and their known outcomes on actin arrangement or chiral expression.

      To elucidate, the use of Latrunculin A was based on previous studies, which have demonstrated that it reverses chirality at or below 50 nM (Bao et al., 2020; Chin et al., 2018; Kwong et al., 2019; Wan et al., 2011). Because inhibiting F-actin assembly can lead to the expression of CW chirality, we hypothesized that the opposite treatment might enhance ACW chirality. Therefore, we chose A23187 treatment at a concentration of 2 µM, as it can initiate actin remodeling and stress fiber formation (Shao et al., 2015).

      Furthermore, in an attempt to replicate the reversal of chirality by inhibiting F-actin assembly through other pathways, we explored NSC23766 at 100 µM, which has been found to inhibit Rac1 activation (Chen et al., 2011; Gao et al., 2004) and reduce cortical F-actin assembly (Head et al., 2003). However, it failed to reverse the chirality and instead enhanced the ACW chirality of the cells.

      We carefully selected the drugs and the applied concentrations to investigate various pathways and mechanisms that influence actin arrangement and might affect the chiral expression. We believe that this clarification strengthens the rationale behind our choice of drugs. The revision is made on pages 6 to 7, lines 202 to 211.

      (3) "Image stacking" isn't a common term to this referee. Its first appearance in the main text (line 183) should be accompanied with a call-out to the Methods section. The authors could consider referring to this approach more directly. Related issue: Image stacking fails to report the prominent enrichment of F-actin at the very cell periphery (see Figure 3 A and F) except for with images of cells on small islands (Figure 3H). Since this data display approach seems to be adding the intensity from all images together, and since cells on circular adhesive patches are relatively radially symmetric, it is unclear how to align cells, but perhaps cells could be aligned based on a slight asymmetry such as the peripheral location with highest F-actin intensity or the apparent location of the centrosome.

      We appreciate the reviewer’s comment. We fully acknowledge that “image stacking” is an uncommon term and that it was insufficiently described in the Methods section. First, we have added a call-out to the Methods section at its first appearance (page 6, lines 182 to 183). The method of image stacking is as follows. To generate the distribution heatmap, the images were acquired under the same settings (e.g., staining procedure, fluorescent intensity, exposure time, etc.). To be included in image stacking, cells had to be fully spread on either the 2500 µm² or the 750 µm² circular pattern. A consistent position for image stacking could then be found by identifying the center of each cell, which spreads into a perfect circle. Finally, the images were aligned at the center and the averaged intensity was calculated to show the distribution heatmap on the circular pattern.

      We agree with the reviewer that our image alignment and stacking are based on cells that are radially symmetric. As such, the intensity distribution of the stacked image serves to compare differences in F-actin along the radial direction. The revision is made on page 19, lines 668 to 682.
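
      The center-alignment and averaging steps described above can be sketched as follows. This is a simplified illustration and not the analysis code used in the manuscript: images are plain nested lists of intensities, the cell centers are assumed to be known already, and the names `crop_centered` and `stack_images` are hypothetical.

```python
def crop_centered(image, center, half_size):
    """Return the square window of side 2*half_size+1 centred on (row, col)."""
    row0, col0 = center
    if (row0 - half_size < 0 or col0 - half_size < 0
            or row0 + half_size >= len(image)
            or col0 + half_size >= len(image[0])):
        raise ValueError("window does not fit inside the image")
    return [row[col0 - half_size:col0 + half_size + 1]
            for row in image[row0 - half_size:row0 + half_size + 1]]


def stack_images(images, centers, half_size):
    """Align each image at its cell centre and average the intensities pixel-wise."""
    crops = [crop_centered(img, ctr, half_size) for img, ctr in zip(images, centers)]
    if not crops:
        raise ValueError("no images to stack")
    side = 2 * half_size + 1
    count = len(crops)
    return [[sum(crop[r][c] for crop in crops) / count for c in range(side)]
            for r in range(side)]


# Two toy 5x5 "cells" whose bright pixel sits at different image positions.
cell_a = [[0] * 5 for _ in range(5)]
cell_a[1][1] = 10
cell_b = [[0] * 5 for _ in range(5)]
cell_b[3][2] = 20
heatmap = stack_images([cell_a, cell_b], [(1, 1), (3, 2)], half_size=1)
print(heatmap)
```

      Because the windows are aligned only at the center and not rotated, any feature that is not radially symmetric is averaged out, which is why the stacked heatmap is informative mainly along the radial direction.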

      (4) The authors need to be consistent with wording about chirality, avoiding "right" and left (e.g. lines 245-6) since if the cell periphery were oriented differently in the cropped view, the tilt would be a different direction side-to-side but the same chirality. This section is confusing since the peripheral radial bundles are quite radial, and the inner ones are pointing from upper left to lower right, pointing (to the right) more downward over time, rather than more right-ward, in the cropped images.

      We appreciate the reviewer’s comment. We apologize for the confusion caused by our description of the tilting direction. For consistency in our later description, we describe the “right” or “left” direction of the radial fibers with reference to the direction of fiber elongation, so that “rightward tilting” leads toward the ACW rotation of the chiral pattern. To retain the term “rightward tilting”, we added this description to ensure accurate communication in our writing. We also rearranged the images in the new Figure 4A and Video 2 for better observation. The revision is made on page 8, lines 262 to 263.

      (5) Why are the cells in Figure 4A dominated by radial (and more-central, tilting) fibers, while control cells in 4D show robust circumferential transverse arcs? Have these cells been plated for different amounts of time or is a different optical section shown?

      We appreciate the reviewer’s comment. The cells in Figure 4A and Figure 4D were prepared under similar conditions, such as incubation time and optical settings. Actin organization is a dynamic process, and cells can exhibit varied actin arrangements, transitioning between different forms such as circular, radial, chordal, chiral, or linear patterns, as they spread on a circular island (Tee et al., 2015). In Figure 4A, the actin is arranged in a chiral pattern, whereas in Figure 4D, the actin exhibits a radial pattern. These variations reflect the natural dynamics of actin organization within cells during the imaging process.

      (6) All single-color images (such as Fig 5 F-actin) need to be black-on-white, since it is far more difficult to see F-actin morphology with red on black.

      We appreciate the reviewer’s comment. We have changed all F-actin images (single color) into black and white for better image clarity. Revisions are made in the new Figure 5, Figure 6 and Figure 7.

      (7) Figure 5A, especially the F-actin staining, is quite a bit blurrier than other micrographs. These images should be replaced with images of comparable quality to those shown throughout.

      We appreciate the reviewer’s comment. We agree that the F-actin staining in Figure 5 is difficult to observe. To improve image clarity, the F-actin staining images have been replaced with more zoomed-in images. The revision is made in the new Figure 5.

      (8) F-actin does not look unchanged by Y27632 treatment, as the authors state in line 306. This may be partially due to image quality and the ambiguities of communicating with the blue-to-red colormap. Similarly, I don't agree that mDia2 depletion did not change F-actin distribution (line 330) as cells in that condition had a prominent peripheral ring of F-actin missing from cells in other conditions.

      We appreciate the reviewer’s comment. We agree with the reviewer’s observation that the F-actin distribution is indeed changed under Y27632 treatment compared to the control in Figure 5A-B. Here, we would like to emphasize that the actin ring persists even though the actin structure is altered under Y27632 treatment. The actin ring refers to the darker red circle in the distribution heatmap. It represents the condensed actin structure, including radial fibers and transverse arcs. This important structure remains unaffected despite the disruption of myosin II, the key component in radial fibers.

      Furthermore, we agree with the reviewer that mDia2 depletion does change F-actin distribution. Similar to the Y27632 treatment, the actin ring persists even though the actin structure is altered under mDia2 gene silencing. Moreover, compared to other treatments, mDia2 depletion has a less significant impact on actin distribution. To address these points more comprehensively, we have revised the Y27632 treatment and mDia2 sections. The revisions for Y27632 and mDia2 are made on page 10, lines 324-327 and 352-353, respectively.

      (9) The colormap shown for intensity coding should be reconsidered, as dark red is harder to see than the yellow that is sub-maximal. Viridis is a colormap ranging from cooler and darker blue, through green, to warmer and lighter yellow as the maximum. Other options likely exist as well.

      We appreciate the reviewer’s comment. We carefully considered the reviewer’s concern and explored other color scale choices in the colormap function in Matlab. After evaluating different options, including the “viridis” color scale, we found that “jet” provides a wide range of colors, allowing effective visual presentation of the intensity variation in our data. The use of “jet” allows us to appropriately visualize the actin ring distribution, which is represented in red or dark red. While we understand that dark red could be harder to see than the sub-maximal yellow, we believe that “jet” serves our purpose of presenting the intensity information.

      (10) For Figure 6, why doesn't average distribution of NMMIIa look like the example with high at periphery, low inside periphery, moderate throughout lamella, low perinuclear, and high central?

      We appreciate the reviewer’s comment. We understand the reviewer’s concern that the average distribution of NMMIIa does not look the same as the example. The chosen image is the best representation of the NMMIIa disruption from the transverse arcs after mDia2 silencing. Additionally, it is important to note that the average distribution is a stacked image that includes other images. As such, the NMMIIa example and the distribution heatmap might not necessarily appear identical.

      (11) In 2015, Tee, Bershadsky and colleagues demonstrated that transverse bundles are dorsal to radial bundles, using correlative light and electron microscopy. While it is important for Kwong and colleagues to show that this is true in their cells, they should reference Tee et al. in the rationale section of text pertaining to Figure 7.

      We appreciate the reviewer’s comment. Tee et al. (2015) demonstrated that the transverse fiber is at the same height as the radial fiber based on correlative light and electron microscopy. Here, using the position of myosin IIa, a transverse arc component, our results show the dorsal positioning of transverse arcs with connection to the extension of radial fibers (Figure 7C), which is consistent with their findings. This is included in our manuscript on page 12, lines 421 to 424, and page 14, lines 477 to 480.

      References

      Applewhite, D.A., Barzik, M., Kojima, S.-i., Svitkina, T.M., Gertler, F.B., and Borisy, G.G. (2007). Ena/Vasp Proteins Have an Anti-Capping Independent Function in Filopodia Formation. Mol. Biol. Cell. 18, 2579-2591. DOI: https://doi.org/10.1091/mbc.e06-11-0990

      Bao, Y., Wu, S., Chu, L.T., Kwong, H.K., Hartanto, H., Huang, Y., Lam, M.L., Lam, R.H., and Chen, T.H. (2020). Early Committed Clockwise Cell Chirality Upregulates Adipogenic Differentiation of Mesenchymal Stem Cells. Adv. Biosyst. 4, 2000161. DOI: https://doi.org/10.1002/adbi.202000161

      Chen, Q.-Y., Xu, L.-Q., Jiao, D.-M., Yao, Q.-H., Wang, Y.-Y., Hu, H.-Z., Wu, Y.-Q., Song, J., Yan, J., and Wu, L.-J. (2011). Silencing of Rac1 Modifies Lung Cancer Cell Migration, Invasion and Actin Cytoskeleton Rearrangements and Enhances Chemosensitivity to Antitumor Drugs. Int. J. Mol. Med. 28, 769-776. DOI: https://doi.org/10.3892/ijmm.2011.775

      Chin, A.S., Worley, K.E., Ray, P., Kaur, G., Fan, J., and Wan, L.Q. (2018). Epithelial Cell Chirality Revealed by Three-Dimensional Spontaneous Rotation. Proc. Natl. Acad. Sci. U.S.A. 115, 12188-12193. DOI: https://doi.org/10.1073/pnas.1805932115

      Dettenhofer, M., Zhou, F., and Leder, P. (2008). Formin 1-Isoform IV Deficient Cells Exhibit Defects in Cell Spreading and Focal Adhesion Formation. PLoS One 3, e2497. DOI: https://doi.org/10.1371/journal.pone.0002497

      Gao, Y., Dickerson, J.B., Guo, F., Zheng, J., and Zheng, Y. (2004). Rational Design and Characterization of a Rac GTPase-Specific Small Molecule Inhibitor. Proc. Natl. Acad. Sci. U.S.A. 101, 7618-7623. DOI: https://doi.org/10.1073/pnas.0307512101

      Gateva, G., Tojkander, S., Koho, S., Carpen, O., and Lappalainen, P. (2014). Palladin Promotes Assembly of Non-Contractile Dorsal Stress Fibers through Vasp Recruitment. J. Cell Sci. 127, 1887-1898. DOI: https://doi.org/10.1242/jcs.135780

      Haarer, B., and Brown, S.S. (1990). Structure and Function of Profilin.

      Head, J.A., Jiang, D., Li, M., Zorn, L.J., Schaefer, E.M., Parsons, J.T., and Weed, S.A. (2003). Cortactin Tyrosine Phosphorylation Requires Rac1 Activity and Association with the Cortical Actin Cytoskeleton. Mol. Biol. Cell. 14, 3216-3229. DOI: https://doi.org/10.1091/mbc.e02-11-0753

      Hotulainen, P., and Lappalainen, P. (2006). Stress Fibers are Generated by Two Distinct Actin Assembly Mechanisms in Motile Cells. J. Cell Biol. 173, 383-394. DOI: https://doi.org/10.1083/jcb.200511093

      Jalal, S., Shi, S., Acharya, V., Huang, R.Y., Viasnoff, V., Bershadsky, A.D., and Tee, Y.H. (2019). Actin Cytoskeleton Self-Organization in Single Epithelial Cells and Fibroblasts under Isotropic Confinement. J. Cell Sci. 132. DOI: https://doi.org/10.1242/jcs.220780

      Kwong, H.K., Huang, Y., Bao, Y., Lam, M.L., and Chen, T.H. (2019). Remnant Effects of Culture Density on Cell Chirality after Reseeding. J. Cell Sci. 132. DOI: https://doi.org/10.1242/jcs.220780

      Moolenaar, W.H. (1995). Lysophosphatidic Acid, a Multifunctional Phospholipid Messenger. J. Cell Sci. 132. DOI: https://doi.org/10.1242/jcs.220780

      Mukherjee, K., Ishii, K., Pillalamarri, V., Kammin, T., Atkin, J.F., Hickey, S.E., Xi, Q.J., Zepeda, C.J., Gusella, J.F., and Talkowski, M.E. (2016). Actin Capping Protein Capzb Regulates Cell Morphology, Differentiation, and Neural Crest Migration in Craniofacial Morphogenesis. Hum. Mol. Genet. 25, 1255-1270. DOI: https://doi.org/10.1093/hmg/ddw006

      Sahasrabudhe, A., Ghate, K., Mutalik, S., Jacob, A., and Ghose, A. (2016). Formin 2 Regulates the Stabilization of Filopodial Tip Adhesions in Growth Cones and Affects Neuronal Outgrowth and Pathfinding In Vivo. Development 143, 449-460. DOI: https://doi.org/10.1242/dev.130104

      Serrels, B., Serrels, A., Brunton, V.G., Holt, M., McLean, G.W., Gray, C.H., Jones, G.E., and Frame, M.C. (2007). Focal Adhesion Kinase Controls Actin Assembly via a Ferm-Mediated Interaction with the Arp2/3 Complex. Nat. Cell Biol. 9, 1046-1056. DOI: https://doi.org/10.1038/ncb1626

      Shao, X., Li, Q., Mogilner, A., Bershadsky, A.D., and Shivashankar, G. (2015). Mechanical Stimulation Induces Formin-Dependent Assembly of a Perinuclear Actin Rim. Proc. Natl. Acad. Sci. U.S.A. 112, E2595-E2601. DOI: https://doi.org/10.1073/pnas.1504837112

      Tee, Y.H., Goh, W.J., Yong, X., Ong, H.T., Hu, J., Tay, I.Y.Y., Shi, S., Jalal, S., Barnett, S.F., and Kanchanawong, P. (2023). Actin Polymerisation and Crosslinking Drive Left-Right Asymmetry in Single Cell and Cell Collectives. Nat. Commun. 14, 776. DOI: https://doi.org/10.1038/s41467-023-35918-1

      Tee, Y.H., Shemesh, T., Thiagarajan, V., Hariadi, R.F., Anderson, K.L., Page, C., Volkmann, N., Hanein, D., Sivaramakrishnan, S., Kozlov, M.M., and Bershadsky, A.D. (2015). Cellular Chirality Arising from the Self-Organization of the Actin Cytoskeleton. Nat. Cell Biol. 17, 445-457. DOI: https://doi.org/10.1038/ncb3137

      Tojkander, S., Gateva, G., Schevzov, G., Hotulainen, P., Naumanen, P., Martin, C., Gunning, P.W., and Lappalainen, P. (2011). A Molecular Pathway for Myosin II Recruitment to Stress Fibers. Curr. Biol. 21, 539-550. DOI: https://doi.org/10.1016/j.cub.2011.03.007

      Tsuji, T., Ishizaki, T., Okamoto, M., Higashida, C., Kimura, K., Furuyashiki, T., Arakawa, Y., Birge, R.B., Nakamoto, T., Hirai, H., and Narumiya, S. (2002). Rock and mdia1 Antagonize in Rho-Dependent Rac Activation in Swiss 3T3 Fibroblasts. J. Cell Biol. 157, 819-830. DOI: https://doi.org/10.1083/jcb.200112107

      Wan, L.Q., Ronaldson, K., Park, M., Taylor, G., Zhang, Y., Gimble, J.M., and Vunjak-Novakovic, G. (2011). Micropatterned Mammalian Cells Exhibit Phenotype-Specific Left-Right Asymmetry. Proc. Natl. Acad. Sci. U.S.A. 108, 12295-12300. DOI: https://doi.org/10.1073/pnas.1103834108

      Witke, W. (2004). The Role of Profilin Complexes in Cell Motility and Other Cellular Processes. Trends Cell Biol. 14, 461-469. DOI: https://doi.org/10.1016/j.tcb.2004.07.003

    1. Author Response

      The following is the authors’ response to the original reviews.

      Public Reviews:

      Reviewer #1 (Public Review):

      Summary:

      The authors develop a method to fluorescently tag peptides loaded onto dendritic cells using a two-step method with a tetracysteine-motif-modified peptide and a labelling step done on the surface of live DC using a dye with high affinity for the added motif. The results are convincing in demonstrating in vitro and in vivo T cell activation and efficient label transfer to specific T cells in vivo. The label transfer technique will be useful to identify T cells that have recognised a DC presenting a specific peptide antigen to allow the isolation of the T cell and cloning of its TCR subunits, for example. It may also be useful as a general assay for in vitro or in vivo T-DC communication that can allow the detection of genetic or chemical modulators.

      Strengths:

      The study includes both in vitro and in vivo analysis including flow cytometry and two-photon laser scanning microscopy. The results are convincing and the level of T cell labelling with the fluorescent pMHC is surprisingly robust and suggests that the approach is potentially revealing something about fundamental mechanisms beyond the state of the art.

      Weaknesses:

      The method is demonstrated only at high pMHC density and it is not clear if it can operate at lower peptide doses where T cells normally operate. However, this doesn't limit the utility of the method for applications where the peptide of interest is known. It's not clear to me how it could be used to de-orphan known TCR and this should be explained if they want to claim this as an application. Previous methods based on biotin-streptavidin and phycoerythrin had single pMHC sensitivity, but there were limitations to the PE-based probe so the use of organic dyes could offer advantages.

      We thank the reviewer for the valuable comments and suggestions. Indeed, we have shown and optimized this labeling technique for a commonly used peptide at rather high doses to provide a proof of principle for the possible use of tetracysteine-tagged peptides for in vitro and in vivo studies. However, we completely agree that studies that require different peptides and/or lower pMHC concentrations may require preliminary experiments if the use of biarsenical probes is attempted. We think it can help investigate the functional and biological properties of the peptides for TCRs deorphaned by other techniques. Tetracysteine tagging of such peptides would provide a readily available antigen-specific reagent for the downstream assays and validation. Other possible uses for modified immunogenic peptides could be visualizing the dynamics of neoantigen vaccines or peptide delivery methods in vivo. For these additional uses, we recommend further optimization based on the needs of the prospective assay.

      Reviewer #2 (Public Review):

      Summary:

      The authors here develop a novel Ovalbumin model peptide that can be labeled with a site-specific FlAsH dye to track agonist peptides both in vitro and in vivo. The utility of this tool could allow better tracking of activated polyclonal T cells particularly in novel systems. The authors have provided solid evidence that peptides are functional, capable of activating OTII T cells, and that these peptides can undergo trogocytosis by cognate T cells only.

      Strengths:

      -An array of in vitro and in vivo studies are used to assess peptide functionality.

      -Nice use of cutting-edge intravital imaging.

      -Internal controls such as non-cognate T cells to improve the robustness of the results (such as Fig 5A-D).

      -One of the strengths is the direct labeling of the peptide and the potential utility in other systems.

      Weaknesses:

      1. What is the background signal from FlAsH? The baselines for Figure 1 flow plots are all quite different. Hard to follow. What does the background signal look like without FlAsH (how much fluorescence shift is there from unlabeled cells to No antigen+FlAsH)? How much of the FlAsH in cells is actually conjugated to the peptide? In Figure 2E, it doesn't look like it's very specific to pMHC complexes. Maybe you could double-stain with Ab for MHCII. Figure 4e suggests there is no background without MHCII but I'm not fully convinced. Potentially some MassSpec for FlAsH-containing peptides.

      We thank the reviewer for pointing out a possible area of confusion. In fact, we have done extensive characterization of the background and found that it varies with the batch of FlAsH and TCEP, with the cytometer, and with the oxidation-prone nature of the reagents. Because the Figure 1 subfigures have been derived from different experiments, a combination of the factors above has likely contributed to the inconsistent background. To display the background more objectively, we have now added the No antigen+FlAsH background to the revised Fig 1.

      It is also worth noting that nonspecific FlAsH incorporation can be toxic at increasing doses, and live cells that display high backgrounds may undergo early apoptotic changes in vitro. However, when these cells are adoptively transferred and tracked in vivo, the compromised cells with high background possibly undergo apoptosis and get cleared by macrophages in the lymph node. The lack of clearance in vitro further contributes to different backgrounds between in vitro and in vivo, which we think is also a possible cause for the inconsistent backgrounds throughout the manuscript. Altogether, comparison of absolute signal intensities from different experiments would be misleading and the relative differences within each experiment should be relied upon. We have added further discussion about this issue.

      2. On the flip side, how much of the variant peptides are getting conjugated in cells? I'd like to see some quantification (HPLC or MassSpec). If it's ~10% of peptides that get labeled, this could explain the low shifts in fluorescence and the similar T cell activation to native peptides if FlAsH has any deleterious effects on TCR recognition. But if it's a high rate of labeling, then it adds confidence to this system.

      We agree that mass spectrometry or, more specifically, tandem MS/MS would be an excellent addition to support our claim that peptide labeling by FlAsH is reliable and non-disruptive. Therefore, we have recently undertaken a tandem MS/MS quantitation project with our collaborators. However, this would require significant time to determine the internal-standard-based calibration curves and to run both analytical and biological replicates. Hence, we have decided to pursue this as a follow-up study and have added further discussion on quantification of the FlAsH-peptide conjugates by tandem MS/MS.

      3. Conceptually, what is the value of labeling peptides after loading with DCs? Why not preconjugate peptides with dye, before loading, so you have a cleaner, potentially higher fluorescence signal? If there is a potential utility, I do not see it being well exploited in this paper. There are some hints in the discussion of additional use cases, but it was not clear exactly how they would work. One mention was that the dye could be added in real-time in vivo to label complexes, but I believe this was not done here. Is that feasible to show?

      We have already addressed preconjugation as a possible avenue for labeling peptides. In our hands, preconjugation resulted in low FlAsH intensity overall in both the control and tetracysteine-labeled peptides (Author response image 1). While we don’t have a satisfactory answer as to why the signal was blunted by preconjugation, it could be that the tetracysteine-tagged peptides attract biarsenical compounds better intracellularly. This may be due to the redox potential of the intracellular environment, which limits disulfide bond formation (PMID: 18159092).

      Author response image 1.

      Preconjugation yields poor FlAsH signal. Splenic DCs were pulsed with peptide then treated with FlAsH or incubated with peptide-FlAsH preconjugates. Overlaid histograms show the FlAsH intensities on DCs following the two-step labeling (left) and preconjugation (right). Data are representative of two independent experiments, each performed with three biological replicates.

      4. The imaging data in Figure 5D-F aren't fully convincing. For example, in 5F and 2G, the speeds for T cells with no Ag should be much higher (10-15 micron/min or 0.16-0.25 micron/sec). The fact that yours are much lower suggests technical or biological issues that might need to be acknowledged, or that other readouts such as flow cytometry should be used.

      We thank the reviewer for drawing attention to this technical point. We would like to point out that the imaging data in Figure 5D-F were obtained from agarose-embedded live lymph node sections. Briefly, the lymph nodes were removed, suspended in 2% low-melting-temperature agarose in DMEM and cut into 200 µm sections with a vibrating microtome. Prior to imaging, tissue sections were incubated in complete RPMI medium at 37 °C for 2 h to resume cell mobility. Thus, we think that imaging the cells as they resume motility ex vivo may account for the slightly reduced T cell speeds overall, for both control and antigen-specific T cells (PMID: 32427565, PMID: 25083865). We have added text to remove the ambiguity about the technique used for dynamic imaging. The speeds in Figure 2G come from live imaging of DC-T cell cocultures, in which the basal cell movement could be hampered by the cell density. Additionally, the glass-bottom dishes were coated with fibronectin to facilitate DC adhesion, which may be responsible for the lower average speeds of the T cells in vitro.
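
      For reference, the unit conversion in the reviewer's comment and the way an average speed is derived from a tracked cell can be sketched as follows. The sketch assumes 2D positions in µm sampled at a fixed time interval; the track coordinates and function names are hypothetical and do not come from the imaging data.

```python
from math import hypot


def um_per_min_to_um_per_s(speed_um_per_min):
    """Convert a speed from µm/min to µm/s."""
    return speed_um_per_min / 60.0


def mean_speed_um_per_s(track_um, interval_s):
    """Average speed of a track: total path length divided by elapsed time.

    track_um is a list of (x, y) positions in µm, one per frame;
    interval_s is the time between consecutive frames in seconds.
    """
    if len(track_um) < 2:
        raise ValueError("a track needs at least two positions")
    path_length = sum(hypot(x2 - x1, y2 - y1)
                      for (x1, y1), (x2, y2) in zip(track_um, track_um[1:]))
    return path_length / (interval_s * (len(track_um) - 1))


# The range quoted by the reviewer, expressed in µm/s.
print(um_per_min_to_um_per_s(10), um_per_min_to_um_per_s(15))

# Hypothetical track: 5 µm steps every 30 s.
print(mean_speed_um_per_s([(0, 0), (3, 4), (6, 8)], interval_s=30))
```

      Note that a speed computed this way depends on the frame interval, since longer intervals smooth out short detours and shorten the measured path.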

      Reviewer #1 (Recommendations For The Authors):

      Does the reaction of ReAsH with reactive sites on the surface of DC alter them functionally? Functions have been attributed to redox chemistry at the cell surface- could this alter this chemistry?

      We thank the reviewer for the insight. It is possible that the nonspecific binding of biarsenical compounds to cysteine residues, which we refer to as background throughout the manuscript, contributes to some alterations. One possible way in which biarsenicals could affect redox events in DCs is by reducing glutathione levels (PMID: 32802886). Glutathione depletion is known to impair DC maturation and antigen presentation (PMID: 20733204). To avoid toxicity, we carried out a stringent titration to optimize ReAsH and FlAsH concentrations for labeling and conducted experiments using doses that did not cause overt toxicity or alter DC function.

      Have the authors compared this to a straightforward approach where the peptide is just labelled with a similar dye and incubated with the cell to load pMHC using the MHC knockout to assess specificity? Why is this approach, which involves exposing the DC to a high concentration of TCEP, better than just labelling the peptide? The Davis lab also arrived at a two-step method with biotinylated peptide and streptavidin-PE, but I still wonder if this was really necessary as the sensitivity will always come down to the ability to wash out the reagents that are not associated with the MHC.

      We agree with the reviewer that small, non-disruptive fluorochrome-labeled peptide alternatives would greatly improve the workflow and signal-to-noise ratio. In fact, we have been actively searching for such alternatives since we started working on the tetracysteine-containing peptides. So far, we have tried commercially available FITC- and TAMRA-conjugated OVA323-339 for loading the DCs, but these failed to elicit any discernible signal. We also have an ongoing study in which we have been producing and testing various in-house modified OVA323-339 peptides with fluorogenic properties. Unfortunately, at this moment, the ones that provided a crisp, bright signal for loading were also incorporated into the DC membrane in a nonspecific fashion and were taken up by non-cognate T cells from double antigen-loaded DCs. We are actively pursuing this area of investigation and developing better optimized peptides with low/non-significant membrane incorporation.

      Lastly, we would like to point out that tetracysteine tags are visible by transmission electron microscopy without FlAsH treatment. Thus, this application could add a new dimension for addressing questions about the antigen/pMHCII loading compartments in future studies. We have now added more in-depth discussion about the setbacks and advantages of using tetracysteine labeled peptides in immune system studies.

      The peptide dosing at 5 µM is high compared to the likely sensitivity of the T cells. It would be helpful to titrate the system down to the EC50 for the peptide, which may be nM, and determine if the specific fluorescence signal can still be detected in the optimal conditions. This will not likely be useful in vivo, but it will be helpful to see if the labelling procedure would impact T cell responses when antigen is limited, which will be more of a test. At 5 µM it's likely the system is at a plateau and even a 10-fold reduction in potency might not impact the T cell response, but it would shift the EC50.

      We thank the reviewer for the comment and suggestion. We agree that it is possible to miss minimally disruptive effects at 5 µM and that titrating the native vs. modified peptide down to nM doses would give us a clearer view. This can certainly be addressed in future studies, and also with other peptides with different affinity profiles. We chose a relatively high dose for this study because lowering the peptide dose cost us the specific FlAsH signal; we therefore proceeded with the lowest possible peptide concentration.
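
The reviewer's plateau argument can be illustrated with a simple Hill dose-response curve. This is only a sketch under stated assumptions: the EC50 values (10 nM for the native peptide, a 10-fold shift for the modified one) and the Hill coefficient of 1 are hypothetical and are not measurements from the study.

```python
def hill_response(conc_um, ec50_um, hill=1.0):
    """Fractional response of a simple Hill dose-response curve."""
    return conc_um**hill / (ec50_um**hill + conc_um**hill)

# Hypothetical potencies, for illustration only (not values from the study).
EC50_NATIVE_UM = 0.01    # assumed 10 nM for the native peptide
EC50_MODIFIED_UM = 0.10  # assumed 10-fold loss of potency after modification

for dose_um in (5.0, 0.01):
    native = hill_response(dose_um, EC50_NATIVE_UM)
    modified = hill_response(dose_um, EC50_MODIFIED_UM)
    print(f"{dose_um:>5} uM: native={native:.3f}, modified={modified:.3f}")
```

Under these assumptions the two curves differ by less than 2% of the maximal response at 5 µM, whereas at 10 nM the native peptide gives a half-maximal response and the modified one gives under 10%, which is why a titration would be the more sensitive test.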

      In Fig 3b the level of background in the dsRed channel is very high after DC transfer. What cells is this associated with and does this appear to be debris? Also, I wonder where the ReAsH signal is in the experiments in general. I believe this is a red dye and it would likely be quite bright given the reduction of the FlAsH signal. Will this signal overlap with signals like dsRed and PKH-26 if the DC is also treated with this to reduce the FlAsH background?

      We have already shown that the ReAsH signal, together with DsRed, can be used for cell-tracking purposes, as neither is transferred to other cells during antigen-specific interactions (Author response image 2). In fact, combining their exceptionally bright fluorescence provided us with a robust signal to track the adoptively transferred DCs in the recipient mice. On the other hand, the lipophilic membrane dye PKH-26 gets transferred by trogocytosis, while the remaining signal contributes to the red fluorescence for tracking DCs. Therefore, the signal that we show to be transferred from DCs to T cells comes only from the lipophilic dye. We have added a sentence to the results section to elaborate on this. Regarding the reviewer’s comment on the DsRed background in Figure 3b, we agree that the background from cells outside the gate in recipient mice seems slightly higher than that of the control mice. This may suggest that macrophages clearing up debris from apoptotic/dying DCs contribute to the background in the recipient lymph node. Nevertheless, it does not contribute to any DsRed/ReAsH signal in the antigen-specific T cells.

      Author response image 2.

      ReAsH and DsRed are not picked up by T cells during immune synapse. DsRed+ DCs were labeled with ReAsH, pulsed with 5 μM OVACACA, labeled with FlAsH and adoptively transferred into CD45.1 congenic mice (1-2 × 10⁶ cells) via footpad. Naïve e450-labeled OTII and e670-labeled polyclonal CD4+ T cells were mixed 1:1 (0.25-0.5 × 10⁶ per T cell type) and injected i.v. Popliteal lymph nodes were removed at 42 h post-transfer and analyzed by flow cytometry. Overlaid histograms show the ReAsH/DsRed, MHCII and FlAsH intensities of the T cells. Data are representative of two independent experiments with n=2 mice per group.

      In Fig 5b there is a missing condition. If they look at Ea-specific T cells for DC with or without the Ova peptide, do they see no transfer of PKH-26 to the OTII T cells? Also, the MFI of the FlAsH signal transferred to the T cells seems very high compared to other experiments. Can the author estimate the number of peptides transferred (this should be possible) and would each T cell need to be collecting antigens from multiple DC? Could the debris from dead DC also contribute to this if picked up by other DC or even directly by the T cells? Maybe this could be tested by transferring DC that are killed (perhaps by sonication) prior to inoculation?

      To address the reviewer’s question on PKH-26 acquisition by T cells: Ea T cells pick up PKH-26 from Ea+OVA double-pulsed DCs, but not from unpulsed or single OVA-pulsed DCs. OTII T cells acquire PKH-26 from OVA-pulsed DCs, whereas Ea T cells don’t (as expected) and serve as an internal negative control for that condition. Regarding the reviewer’s comment on the high FlAsH signal intensity of T cells in Figure 5b, a plausible explanation is that the T cells accumulate pMHCII through serial engagements with APCs. In fact, a comparison of the T cell FlAsH intensities at 18 h and 36-48 h post-transfer demonstrates an increase (Author response image 3) and thus hints at a cumulative signal. As DCs are known to be short-lived after adoptive transfer, the debris of dying DCs, along with its peptide content, may indeed be passed on to macrophages, neighboring DCs and eventually back to T cells again (or for the first time, depending on the T:DC ratio, which may not allow all T cells to contact the transferred DCs within the limited time frame). We agree that the number and the quality of such contacts can be gauged using fluorescent peptides. However, we think peptides chemically conjugated to fluorochromes with optimized signal-to-noise profiles and a less oxidation-prone nature would be more suitable for quantification purposes.

      Author response image 3.

      FlAsH signal acquisition by antigen-specific T cells becomes more prominent at 36-48 h post-transfer. DsRed+ splenic DCs were double-pulsed with 5 μM OVACACA and 5 μM OVA-biotin and adoptively transferred into CD45.1 recipients (2 × 10⁶ cells) via footpad. Naïve e450-labeled OTII (1 × 10⁶ cells) and e670-labeled polyclonal T cells (1 × 10⁶ cells) were injected i.v. Popliteal lymph nodes were analyzed by flow cytometry at 18 h or 48 h post-transfer. Overlaid histograms show the T cell levels of OVACACA (FlAsH). Data are representative of three independent experiments with n=3 mice per time point.

      Reviewer #2 (Recommendations For The Authors):

      As mentioned in weaknesses 1 & 2, more validation of how much of the FlAsH fluorescence is on agonist peptides and how much is non-specific would improve the interpretation of the data. Another option would be to preconjugate peptides but that might be a significant effort to repeat the work.

      We agree that mass spectrometry would be the gold-standard technique to measure the percentage of tetracysteine-tagged peptide that is conjugated to FlAsH in DCs. However, due to the scope of such an endeavour, this can only be addressed in a separate follow-up study. As for the preconjugation, we have tried it and unfortunately failed to get it to work (Reviewer Figure 1). Therefore, we have shifted our focus to generating in-house peptide probes that are chemically conjugated to stable and bright fluorophore derivatives. With that, we aim to circumvent the problems that the two-step FlAsH labeling poses.

      Along those lines, do you have any way to quantify how many peptides you are detecting based on fluorescence? Being able to quantify the actual number of peptides would push the significance up.

      We think the two-step procedure and the background would pose challenges to such quantification in this study. Although it would provide tremendous insight into antigen-specific T cell-APC interactions in vivo, we think it should be performed using peptides chemically conjugated to fluorochromes with optimized signal-to-noise profiles.

      In Figure 3D or 4 does the SA signal correlate with Flash signal on OT2 cells? Can you correlate Flash uptake with T cell activation, downstream of TCR, to validate peptide transfers?

      To answer the reviewer’s question about FlAsH and SA correlation, we have revised Figure 3d to show the correlation between OTII uptake of FlAsH, streptavidin and MHCII. We also thank the reviewer for the suggestion on correlating FlAsH uptake with T cell activation and/or events downstream of TCR activation. We have used proliferation and CD44 expression as proxies of activation (Fig 2, 6). Nevertheless, we agree that the early events corresponding to the initiation of the T-DC synapse and FlAsH uptake would be valuable for demonstrating the temporal relationship between peptide transfer and activation. Therefore, we have addressed this in the revised discussion.

      Author response image 4.

      FlAsH signal acquisition by antigen-specific T cells correlates with OVA-biotin (SA) and MHCII uptake. DsRed+ splenic DCs were double-pulsed with 5 μM OVACACA and 5 μM OVA-biotin and adoptively transferred into CD45.1 recipients (2 × 10⁶ cells) via footpad. Naïve e450-labeled OTII (1 × 10⁶ cells) and e670-labeled polyclonal T cells (1 × 10⁶ cells) were injected i.v. Popliteal lymph nodes were analyzed by flow cytometry. Overlaid histograms show the T cell levels of OVACACA (FlAsH) at 48 h post-transfer. Data are representative of three independent experiments with n=3 mice.

      Minor:

      Figure 3F, 5D, and videos: Can you color-code polyclonal T cells a different color than magenta (possibly white or yellow), as they have the same look as the overlay regions of OT2-DC interactions (Blue+red = magenta).

      We apologize for the inconvenience about the color selection. We have had difficulty in assigning colors that are bright and distinct. Unfortunately, yellow and white have also been easily mixed up with the FlAsH signal inside red and blue cells respectively. We have now added yellow and white arrows to better point out the polyclonal vs. antigen specific cells in 3f and 5d.

    1. Author response:

      The following is the authors’ response to the original reviews.

      eLife assessment

      This important study combines fMRI and electrophysiology in sedated and awake rats to show that LFPs strongly explain spatial correlations in resting-state fMRI but only weakly explain temporal variability. The authors propose that other, electrophysiology-invisible mechanisms contribute to the fMRI signal. The evidence supporting the separation of spatial and temporal correlations is convincing; however, the support for electrophysiology-invisible mechanisms is incomplete, considering alternative potential factors that could account for the differences in spatial and temporal correlation that were observed. This work will be of interest to researchers who study the fundamental mechanisms behind resting-state fMRI.

      We appreciate the encouraging comments. We added a section to the Discussion that thoroughly discusses the potential alternative factors that could account for the differences in spatial and temporal correlation that we observed.

      Public Reviews:

      Reviewer #1 (Public Review):

      Tu et al investigated how LFPs recorded simultaneously with rsfMRI explain the spatiotemporal patterns of functional connectivity in sedated and awake rats. They find that connectivity maps generated from gamma band LFPs (from either area) explain very well the spatial correlations observed in rsfMRI signals, but that the temporal variance in rsfMRI data is more poorly explained by the same LFP signals. The authors excluded the effects of sedation in this effect by investigating rats in the awake state (a remarkable feat in the MRI scanner), where the findings generally replicate. The authors also performed a series of tests to assess multiple factors (including noise, outliers, and nonlinearity of the data) in their analysis.

      This apparent paradox is then explained by a hypothetical model in which LFPs and neurovascular coupling are generated in some sense "in parallel" by different neuron types, some of which drive LFPs and are measured by ePhys, while others (nNOS, etc.) have an important role in neurovascular coupling but are less visible in Ephys data. Hence the discrepancy is explained by the spatial similarity of neural activity but the more "selective" LFPs picked up by Ephys account for the different temporal aspects observed.

      This is a deep, outstanding study that harnesses multidisciplinary approaches (fMRI and ephys) for observing brain activity. The results are strongly supported by the comprehensive analyses done by the authors, which ruled out many potential sources for the observed findings. The study's impact is expected to be very large.

      Comment: There are very few weaknesses in the work, but I'd point out that the 1-second temporal resolution may have masked significant temporal correlations between LFPs and spontaneous activity, for instance, as shown by Cabral et al Nature Communications 2023, and even in earlier QPP work from the Keilholz Lab. The synchronization of the LFPs may correlate more with one of these modes than the total signal. Perhaps a kind of "dynamic connectivity" analysis on the authors' data could test whether LFPs correlate better with the activity at specific intervals. However, this could purely be discussed and left for future work, in my opinion.

      We appreciate this great point. Indeed, it is likely that LFP and rsfMRI signals are more strongly related during some modes/instances than others, and hence correlation across the entire time series may have masked this effect. In addition, we agree that 1-second temporal resolution may obscure some temporal correlations between LFPs and rsfMRI signal. The choice of 1-second temporal resolution was made to be consistent with the TR in our fMRI experiment, considering the slow hemodynamic response. Ultrafast fMRI imaging combined with dynamic connectivity analysis in a future study might enable more detailed examination of BOLD-LFP temporal correlations at higher temporal resolutions. We have added the following paragraph to the revised manuscript:

      “Our proposed theoretic model represents just one potential explanation for the apparent discrepancy in temporal and spatial relationships between resting-state electrophysiology and BOLD signals. It is important to acknowledge that there may be other scenarios where a stronger temporal relationship between LFP and BOLD signals could manifest. For instance, recent research suggests that the relationship between LFP and rsfMRI signals may vary across different modes or instances (Cabral et al., 2023), which can be masked by correlations across the entire time series. Moreover, the 1-second temporal resolution employed in our study may obscure certain temporal correlations between LFPs and rsfMRI signals. Future investigations employing ultrafast fMRI imaging coupled with dynamic connectivity analysis could offer a more nuanced exploration of BOLD-LFP temporal correlations at higher temporal resolutions (Bolt et al., 2022; Cabral et al., 2023; Ma and Zhang, 2018; Thompson et al., 2014).”

      Reviewer #2 (Public Review):

      The authors address a question that is interesting and important to the sub-field of rsfMRI that examines electrophysiological correlates of rsfMRI. That is, while electrophysiology-produced correlation maps often appear similar to correlation maps produced from BOLD alone (as has been shown in many papers) is this actually coming from the same source of variance, or independent but spatially-correlated sources of variance? To address this, the authors recorded LFP signals in 2 areas (M1 and ACC) and compared the maps produced by correlating BOLD with them to maps produced by BOLD-BOLD correlations. They then attempt to remove various sources of variance and see the results.

      The basic concept of the research is sound, though primarily of interest to the subset of rsfMRI researchers who use simultaneous electrophysiology. However, there are major problems in the writing, and also a major methodological problem.

      Major problems with writing:

      Comment 1: There is substantial literature on rats on site-specific LFP recording compared to rsfMRI, and much of it already examined removing part of the LFP and examining rsfMRI, or vice versa. The authors do not cover it and consider their work on signal removal more novel than it is.

      We have added more literature studies to the revised manuscript. It is important to note that while there exists a substantial body of literature on site-specific LFP recording coupled with rsfMRI, our paper makes a significant contribution by unveiling the disparity in temporal and spatial relationships between resting-state electrophysiological and fMRI signals. This goes beyond mere reporting of spatial/temporal correlations. Furthermore, our exploration of the impact of removing LFP on rsfMRI spatial patterns constitutes one among several analyses employed to demonstrate that the temporal fluctuations of LFP minimally affect BOLD-derived RSN spatial patterns. We wish to clarify that our intention is not to claim this aspect of our work is more novel than similar analyses conducted in previous studies (we apologize if our original manuscript conveyed that impression). Rather, the novelty lies in the objective of this analysis, which is to elucidate the disparity in temporal and spatial relationships between resting-state electrophysiological and fMRI signals, a crucial issue that has not been thoroughly addressed previously.

      Comment 2: The conclusion of the existence of an "electrophysiology-invisible signal" is far too broad considering the limited scope of this study. There are many factors that can be extracted from LFP that are not used in this study (envelope, phase, infraslow frequencies under 0.1Hz, estimated MUA, etc.) and there are many ways of comparing it to the rsfMRI data that are not done in this study (rank correlation, transformation prior to comparison, clustering prior to comparison, etc.). The one non-linear method used, mutual information, is low sensitivity and does not cover every possible nonlinear interaction. Mutual information is also dependent upon the number of bins selected in the data. Previous studies (see 1) have seen similar results where fMRI and LFP were not fully commensurate but did not need to draw such broad conclusions.
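
The bin dependence of mutual information raised in this comment can be sketched with a plug-in (histogram) estimator. This is an illustrative sketch, not the estimator used in the manuscript: the function names, the equal-width binning, and the simulated weakly coupled signals are all assumptions made for the example.

```python
import math
import random
from collections import Counter

def binned(values, bins):
    """Map each value to an equal-width bin index in [0, bins - 1]."""
    lo, hi = min(values), max(values)
    width = (hi - lo) or 1.0  # guard against constant signals
    return [min(int((v - lo) / width * bins), bins - 1) for v in values]

def mutual_information(x, y, bins):
    """Plug-in (histogram) estimate of mutual information, in bits."""
    bx, by = binned(x, bins), binned(y, bins)
    n = len(x)
    px, py, pxy = Counter(bx), Counter(by), Counter(zip(bx, by))
    return sum(
        (c / n) * math.log2(c * n / (px[i] * py[j]))
        for (i, j), c in pxy.items()
    )

# Simulated stand-ins for two weakly coupled signals (not real LFP/BOLD data).
random.seed(0)
x = [random.gauss(0, 1) for _ in range(2000)]
y = [0.5 * xi + random.gauss(0, 1) for xi in x]

# The same pair of signals gives a different estimate for each bin count.
for bins in (4, 8, 16, 32):
    print(bins, round(mutual_information(x, y, bins), 3))
```

With a finite number of samples, the plug-in estimate generally grows as the bin count increases, so mutual-information values are only comparable when the binning is held fixed.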

      First, we would like to clarify that the existence of an "electrophysiology-invisible signal" is not necessarily a conclusion of the present study, per se, as described by the reviewer. As we stated in our manuscript, it is a proposed theoretical model. We fully acknowledge that this model represents just one potential explanation for the apparent discrepancy in temporal and spatial relationships between resting-state electrophysiology and BOLD signals. It is important to acknowledge that there may be other scenarios where a stronger temporal relationship between LFP and BOLD signals could manifest. This issue has been further clarified in the revised manuscript (see the section of Potential pitfalls).

      We agree with the reviewer that not all factors that can be extracted from LFP are examined. In our current study we focused solely on band-limited LFP power as the primary feature in our analysis, given its prevalence in prior studies of LFP-rsfMRI correlates. More importantly, we demonstrate that band-specific LFP powers can yield spatial patterns nearly identical to those derived from rsfMRI signals, prompting a closer examination of the temporal relationship between these same features. Furthermore, since correlational analysis was used in studying the LFP-BOLD spatial relationship, we used the same analysis method when comparing their temporal relationship. 

      Extracting all possible features from the electrophysiology signal and examining their relationship with the rsfMRI signal, or exploring all other ways of comparing LFP and rsfMRI signals, goes beyond the scope of the current study. However, to address the reviewer’s concern, we tried a couple of the analysis methods suggested by the reviewer, and the results remained consistent. Figure S14 shows the results from (A) the rank correlation and (B) z transformation prior to comparison. We added these new results to the revised manuscript.
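
The two alternative comparisons mentioned above can be sketched as follows. This is a minimal illustration, not the authors' analysis code: the toy series are made up, and "z transformation" is read here as z-scoring each time series before correlating, which is an assumption about what was done.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def ranks(values):
    """Average ranks (1-based), so tied values share the same rank."""
    order = sorted(range(len(values)), key=values.__getitem__)
    out = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            out[order[k]] = avg
        i = j + 1
    return out

def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))

def zscore(values):
    """Subtract the mean and divide by the (population) standard deviation."""
    n = len(values)
    m = sum(values) / n
    sd = math.sqrt(sum((v - m) ** 2 for v in values) / n)
    return [(v - m) / sd for v in values]

# Toy, made-up series standing in for HRF-convolved LFP power and BOLD.
lfp_power = [1.0, 2.0, 4.0, 8.0, 16.0, 32.0]
bold = [0.1, 0.3, 0.2, 0.5, 0.4, 0.9]

print(pearson(lfp_power, bold), spearman(lfp_power, bold))
print(pearson(zscore(lfp_power), zscore(bold)))  # equals the raw Pearson value
```

Rank correlation is sensitive to any monotonic relationship, so it can differ from the Pearson value when the coupling is nonlinear. Pearson correlation is unchanged by z-scoring, so under this reading an unchanged result after z transformation is expected.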

      Comment 3: The writing refers to the spatial extent of correlation with the LFP signal as "spatial variance." However, LFP was recorded from a very limited point and the variance in the correlation map does not necessarily reflect underlying electrophysiological spatial distributions (e.g. Yu et al. Nat Commun. 2023 Mar 24;14(1):1651.)

      The reviewer accurately pointed out that in our paper, “spatial variance” refers to the spatial variance of BOLD correlates with the LFP signal. Our objective is to assess the extent to which this spatial variance, which is derived from the neural activity captured by LFP in the M1 or ACC, corresponds to the BOLD-derived spatial patterns from the same regions. We acknowledge that this spatial variance may differ from the spatial map obtained by multi-site electrophysiology recordings. Nevertheless, numerous studies have consistently reported a high spatial correspondence between BOLD- and electrophysiology-derived RSNs using various methodologies across different physiological states in both humans and animals. For instance, research employing electroencephalography (EEG) or electrocorticography (ECoG) in humans demonstrates that RSNs derived from the power of multiple-site electrophysiological signals exhibit similar spatial patterns to classic BOLD-derived RSNs such as the default-mode network (Hacker et al., 2017; Kucyi et al., 2018). These studies agree well with our findings. Notably, the reference paper cited by the reviewer studies brain-wide changes during transitions between awake and various sleep stages, which is quite different from the brain states examined in our study.

      Major method problem:

      Comment 4: Correlating LFP to fMRI is correlating two biological signals, with unknown but presumably not uniform distributions. However, correlating CC results from correlation maps is comparing uniform distributions. This is not a fair comparison, especially considering that the noise added is also uniform as it was created with the rand() function in MATLAB.

      This is a good point. We examined the distributions of both LFP powers and fMRI signals. They both seem to follow a normal distribution. The distributions of the two signals from a random scan are shown below. In addition, z transformation prior to comparison generated the same results (Fig. S14).

      Author response image 1.

      Exemplar distributions of A) the fMRI signal of M1, and B) HRF-convolved LFP power in M1.
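
The reviewer's suggestion in Comment 4 (transform correlation coefficients to Z values and add Gaussian rather than uniform noise) can be sketched as below. This is an illustrative sketch, not the authors' simulation: the correlation values and the noise level are made up, `random.gauss` stands in for MATLAB's `randn()`, and `random.random` stands in for `rand()`.

```python
import math
import random

def fisher_z(r):
    """Fisher z-transform of a correlation coefficient, atanh(r)."""
    return math.atanh(r)

random.seed(1)
cc_map = [0.10, 0.35, 0.60, 0.85]  # made-up correlation values, not real data
z_map = [fisher_z(r) for r in cc_map]

NOISE_SD = 0.2  # hypothetical noise level

# Zero-mean Gaussian noise, the analogue of randn().
noisy_gauss = [z + random.gauss(0.0, NOISE_SD) for z in z_map]

# Uniform noise on [0, 1), the analogue of rand(); it is never negative,
# so it shifts every value upward instead of scattering around it.
noisy_unif = [z + NOISE_SD * random.random() for z in z_map]

# Map the noisy Z values back to the bounded correlation scale.
back = [math.tanh(z) for z in noisy_gauss]
print(back)
```

The transform stretches values near ±1, so equal steps in Z correspond to progressively smaller steps in the correlation coefficient, and zero-mean noise added on the Z scale always maps back to a valid correlation.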

      Reviewer #1 (Recommendations For The Authors):

      Comment 1: In the Discussion, a few more calcium imaging papers could be fruitfully discussed (e.g. Ma et al Resting-state hemodynamics are spatiotemporally coupled to synchronized and symmetric neural activity in excitatory neurons, PNAS 2016, or more recently Vafaii et al, Multimodal measures of spontaneous brain activity reveal both common and divergent patterns of cortical functional organization, Nat Comms 2024).

      We appreciate this suggestion. We have added the following discussions to the revised manuscript: 

      “These findings indicate the temporal information provided by gamma power can only explain a minor portion (approximately 35%) of the temporal variance in the BOLD time series, even after accounting for the noise effect, which is in line with the reported correlation value between the cerebral blood volume and fluctuations in GCaMP signal in head-fixed mice during periods of immobility (R = 0.63) (Ma et al., 2016).” 

      “It is plausible that employing different features or comparison methods could yield a stronger BOLD-electrophysiology temporal relationship (Ma et al., 2016).”

      “Furthermore, in a more recent study by Vafaii and colleagues, overlapping cortical networks were identified using both fMRI and calcium imaging modalities, suggesting that networks observable in fMRI studies exhibit corresponding neural activity spatial patterns (Vafaii et al., 2024).” 

      “Furthermore, Vafaii et al. revealed notable differences in functional connectivity strength measured by fMRI and calcium imaging, despite an overlapping spatial pattern of cortical networks identified by both modalities (Vafaii et al., 2024).”
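
The first quoted passage compares a share of variance explained with a correlation coefficient; for a linear fit the two are linked by squaring. A brief arithmetic sketch, using only the two numbers quoted above (the function name is illustrative):

```python
import math

def variance_explained(r):
    """Fraction of variance explained by a linear fit with correlation r."""
    return r * r

# R = 0.63 is the value quoted from Ma et al., 2016.
print(variance_explained(0.63))

# Correlation that corresponds to the quoted 35% of variance explained.
print(math.sqrt(0.35))
```

A correlation of 0.63 corresponds to roughly 40% of variance explained, and 35% of variance corresponds to a correlation of about 0.59, so the two quoted figures are of similar size.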

      Comment 2: Similarly when discussing the "invisible" populations, perhaps Uhlirova et al eLife 2016 should be mentioned as some types of inhibitory processes may also be less clearly observed in LFPs but rather strongly contribute to NVC.

      We appreciate the suggestion. We added the following sentences to the revised manuscript. 

      “Additionally, Uhlirova et al. conducted a study where they utilized optogenetic stimulation and two-photon imaging to investigate how the activation of different neuron types affects blood vessels in mice. They discovered that only the activation of inhibitory neurons led to vessel constriction, albeit with a negligible impact on LFP (Uhlirova et al., 2016).”

      Reviewer #2 (Recommendations For The Authors):

      Major problems with writing:

      Comment 1: The authors need to review past work to better place their study in the context of the literature (some review articles: Lurie et al. Netw Neurosci. 2020 Feb 1;4(1):30-69. & Thompson et al. Neuroimage. 2018 Oct 15;180(Pt B):448-462.)

      Here are some LFP and BOLD "resting state" papers focused on dynamic changes.

      Many of these papers examine both spatial and temporal extents of correlations. Several of these papers use similar methods to the reviewed paper.

      Also, many of these papers dispute the claim that correlations seen are "electrophysiology invisible signal." Note that I am NOT saying that "electrophysiology invisible" correlations do not exist (it seems very likely some DO exist). However, the authors did not show that in the reviewed paper, and some of the correlations which they call an "electrophysiology invisible signal" probably would be visible if analyzed in a different manner.

      Quite a few literature studies that the reviewer suggested were already included in the original manuscript. We have also added more literature studies to the revised manuscript. Again, we would like to emphasize that the novelty of our study centers on the discovery of the disparity in temporal and spatial relationships between resting-state electrophysiological and fMRI signals. See below our responses to individual literature studies listed.

      In humans:

      https://pubmed.ncbi.nlm.nih.gov/38082179/ Predicts by using models the paper under review does not use here.

      The following discussion was added to the revised manuscript: 

      “Some other comparison methods such as rank correlation and transformation prior to comparison were also tested and results remain persistent (Fig. S14). These findings align with the notion that, compared to nonlinear models, linear models offer superior predictive value for the rsfMRI signal using LFP data, as comprehensively illustrated in (Nozari et al., 2024) (also see Fig. S7). Importantly, in this study, the predictive powers (represented by R2) of various comparison methods tested all remain below 0.5 (Nozari et al., 2024), suggesting that while certain models may enhance the temporal relationship between LFP and BOLD signals, the improvement is likely modest.”

      In nonhuman primates: https://pubmed.ncbi.nlm.nih.gov/34923136/ Most of the variance that could be creating resting state networks is in the <1 Hz band which the paper under review did not study

      We also examined infraslow LFP activity (< 1 Hz) in our data. Consistent with the finding in the reference paper (Li et al., 2022), infraslow LFP power and the BOLD signal can derive consistent RSN spatial patterns (for M1, spatial correlation = 0.70), while the temporal correlation remains very low (temporal correlation = 0.08). These results and the reference paper were added to the revised manuscript.

      https://pubmed.ncbi.nlm.nih.gov/28461461/ Compares actual spread of LFP vs. spread of BOLD instead of just correlation between LFP and BOLD.

      The following sentence has been added to the revised manuscript.

      “This high spatial correspondence between rsfMRI and LFP signals can even be found at the columnar level (Shi et al., 2017).”   

      https://pubmed.ncbi.nlm.nih.gov/24048850/ Comparison of small (from LFP) to large (from BOLD) spatial correlations in the context of temporal correlations.

      In this study, researchers compared neurophysiological maps and fMRI maps of the inferior temporal cortex in macaques in response to visual images. They observed that the spatial correlation increased as the neurophysiological maps were subjected to greater levels of spatial smoothing. This suggests that fMRI can capture large-scale spatial information, but it may be limited in capturing fine details. Although interesting, this paper did not study the electrophysiology-fMRI relationship at the resting state and hence is not very relevant to our study.

      https://pubmed.ncbi.nlm.nih.gov/20439733/ Electrophysiology from a single site can correlate across nearly the entire cerebral cortex.

      We have included the discussion of this paper in the original manuscript.

      https://pubmed.ncbi.nlm.nih.gov/18465799/ The original dynamic BOLD and LFP work from 2008 by Shmuel and Leopold included spatiotemporal dynamics.

      We have included the discussion of this paper in the original manuscript.

      In rodents:

      https://pubmed.ncbi.nlm.nih.gov/34296178/ Better electrophysiological correspondence was found using alternate methods the paper under review does not use.

      This study investigates the electrophysiological correspondence in task-based fMRI, while our study focused on resting-state signals.

      https://pubmed.ncbi.nlm.nih.gov/31785420/ Electrophysiological basis of co-activation patterns, similar comparisons to the paper under review.

      We have included the discussion of this paper in the original manuscript.

      https://pubmed.ncbi.nlm.nih.gov/29161352/ Cross-frequency coupling of LFP modulating the BOLD, perhaps more so than raw amplitudes.

      This paper investigated the impact of AMPA microinjections in the VTA and found reduced ventral striatal functional connectivity, correlation between the delta band and BOLD signal, and phase–amplitude coupling of low-frequency LFP and high-frequency LFP, suggesting changes in low-frequency LFP might modulate the BOLD signal.

      Consistent with our study, we also found that low-frequency LFP is negatively coupled with the BOLD signal, but we did not investigate changes in neurovascular coupling with disturbed neural activity using pharmacological methods, and hence, we did not discuss this paper in our study.

      https://pubmed.ncbi.nlm.nih.gov/24071524/ This paper did the same kind of tests comparing LFP-BOLD correlations to BOLD-BOLD correlations as the paper under review.

      This study examined the neural mechanism underpinning dynamic resting-state fMRI, revealing a spatiotemporal coupling of infra-slow neural activity with a quasiperiodic pattern (QPP). While our current investigation centered on stationary resting-state functional connectivity, we acknowledge that dynamic analysis will provide additional value for investigating the relationship between LFP and rsfMRI signals. This warrants more investigation in a future study. This point has been added to the revised manuscript.

      https://pubmed.ncbi.nlm.nih.gov/24904325/ This paper found that different frequencies of electrophysiology (including ones not studied in the reviewed paper) contribute independently to the BOLD signal

      This paper identified phase-amplitude coupling in rats anesthetized with isoflurane but not with dexmedetomidine, indicating that this coupling arises from a special type of neural activity pattern, burst-suppression, which was probably induced by high-dose isoflurane. They conjectured that high- and low-frequency neural activities may independently or differentially influence the BOLD signal. Our study also examined the influence of various LFP frequency bands on the BOLD signal and found an inverse LFP-BOLD relationship between low- and high-frequency LFP powers. We also added more results on the analysis of infraslow LFP signals. Regardless, since the reference study did not examine the spatial relationship of LFP and BOLD activities, we cannot comment on how it may provide insight into our results.

      https://pubmed.ncbi.nlm.nih.gov/26041826/ This paper found electrophysiological correlates within the BOLD signal when using BOLD analysis methods not used in the reviewed paper, and furthermore that some of these correlate with electrophysiological frequencies not studied in the reviewed paper (< 1 Hz).

      We have added more results on the analysis of infraslow LFP signals and acknowledged the value of dynamic rsfMRI analysis in studies of the BOLD-electrophysiology relationship.

      I am not saying the authors need to use all these methods or even cite these papers. As I stated in their review, they merely need to (1) cite some of the most relevant for the proper context, the above list can maybe help (2) remove the claim of an "electrophysiology invisible signal" (3) use terms more commonly used in these papers for the extent of correlation with the electrode, other than "spatial variance."

      We thank the reviewer again for providing a detailed list of reference studies. We have added the related discussion to the revised manuscript as described above.

      Comment 2: The abstract entirely and much of the rest of the paper should be rewritten to be more reasonable. The authors would do well to review some of the past controversies in this area, e.g. Magri et al. J Neurosci. 2012 Jan 25;32(4):1395-407.

      We have made significant revisions to improve the writing of the paper. The reference paper has been added to the revised manuscript.

      Comment 3: This should be re-written and the terminology used here should be chosen more carefully.

      The writing of the manuscript has been improved with more careful choice of terminology.    

      Major method problem:

      Comment 4: At a minimum, the authors should be transforming the uniform distribution of CC results to Z or T values and using randn() instead of rand() in MATLAB.

      Below is the figure illustrating the simulation results by transforming CC values to Z score. Results obtained remain consistent.

      Author response image 2.
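
      For readers unfamiliar with the transformation requested in Comment 4, a minimal sketch follows. It is written in Python rather than MATLAB, and the function names and the sample-size argument `n` are illustrative assumptions; it is not the code behind the simulation shown above.

```python
import math


def fisher_z(r: float) -> float:
    """Fisher transform: maps a correlation in (-1, 1) onto the real line."""
    return math.atanh(r)


def z_score(r: float, n: int) -> float:
    """Approximate standard-normal score for a correlation from n samples.

    Under the null hypothesis of zero correlation, atanh(r) is roughly
    normal with standard deviation 1 / sqrt(n - 3).
    """
    if n <= 3:
        raise ValueError("n must exceed 3")
    return fisher_z(r) * math.sqrt(n - 3)
```

      Because the transformed values are approximately normal, a matching null distribution would be drawn from a standard normal generator (the role `randn()` plays in MATLAB) rather than a uniform one.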

      Minor problems:

      Comment 5: "MR-510 compatible electrodes (MRCM16LP, NeuroNexus Inc)"

      Details of this type of electrode are not readily available. But for studies like this one, further information on materials is critical as this determines the frequency coverage, which is not even across all LFP frequencies for all materials. Most commercially prepared electrodes cannot record <1Hz accurately, and this study includes at least 0.11Hz in some of its analysis.

      The type of electrode used in our current study is a silicon-based micromachined probe. These probes are fabricated using photolithographic techniques to pattern thin layers of conductive materials onto a silicon substrate. This probe is capable of recording LFP activity over a broad frequency range, starting from 0.1 Hz. We have added this information to the revised manuscript.

      Comment 6: Grounding to the cerebellum in theory would remove global conduction from the LFP but also global signal regression is done to the fMRI. Does the LFP-rsfMRI correlation change due to the regression or does only the rsfMRI-rsfMRI correlation change?

      The results obtained with global signal regression were consistent with those obtained without it (see Figs. S4-S5), and therefore, we do not believe our results are affected by this preprocessing step. 

      Comment 7. Avoid colloquial language like "on the other hand" etc.

      We used more appropriate language in the revised manuscript.

      References:

      Bolt, T., Nomi, J.S., Bzdok, D., Salas, J.A., Chang, C., Thomas Yeo, B.T., Uddin, L.Q., Keilholz, S.D., 2022. A parsimonious description of global functional brain organization in three spatiotemporal patterns. Nat Neurosci 25, 1093-1103.

      Cabral, J., Fernandes, F.F., Shemesh, N., 2023. Intrinsic macroscale oscillatory modes driving long range functional connectivity in female rat brains detected by ultrafast fMRI. Nat Commun 14, 375.

      Hacker, C.D., Snyder, A.Z., Pahwa, M., Corbetta, M., Leuthardt, E.C., 2017. Frequency-specific electrophysiologic correlates of resting state fMRI networks. Neuroimage 149, 446-457.

      Kucyi, A., Schrouff, J., Bickel, S., Foster, B.L., Shine, J.M., Parvizi, J., 2018. Intracranial Electrophysiology Reveals Reproducible Intrinsic Functional Connectivity within Human Brain Networks. J Neurosci 38, 4230-4242.

      Li, J.M., Acland, B.T., Brenner, A.S., Bentley, W.J., Snyder, L.H., 2022. Relationships between correlated spikes, oxygen and LFP in the resting-state primate. Neuroimage 247, 118728.

      Ma, Y., Shaik, M.A., Kozberg, M.G., Kim, S.H., Portes, J.P., Timerman, D., Hillman, E.M., 2016. Resting-state hemodynamics are spatiotemporally coupled to synchronized and symmetric neural activity in excitatory neurons. Proc Natl Acad Sci U S A 113, E8463-E8471.

      Ma, Z., Zhang, N., 2018. Temporal transitions of spontaneous brain activity. Elife 7.

      Shi, Z., Wu, R., Yang, P.F., Wang, F., Wu, T.L., Mishra, A., Chen, L.M., Gore, J.C., 2017. High spatial correspondence at a columnar level between activation and resting state fMRI signals and local field potentials. Proc Natl Acad Sci U S A 114, 5253-5258.

      Thompson, G.J., Pan, W.J., Magnuson, M.E., Jaeger, D., Keilholz, S.D., 2014. Quasiperiodic patterns (QPP): large-scale dynamics in resting state fMRI that correlate with local infraslow electrical activity. Neuroimage 84, 1018-1031.

      Uhlirova, H., Kilic, K., Tian, P., Thunemann, M., Desjardins, M., Saisan, P.A., Sakadzic, S., Ness, T.V., Mateo, C., Cheng, Q., Weldy, K.L., Razoux, F., Vandenberghe, M., Cremonesi, J.A., Ferri, C.G., Nizar, K., Sridhar, V.B., Steed, T.C., Abashin, M., Fainman, Y., Masliah, E., Djurovic, S., Andreassen, O.A., Silva, G.A., Boas, D.A., Kleinfeld, D., Buxton, R.B., Einevoll, G.T., Dale, A.M., Devor, A., 2016. Cell type specificity of neurovascular coupling in cerebral cortex. Elife 5.

      Vafaii, H., Mandino, F., Desrosiers-Gregoire, G., O'Connor, D., Markicevic, M., Shen, X., Ge, X., Herman, P., Hyder, F., Papademetris, X., Chakravarty, M., Crair, M.C., Constable, R.T., Lake, E.M.R., Pessoa, L., 2024. Multimodal measures of spontaneous brain activity reveal both common and divergent patterns of cortical functional organization. Nat Commun 15, 229.

    1. Author response:

      The following is the authors’ response to the original reviews.

      eLife assessment

      This important study provides solid evidence that both psychiatric dimensions (e.g. anhedonia, apathy, or depression) and chronotype (i.e., being a morning or evening person) influence effort-based decision-making. Notably, the current study does not elucidate whether there may be interactive effects of chronotype and psychiatric dimensions on decision-making. This work is of importance to researchers and clinicians alike, who may make inferences about behaviour and cognition without taking into account whether the individual may be tested or observed out-of-sync with their phenotype.

      We thank the three reviewers for their comments, and the Editors at eLife. We have taken the opportunity to revise our manuscript considerably from its original form, not least because we feel a number of the reviewers’ suggested analyses strengthen our manuscript considerably (in one instance even clarifying our conclusions, leading us to change our title)—for which we are very appreciative indeed. 

      Public Reviews:

      Reviewer #1 (Public Review):

      Summary:

      This study uses an online cognitive task to assess how reward and effort are integrated in a motivated decision-making task. In particular the authors were looking to explore how neuropsychiatric symptoms, in particular apathy and anhedonia, and circadian rhythms affect behavior in this task. Amongst many results, they found that choice bias (the degree to which integrated reward and effort affects decisions) is reduced in individuals with greater neuropsychiatric symptoms, and late chronotypes (being an 'evening person').

      Strengths:

      The authors recruited participants to perform the cognitive task both in and out of sync with their chronotypes, allowing for the important insight that individuals with late chronotypes show a more reduced choice bias when tested in the morning.

      Overall, this is a well-designed and controlled online experimental study. The modelling approach is robust, with care being taken to both perform and explain to the readers the various tests used to ensure the models allow the authors to sufficiently test their hypotheses.

      Weaknesses:

      This study was not designed to test the interactions of neuropsychiatric symptoms and chronotypes on decision making, and thus can only make preliminary suggestions regarding how symptoms, chronotypes and time-of-assessment interact.

      We appreciate the Reviewer’s positive view of our research and agree with their assessment of its weaknesses; the study was not designed to assess chronotype-mental health interactions. We hope that our new title and contextualisation makes this clearer. We respond in more detail point-by-point below.

      Reviewer #2 (Public Review):

      Summary:

      The study combines computational modeling of choice behavior with an economic, effort-based decision-making task to assess how willingness to exert physical effort for a reward varies as a function of individual differences in apathy and anhedonia, or depression, as well as chronotype. They find an overall reduction in effort selection that scales with apathy and anhedonia and depression. They also find that later chronotypes are less likely to choose effort than earlier chronotypes and, interestingly, an interaction whereby later chronotypes are especially unwilling to exert effort in the morning versus the evening.

      Strengths:

      This study uses state-of-the-art tools for model fitting and validation and regression methods which rule out multicollinearity among symptom measures and Bayesian methods which estimate effects and uncertainty about those estimates. The replication of results across two different kinds of samples is another strength. Finally, the study provides new information about the effects not only of chronotype but also chronotype by timepoint interactions which are previously unknown in the subfield of effort-based decision-making.

      Weaknesses:

      The study has few weaknesses. One potential concern is that the range of models which were tested was narrow, and other models might have been considered. For example, the Authors might have also tried to fit models with an overall inverse temperature parameter to capture decision noise. One reason for doing so is that some variance in the bias parameter might be attributed to noise, which was not modeled here. Another concern is that the manuscripts discuss effort-based choice as a transdiagnostic feature - and there is evidence in other studies that effort deficits are a transdiagnostic feature of multiple disorders. However, because the present study does not investigate multiple diagnostic categories, it doesn't provide evidence for transdiagnosticity, per se.

      We appreciate Reviewer 2’s assessment of our research and agree generally with its weaknesses. We have now addressed the Reviewer’s comments regarding transdiagnosticity in the discussion of our revised version and have addressed their detailed recommendations below (see point-by-point responses).

      In addition to the below specific changes, in our Discussion section, we now have also added the following (lines 538 – 540):

      “Finally, we would like to note that our study is based on a general population sample, rather than a clinical one. Hence, we cannot speak to transdiagnosticity on the level of multiple diagnostic categories.”

      Reviewer #3 (Public Review):

      Summary:

      In this manuscript, Mehrhof and Nord study a large dataset of participants collected online (n=958 after exclusions) who performed a simple effort-based choice task. They report that the level of effort and reward influence choices in a way that is expected from prior work. They then relate choice preferences to neuropsychiatric syndromes and, in a smaller sample (n<200), to people's circadian preferences, i.e., whether they are a morning-preferring or evening-preferring chronotype. They find relationships between the choice bias (a model parameter capturing the likelihood to accept effort-reward challenges, like an intercept) and anhedonia and apathy, as well as chronotype. People with higher anhedonia and apathy and an evening chronotype are less likely to accept challenges (more negative choice bias). People with an evening chronotype are also more reward sensitive and more likely to accept challenges in the evening, compared to the morning.

      Strengths:

      This is an interesting and well-written manuscript which replicates some known results and introduces a new consideration related to potential chronotype relationships which have not been explored before. It uses a large sample size and includes analyses related to transdiagnostic as well as diagnostic criteria. I have some suggestions for improvements.

      Weaknesses:

      (1) The novel findings in this manuscript are those pertaining to transdiagnostic and circadian phenotypes. The authors report two separate but "overlapping" effects: individuals high on anhedonia/apathy are less willing to accept offers in the task, and similarly, individuals tested off their chronotype are less willing to accept offers in the task. The authors claim that the latter has implications for studying the former. In other words, because individuals high on anhedonia/apathy predominantly have a late chronotype (but might be tested early in the day), they might accept fewer offers, which could spuriously look like a link between anhedonia/apathy and choices but might in fact be an effect of the interaction between chronotype and time-of-testing. The authors therefore argue that chronotype needs to be accounted for when studying links between depression and effort tasks.

      The authors argue that, if X is associated with Y and Z is associated with Y, X and Z might confound each other. That is possible, but not necessarily true. It would need to be tested explicitly by having X (anhedonia/apathy) and Z (chronotype) in the same regression model. Does the effect of anhedonia/apathy on choices disappear when accounting for chronotype (and time-of-testing)? Similarly, when adding the interaction between anhedonia/apathy, chronotype, and time-of-testing, within the subsample of people tested off their chronotype, is there a residual effect of anhedonia/apathy on choices or not?

      If the effect of anhedonia/apathy disappeared (or got weaker) while accounting for chronotype, this result would suggest that chronotype mediates the effect of anhedonia/apathy on effort choices. However, I am not sure it renders the direct effect of anhedonia/apathy on choices entirely spurious. Late chronotype might be a feature (induced by other symptoms) of depression (such as fatigue and insomnia), and the association between anhedonia/apathy and effort choices might be a true and meaningful one. For example, if the effect of anhedonia/apathy on effort choices was mediated by altered connectivity of the dorsal ACC, we would not say that ACC connectivity renders the link between depression and effort choices "spurious", but we would speak of a mechanism that explains this effect. The authors should discuss in a more nuanced way what a significant mediation by the chronotype/time-of-testing congruency means for interpreting effects of depression in computational psychiatry.

      We thank the Reviewer for pointing out this crucial weakness in the original version of our manuscript. We have now thought deeply about this and agree with the Reviewer that our original results did not warrant our interpretation that reported effects of anhedonia and apathy on measures of effort-based decision-making could potentially be spurious. At the Reviewer’s suggestion, we decided to test this explicitly in our revised version—a decision that has now deepened our understanding of our results, and changed our interpretation thereof.  

      To investigate how the effects of neuropsychiatric symptoms and the effects of circadian measures relate to each other, we have followed the Reviewer’s advice and conducted an additional series of analyses (see below). Surprisingly (to us, but perhaps not the Reviewer) we discovered that all three symptom measures (two of anhedonia, one of apathy) have effects on the decision to expend effort that are separable from those of circadian measures (note we have also re-named our key parameter ‘motivational tendency’ to address this Reviewer’s next comment that the term ‘choice bias’ was unclear). In model comparisons (based on the leave-one-out information criterion, which penalises for model complexity) the models including both circadian and psychiatric measures always win against the models including only circadian or only psychiatric measures. In essence, this strengthens our claims about the importance of measuring circadian rhythm in effort-based tasks generally, as circadian rhythm clearly plays an important role even when considering neuropsychiatric symptoms, but crucially does not support the idea of spurious effects: statistically, circadian measures contribute separably from neuropsychiatric symptoms to the variance in effort-based decision-making. We think this is very interesting indeed, and it certainly clarifies (and corrects the inaccuracy in) our original interpretation; we can only express our thanks to the Reviewer for helping us understand our effect more fully.

      In response to these new insights, we have made numerous edits to our manuscript. First, we changed the title from “Overlapping effects of neuropsychiatric symptoms and circadian rhythm on effort-based decision-making” to “Both neuropsychiatric symptoms and circadian rhythm alter effort-based decision-making”. In the remaining manuscript we now refrain from using the word ‘overlapping’ (which could be interpreted as overlapping in explained variance), and instead opted to describe the effects as parallel. We hope our new analyses, title, and clarified/improved interpretations together address the Reviewer’s valid concern about our manuscript’s main weakness.

      We detail these new analyses in the Methods section as follows (lines 800 – 814):

      “4.5.2. Differentiating between the effects of neuropsychiatric symptoms and circadian measures on motivational tendency

      To investigate how the effects of neuropsychiatric symptoms on motivational tendency (2.3.1) relate to effects of chronotype and time-of-day on motivational tendency we conducted exploratory analyses. In the subsamples of participants with an early or late chronotype (including additionally collected data), we first ran Bayesian GLMs with neuropsychiatric questionnaire scores (SHAPS, DARS, AES respectively) predicting motivational tendency, controlling for age and gender. We next added an interaction term of chronotype and time-of-day into the GLMs, testing how this changes previously observed neuropsychiatric and circadian effects on motivational tendency. Finally, we conducted a model comparison using LOO, comparing between motivational tendency predicted by a neuropsychiatric questionnaire, motivational tendency predicted by chronotype and time-of-day, and motivational tendency predicted by a neuropsychiatric questionnaire and time-of-day (for each neuropsychiatric questionnaire, and controlling for age and gender).”
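
      The logic of this model comparison can be sketched in a few lines. The example below is an illustration under stated assumptions only: it uses simulated data with arbitrary effect sizes, ordinary least squares in place of the Bayesian GLMs, and exact leave-one-out mean squared error in place of LOOIC. All function and variable names are hypothetical.

```python
import random


def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [rhs] for row, rhs in zip(A, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (M[i][n] - tail) / M[i][i]
    return x


def ols(X, y):
    """Least-squares coefficients from the normal equations X'X b = X'y."""
    k = len(X[0])
    xtx = [[sum(r[i] * r[j] for r in X) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * yi for r, yi in zip(X, y)) for i in range(k)]
    return solve(xtx, xty)


def loo_mse(X, y):
    """Exact leave-one-out mean squared prediction error (lower is better)."""
    total = 0.0
    for i in range(len(y)):
        coef = ols(X[:i] + X[i + 1:], y[:i] + y[i + 1:])
        pred = sum(c * v for c, v in zip(coef, X[i]))
        total += (y[i] - pred) ** 2
    return total / len(y)


def simulate(n=200, seed=0):
    """Simulated participants; the effect sizes are arbitrary assumptions."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        symptom = rng.gauss(0, 1)                # standardised questionnaire score
        chrono = rng.choice([-1.0, 1.0])         # early vs late chronotype
        time_of_day = rng.choice([-1.0, 1.0])    # morning vs evening testing
        y = (-0.3 * symptom - 0.1 * chrono
             + 0.4 * chrono * time_of_day + rng.gauss(0, 0.5))
        rows.append((symptom, chrono, time_of_day, y))
    return rows


def design(rows, use_symptom, use_circadian):
    """Build a design matrix with an intercept and the requested predictors."""
    X = []
    for symptom, chrono, time_of_day, _ in rows:
        x = [1.0]
        if use_symptom:
            x.append(symptom)
        if use_circadian:
            x += [chrono, time_of_day, chrono * time_of_day]
        X.append(x)
    return X


rows = simulate()
y = [r[3] for r in rows]
for label, s, c in [("symptom only", True, False),
                    ("circadian only", False, True),
                    ("symptom + circadian", True, True)]:
    print(label, round(loo_mse(design(rows, s, c), y), 3))
```

      In this toy setting the combined model predicts held-out participants best because both predictors were simulated with independent effects; whether the same holds in real data is an empirical question.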

      Results of the outlined analyses are reported in the results section as follows (lines 356 – 383):

      “2.5.2.1 Neuropsychiatric symptoms and circadian measures have separable effects on motivational tendency

      Exploratory analyses testing for the effects of neuropsychiatric questionnaires on motivational tendency in the subsamples of early and late chronotypes confirmed the predictive value of the SHAPS (M=-0.24, 95% HDI=[-0.42,-0.06]), the DARS (M=-0.16, 95% HDI=[-0.31,-0.01]), and the AES (M=-0.18, 95% HDI=[-0.32,-0.02]) on motivational tendency.

      For the SHAPS, we find that when adding the measures of chronotype and time-of-day back into the GLMs, the main effect of the SHAPS (M=-0.26, 95% HDI=[-0.43,-0.07]), the main effect of chronotype (M=-0.11, 95% HDI=[-0.22,-0.01]), and the interaction effect of chronotype and time-of-day (M=0.20, 95% HDI=[0.07,0.34]) on motivational tendency remain. Model comparison by LOOIC reveals motivational tendency is best predicted by the model including the SHAPS, chronotype and time-of-day as predictors, followed by the model including only the SHAPS. Note that this approach to model comparison penalizes models for increasing complexity.

      Repeating these steps with the DARS, the main effect of the DARS is found numerically, but the 95% HDI just includes 0 (M=-0.15, 95% HDI=[-0.30,0.002]). The main effect of chronotype (M=-0.11, 95% HDI=[-0.21,-0.01]), and the interaction effect of chronotype and time-of-day (M=0.18, 95% HDI=[0.05,0.33]) on motivational tendency remain. Model comparison identifies the model including the DARS and circadian measures as the best model, followed by the model including only the DARS.

      For the AES, the main effect of the AES is found (M=-0.19, 95% HDI=[-0.35,-0.04]). For the main effect of chronotype, the 95% HDI narrowly includes 0 (M=-0.10, 95% HDI=[-0.21,0.002]), while the interaction effect of chronotype and time-of-day (M=0.20, 95% HDI=[0.07,0.34]) on motivational tendency remains. Model comparison identifies the model including the AES and circadian measures as the best model, followed by the model including only the AES.”

      We have now edited parts of our Discussion to discuss and reflect these new insights, including the following.

      Lines 399 – 402:

      “Various neuropsychiatric disorders are marked by disruptions in circadian rhythm, such as a late chronotype. However, research has rarely investigated how transdiagnostic mechanisms underlying neuropsychiatric conditions may relate to inter-individual differences in circadian rhythm.”

      Lines 475 – 480:

      “It is striking that the effects of neuropsychiatric symptoms on effort-based decision-making largely are paralleled by circadian effects on the same neurocomputational parameter. Exploratory analyses predicting motivational tendency by neuropsychiatric symptoms and circadian measures simultaneously indicate the effects go beyond recapitulating each other, but rather explain separable parts of the variance in motivational tendency.”

      Lines 528 – 532:

      “Our reported analyses investigating neuropsychiatric and circadian effects on effort-based decision-making simultaneously are exploratory, as our study design was not ideally set out to examine this. Further work is needed to disentangle separable effects of neuropsychiatric and circadian measures on effort-based decision-making.”

      Lines 543 – 550:

      “We demonstrate that neuropsychiatric effects on effort-based decision-making are paralleled by effects of circadian rhythm and time-of-day. Exploratory analyses suggest these effects account for separable parts of the variance in effort-based decision-making. It is unlikely that the neuropsychiatric effects on effort-based decision-making reported here and in previous literature are a spurious result due to multicollinearity with chronotype. Yet, not accounting for chronotype and time of testing, which is the predominant practice in the field, could affect results.”

      (2) It seems that all key results relate to the choice bias in the model (as opposed to reward or effort sensitivity). It would therefore be helpful to understand what fundamental process the choice bias is really capturing in this task. This is not discussed, and the direction of effects is not discussed either, but potentially quite important. It seems that the choice bias captures how many effortful reward challenges are accepted overall which maybe captures general motivation or task engagement. Maybe it is then quite expected that this could be linked with questionnaires measuring general motivation/pleasure/task engagement. Formally, the choice bias is the constant term or intercept in the model for p(accept), but the authors never comment on what its sign means. If I'm not mistaken, people with higher anhedonia but also higher apathy are less likely to accept challenges and thus engage in the task (more negative choice bias). I could not find any discussion or even mention of what these results mean. This similarly pertains to the results on chronotype. In general, "choice bias" may not be the most intuitive term and the authors may want to consider renaming it. Also, given the sign of what the choice bias means could be flipped with a simple sign flip in the model equation (i.e., equating to accepting more vs accepting fewer offers), it would be helpful to show some basic plots to illustrate the identified differences (e.g., plotting the % accepted for people in the upper and lower tertile for the SHAPS score etc).

      We apologise that this was not made clear previously: the meaning and directionality of “choice bias” is indeed central to our results. We also thank the Reviewer for pointing out that the previously used term “choice bias” itself might not be intuitive. We have now changed this to ‘motivational tendency’ (see below) as well as added substantial details on this parameter to the manuscript, including additional explanations and visualisations of the model as suggested by the Reviewer (new Figure 3) and model-agnostic results to aid interpretation (new Figure S3). Note the latter is complex due to our staircasing procedure (see new figure panel D further detailing our staircasing procedure in Figure 2). This shows that participants with more pronounced anhedonia are less likely to accept offers than those with low anhedonia (Fig. S3A), a model-agnostic version of our central result.

      Our changes are detailed below:

      After careful evaluation we have decided to term the parameter “motivational tendency”, hoping that this will present a more intuitive description of the parameter.

      To aid with the understanding and interpretation of the model parameters, and motivational tendency in particular, we have added the following explanation to the main text:

      Lines 149 – 155:

      “The models posit that efforts and rewards are joined into a subjective value (SV), weighted by individual effort and reward sensitivity parameters. The subjective value is then integrated with an individual motivational tendency (a) parameter to guide decision-making. Specifically, the motivational tendency parameter determines the range at which subjective values are translated to acceptance probabilities: the same subjective value will translate to a higher acceptance probability the higher the motivational tendency.”

      Further, we have included a new figure visualizing the model. This demonstrates how the different model parameters contribute to the model (A), and how different values of each parameter affect the model (B-D).
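
      To make the role of the motivational tendency parameter concrete, here is a minimal sketch. It assumes a linear combination of reward and effort and a logistic link with motivational tendency as the intercept; the manuscript compares several model variants, so this form and all names below are illustrative assumptions rather than the fitted model.

```python
import math


def subjective_value(reward, effort, beta_reward, beta_effort):
    """Assumed linear form: reward adds value, effort subtracts it."""
    return beta_reward * reward - beta_effort * effort


def p_accept(reward, effort, beta_reward, beta_effort, motivational_tendency):
    """Probability of accepting an offer under the assumed logistic link."""
    z = motivational_tendency + subjective_value(
        reward, effort, beta_reward, beta_effort)
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


# Same offer and sensitivities, different motivational tendency.
low = p_accept(2, 3, 0.8, 0.6, motivational_tendency=-1.0)
high = p_accept(2, 3, 0.8, 0.6, motivational_tendency=1.0)
print(f"low tendency: {low:.3f}, high tendency: {high:.3f}")
```

      Under these assumptions an identical subjective value yields a higher acceptance probability when motivational tendency is higher, which is the property described in the quoted passage.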

      We agree that plotting model agnostic effects in our data may help the reader gain intuition of what our task results mean. We hope to address this with our added section on “Model agnostic task measures relating to questionnaires”. We first followed the reviewer’s suggestion of extracting subsamples with higher and low anhedonia (as measured with the SHAPS, highest and lowest quantile) and plotted the acceptance proportion across effort and reward levels (panel A in figure below). However, due to our implemented task design, this only shows part of the picture: the staircasing procedure individualises which effort-reward combination a participant is presented with. Therefore, group differences in choice behaviour will lead to differences in the development of the staircases implemented in our task. Thus, we plotted the count of offered effort-reward combinations for the subsamples of participants with high vs. low SHAPS scores by the end of the task, averaged across staircases and participants.

      As the aspect of task development due to the implemented staircasing may not have been explained sufficiently in the main text, we have included panel (D) in figure 2.

      Further, we have added the following figure reference to the main text (lines 189 – 193):

      “The development of offered effort and reward levels across trials is shown in figure 2D; this shows that as participants generally tend to accept challenges rather than reject them, the implemented staircasing procedure develops toward higher effort and lower reward challenges.”

      To statistically test effects of model-agnostic task measures on the neuropsychiatric questionnaires, we performed Bayesian GLMs with the proportion of accepted trials predicted by SHAPS and AES. This is reported in the text as follows.

      Supplement, lines 172 – 189:

      “To explore the relationship between model-agnostic task measures and questionnaire measures of neuropsychiatric symptoms, we conducted Bayesian GLMs, with the proportion of accepted trials predicted by SHAPS scores, controlling for age and gender. The proportion of accepted trials averaged across effort and reward levels was predicted by the Snaith-Hamilton Pleasure Scale (SHAPS) sum scores (M=-0.07; 95%HDI=[-0.12,-0.03]) and the Apathy Evaluation Scale (AES) sum scores (M=-0.05; 95%HDI=[-0.10,-0.002]). Note that this was not driven only by higher effort levels; even confining data to the lowest two effort levels, SHAPS has a predictive value for the proportion of accepted trials: M=-0.05; 95%HDI=[-0.07,-0.02].

      A visualisation of model-agnostic task measures relating to symptoms is given in Fig. S4, comparing subgroups of participants scoring in the highest and lowest quartile on the SHAPS. This shows that participants with a high SHAPS score (i.e., more pronounced anhedonia) are less likely to accept offers than those with a low SHAPS score (Fig. S4A). Due to the implemented staircasing procedure, group differences can also be seen in the effort-reward combinations offered per trial. While for both groups, the staircasing procedure seems to devolve towards high effort – low reward offers, this is more pronounced in the subgroup of participants with a lower SHAPS score (Fig S4B).”
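
      The subgroup comparison described above can be sketched as follows. This is a hedged illustration only: the function name is hypothetical, quartile cut points come from Python's default `statistics.quantiles` method, and participants exactly on a cut point are assumed to belong to the extreme group.

```python
import statistics


def extreme_quartile_means(scores, accept_props):
    """Mean acceptance proportion in the lowest and highest score quartiles.

    scores: one questionnaire sum score per participant.
    accept_props: each participant's proportion of accepted offers.
    """
    q1, _, q3 = statistics.quantiles(scores, n=4)
    low = [p for s, p in zip(scores, accept_props) if s <= q1]
    high = [p for s, p in zip(scores, accept_props) if s >= q3]
    return statistics.mean(low), statistics.mean(high)


# Toy data: higher scores paired with lower acceptance, for illustration.
scores = [1, 2, 3, 4, 5, 6, 7, 8]
accept_props = [0.9, 0.8, 0.7, 0.7, 0.6, 0.6, 0.5, 0.4]
low_mean, high_mean = extreme_quartile_means(scores, accept_props)
print(f"lowest quartile: {low_mean:.2f}, highest quartile: {high_mean:.2f}")
```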

      (3) None of the key effects relate to effort or reward sensitivity which is somewhat surprising given the previous literature and also means that it is hard to know if choice bias results would be equally found in tasks without any effort component. (The only analysis related to effort sensitivity is exploratory and in a subsample of N=56 per group looking at people meeting criteria for MDD vs matched controls.) Were stimuli constructed such that effort and reward sensitivity could be separated (i.e., are uncorrelated/orthogonal)? Maybe it would be worth looking at the % accepted in the largest or two largest effort value bins in an exploratory analysis. It seems the lowest and 2nd lowest effort level generally lead to accepting the challenge pretty much all the time, so including those effort levels might not be sensitive to individual difference analyses?

      We too were initially surprised by the lack of effect of neuropsychiatric symptoms on reward and effort sensitivity. To address the Reviewer’s first comment, the nature of the ‘choice bias’ parameter (now motivational tendency) is its critical importance in the context of effort-based decision-making: it is not modelled or measured explicitly in tasks without effort (such as typical reward tasks), so it would be impossible to test this in tasks without an effort component. 

      For the Reviewer’s second comment, the exploratory MDD analysis is not our only one related to effort sensitivity: the effort sensitivity parameter is included in all of our central analyses, and (like reward sensitivity), does not relate to our measured neuropsychiatric symptoms (e.g., see page 15). Note most previous effort tasks do not include a ‘choice bias’/motivational tendency parameter, potentially explaining this discrepancy. However, our model was quantitatively superior to models without this parameter, for example with only effort- and reward-sensitivity (page 11, Fig. 3).

      Our three model parameters (reward sensitivity, effort sensitivity, and choice bias/motivational tendency) were indeed uncorrelated/orthogonal to one another (see parameter orthogonality analyses below), making it unlikely that the variance and effect captured by our motivational tendency parameter (previously termed “choice bias”) should really be attributed to reward sensitivity. As per the Reviewer’s suggestion, we also examined whether the lowest two effort levels might not be sensitive to individual differences; in fact, we found that the proportion of accepted trials on the lowest effort levels alone was nevertheless predicted by anhedonia (see ceiling effect analyses below).

      Specifically, in terms of parameter orthogonality:

      When developing our task design and computational modelling approach we were careful to ensure that meaningful neurocomputational parameters could be estimated and that no spurious correlations between parameters would be introduced by modelling. By conducting parameter recoveries for all models, we showed that our modelling approach could reliably estimate parameters, and that estimated parameters are orthogonal to the other underlying parameters (as can be seen in Figure S1 in the supplement). It is thus unlikely that the variance and effect captured by our motivational tendency parameter (previously termed “choice bias”) should really be attributed to reward sensitivity.

      And finally, regarding the possibility of a ceiling effect for low effort levels:

      We agree that visual inspection of the proportion of accepted results across effort and reward values can lead to the belief that a ceiling effect prevents the two lowest effort levels from capturing any inter-individual differences. To test whether this is the case, we ran a Bayesian GLM with the SHAPS sum score predicting the proportion of accepted trials (controlling for age and gender), in a subset of the data including only trials with an effort level of 1 or 2. We found the SHAPS has a predictive value for the proportion of accepted trials in the lowest two effort levels: M=-0.05; 95%HDI=[-0.07,-0.02]. This is noted in the text as follows.

      Supplement, lines 175 – 180:

      “The proportion of accepted trials averaged across effort and reward levels was predicted by the Snaith-Hamilton Pleasure Scale (SHAPS) sum scores (M=-0.07; 95%HDI=[-0.12,-0.03]) and the Apathy Evaluation Scale (AES) sum scores (M=-0.05; 95%HDI=[-0.10,-0.002]). Note that this was not driven only by higher effort levels; even confining data to the lowest two effort levels, SHAPS has a predictive value for the proportion of accepted trials: M=-0.05; 95%HDI=[-0.07,-0.02].”

      (4) The abstract and discussion seem overstated (implications for the school system and statements on circadian rhythms which were not measured here). They should be toned down to reflect conclusions supported by the data.

      We thank the Reviewer for pointing this out, and have now removed these claims from the abstract and Discussion; we hope they now better reflect conclusions supported by these data directly.

      Recommendations for the authors:

      Reviewer #1 (Recommendations For The Authors):

      (1) Suggestions for improved or additional experiments, data or analyses.

      - For a non-computational audience, it would be useful to unpack the influence of the choice bias on behavior, as it is less clear how this would affect decision-making than sensitivity to effort or reward. Perhaps a figure showing accept/reject decisions when sensitivities are held and choice bias is high would be beneficial.

      We thank the Reviewer for suggesting additional explanations of the choice bias parameter to aid interpretation for non-computational readers; as per the Reviewer’s suggestion, we have now included additional explanations and visualisations (Figure 3) to make this as clear as possible. Please note also that, in response to one of the other Reviewers and after careful considerations, we have decided to rename the “choice bias” parameter to “motivational tendency”, hoping this will prove more intuitive.

      To aid with the understanding and interpretation of this and the other model parameters, we have added the following explanation to the main text.

      Lines 149 – 155:

      “The models posit efforts and rewards are joined into a subjective value (SV), weighed by individual effort and reward sensitivity parameters. The subjective value is then integrated with an individual motivational tendency (α) parameter to guide decision-making. Specifically, the motivational tendency parameter determines the range at which subjective values are translated to acceptance probabilities: the same subjective value will translate to a higher acceptance probability the higher the motivational tendency.”

      Additionally, we add the following explanation to the Methods section.

      Lines 698 – 709:

      First, a cost function transforms costs and rewards associated with an action into a subjective value (SV):

      with and for reward and effort sensitivity, and ℛ and 𝐸 for reward and effort. Higher effort and reward sensitivity mean the SV is more strongly influenced by changes in effort and reward, respectively (Fig. 3B-C). Hence, low effort and reward sensitivity mean the SV, and with that decision-making, is less guided by effort and reward offers, as would be in random decision-making.

      This SV is then transformed to an acceptance probability by a softmax function:

      with for the predicted acceptance probability and 𝛼 for the intercept representing motivational tendency. A high motivational tendency means a subject has a tendency, or bias, to accept rather than reject offers (Fig. 3D).

      Our new figure (panels A-D in Figure 3) visualizes the model. This demonstrates how the different model parameters come into play in the model (A), and how different values on each parameter affect the model (B-D).
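The two-step structure described above (a cost function producing an SV, then a softmax producing an acceptance probability) can be illustrated with a short sketch. This is not the authors' implementation: the exact cost function is not reproduced in this response, so a linear form is assumed, the two-option softmax is written as a logistic function, and the names `beta_r`, `beta_e`, and `alpha` are illustrative labels for reward sensitivity, effort sensitivity, and motivational tendency.

```python
import math


def subjective_value(reward, effort, beta_r, beta_e):
    """Combine reward and effort into a subjective value (SV).

    Assumption: a linear cost function; the models in the manuscript
    may use a different discounting shape.
    """
    return beta_r * reward - beta_e * effort


def p_accept(reward, effort, beta_r, beta_e, alpha):
    """Map SV to an acceptance probability with a logistic rule.

    alpha is the intercept (motivational tendency): it shifts the
    acceptance probability up or down for any given SV.
    """
    sv = subjective_value(reward, effort, beta_r, beta_e)
    return 1.0 / (1.0 + math.exp(-(alpha + sv)))


if __name__ == "__main__":
    # Same offer and sensitivities, three hypothetical motivational tendencies.
    for alpha in (-1.0, 0.0, 1.0):
        p = p_accept(reward=3, effort=2, beta_r=0.5, beta_e=0.5, alpha=alpha)
        print(f"alpha={alpha:+.1f} -> p(accept)={p:.3f}")
```

Under these assumptions, setting both sensitivities to zero makes the acceptance probability identical for every offer and determined by `alpha` alone, which is one way to read the statement that low sensitivities correspond to decisions that are less guided by the offers.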

      - The early and late chronotype groups have significant differences in ages and gender. Additional supplementary analysis here may mitigate any concerns from readers.

      The Reviewer is right to notice that our subsamples of early and late chronotypes differ significantly in age and gender, but it is important to note that all our analyses comparing these two groups take this into account, statistically controlling for age and gender. We regret that this was previously only mentioned in the Methods section, so this information was not accessible where most relevant. To remedy this, we have amended the Results section as follows.

      Lines 317 – 323:

      “Bayesian GLMs, controlling for age and gender, predicting task parameters by time-of-day and chronotype showed effects of chronotype on reward sensitivity (i.e. those with a late chronotype had a higher reward sensitivity; M= 0.325, 95% HDI=[0.19,0.46]) and motivational tendency (higher in early chronotypes; M=-0.248, 95% HDI=[-0.37,-0.11]), as well as an interaction between chronotype and time-of-day on motivational tendency (M=0.309, 95% HDI=[0.15,0.48]).”

      (2) Recommendations for improving the writing and presentation.

      - I found the term 'overlapping' a little jarring. I think the authors use it to mean both neuropsychiatric symptoms and chronotypes affect task parameters, but they are not tested to be 'separable', nor is an interaction tested. Perhaps being upfront about how interactions are not being tested here (in the introduction, and not waiting until the discussion) would give an opportunity to operationalize this term.

      We agree with the Reviewer that our previously-used term “overlapping” was not ideal: it may have been misleading, and was not necessarily reflective of the nature of our findings. We now state explicitly that we are not testing an interaction between neuropsychiatric symptoms and chronotypes in our primary analyses. Additionally, following suggestions made by Reviewer 3, we ran new exploratory analyses to investigate how the effects of neuropsychiatric symptoms and circadian measures on motivational tendency relate to one another. These results in fact show that all three symptom measures have separable effects from circadian measures on motivational tendency. This supports the Reviewer’s view that ‘overlapping’ was entirely the wrong word—although it nevertheless shows the important contribution of circadian rhythm as well as neuropsychiatric symptoms in effort-based decision-making. We have changed the manuscript throughout to better describe this important, more accurate interpretation of our findings, including replacing the term “overlapping”. We changed the title from “Overlapping effects of neuropsychiatric symptoms and circadian rhythm on effort-based decision-making” to “Both neuropsychiatric symptoms and circadian rhythm alter effort-based decision-making”.

      To clarify the intention of our primary analyses, we have added the following to the last paragraph of the introduction.

      Lines 107 – 112:

      “Next, we pre-registered a follow-up experiment to directly investigate how circadian preference interacts with time-of-day on motivational decision-making, using the same task and computational modelling approach. While this allows us to test how circadian effects on motivational decision-making compare to neuropsychiatric effects, we do not test for possible interactions between neuropsychiatric symptoms and chronobiology.”

      We detail our new analyses in the Methods section as follows.

      Lines 800 – 814:

      “4.5.2 Differentiating between the effects of neuropsychiatric symptoms and circadian measures on motivational tendency

      To investigate how the effects of neuropsychiatric symptoms on motivational tendency (2.3.1) relate to effects of chronotype and time-of-day on motivational tendency we conducted exploratory analyses. In the subsamples of participants with an early or late chronotype (including additionally collected data), we first ran Bayesian GLMs with neuropsychiatric questionnaire scores (SHAPS, DARS, AES respectively) predicting motivational tendency, controlling for age and gender. We next added an interaction term of chronotype and time-of-day into the GLMs, testing how this changes previously observed neuropsychiatric and circadian effects on motivational tendency. Finally, we conducted a model comparison using LOO, comparing between motivational tendency predicted by a neuropsychiatric questionnaire, motivational tendency predicted by chronotype and time-of-day, and motivational tendency predicted by a neuropsychiatric questionnaire and time-of-day (for each neuropsychiatric questionnaire, and controlling for age and gender).”

      Results of the outlined analyses are reported in the Results section as follows.

      Lines 356 – 383:

      “2.5.2.1 Neuropsychiatric symptoms and circadian measures have separable effects on motivational tendency

      Exploratory analyses testing for the effects of neuropsychiatric questionnaires on motivational tendency in the subsamples of early and late chronotypes confirmed the predictive value of the SHAPS (M=-0.24, 95% HDI=[-0.42,-0.06]), the DARS (M=-0.16, 95% HDI=[-0.31,-0.01]), and the AES (M=-0.18, 95% HDI=[-0.32,-0.02]) on motivational tendency.

      For the SHAPS, we find that when adding the measures of chronotype and time-of-day back into the GLMs, the main effect of the SHAPS (M=-0.26, 95% HDI=[-0.43,-0.07]), the main effect of chronotype (M=-0.11, 95% HDI=[-0.22,-0.01]), and the interaction effect of chronotype and time-of-day (M=0.20, 95% HDI=[0.07,0.34]) on motivational tendency remain. Model comparison by LOOIC reveals motivational tendency is best predicted by the model including the SHAPS, chronotype and time-of-day as predictors, followed by the model including only the SHAPS. Note that this approach to model comparison penalizes models for increasing complexity.

      Repeating these steps with the DARS, the main effect of the DARS is found numerically, but the 95% HDI just includes 0 (M=-0.15, 95% HDI=[-0.30,0.002]). The main effect of chronotype (M=-0.11, 95% HDI=[-0.21,-0.01]), and the interaction effect of chronotype and time-of-day (M=0.18, 95% HDI=[0.05,0.33]) on motivational tendency remain. Model comparison identifies the model including the DARS and circadian measures as the best model, followed by the model including only the DARS.

      For the AES, the main effect of the AES is found (M=-0.19, 95% HDI=[-0.35,-0.04]). For the main effect of chronotype, the 95% HDI narrowly includes 0 (M=-0.10, 95% HDI=[-0.21,0.002]), while the interaction effect of chronotype and time-of-day (M=0.20, 95% HDI=[0.07,0.34]) on motivational tendency remains. Model comparison identifies the model including the AES and circadian measures as the best model, followed by the model including only the AES.”

      In addition to the title change, we edited our Discussion to discuss and reflect these new insights, including the following.

      Lines 399 – 402:

      “Various neuropsychiatric disorders are marked by disruptions in circadian rhythm, such as a late chronotype. However, research has rarely investigated how transdiagnostic mechanisms underlying neuropsychiatric conditions may relate to inter-individual differences in circadian rhythm.”

      Lines 475 – 480:

      “It is striking that the effects of neuropsychiatric symptoms on effort-based decision-making largely are paralleled by circadian effects on the same neurocomputational parameter. Exploratory analyses predicting motivational tendency by neuropsychiatric symptoms and circadian measures simultaneously indicate the effects go beyond recapitulating each other, but rather explain separable parts of the variance in motivational tendency.”

      Lines 528 – 532:

      “Our reported analyses investigating neuropsychiatric and circadian effects on effort-based decision-making simultaneously are exploratory, as our study design was not ideally set out to examine this. Further work is needed to disentangle separable effects of neuropsychiatric and circadian measures on effort-based decision-making.”

      Lines 543 – 550:

      “We demonstrate that neuropsychiatric effects on effort-based decision-making are paralleled by effects of circadian rhythm and time-of-day. Exploratory analyses suggest these effects account for separable parts of the variance in effort-based decision-making. It is unlikely that neuropsychiatric effects on effort-based decision-making reported here and in previous literature are a spurious result due to multicollinearity with chronotype. Yet, not accounting for chronotype and time of testing, which is the predominant practice in the field, could affect results.”

      - A minor point, but it could be made clearer that many neurotransmitters have circadian rhythms (and not just dopamine).

      We agree this should have been made clearer, and have added the following to the Introduction.

      Lines 83 – 84:

      “Bi-directional links between chronobiology and several neurotransmitter systems have been reported, including dopamine47.

      (47) Kiehn, J.-T., Faltraco, F., Palm, D., Thome, J. & Oster, H. Circadian Clocks in the Regulation of Neurotransmitter Systems. Pharmacopsychiatry 56, 108–117 (2023).”

      - Making reference to other studies which have explored circadian rhythms in cognitive tasks would allow interested readers to explore the broader field. One such paper is: Bedder, R. L., Vaghi, M. M., Dolan, R. J., & Rutledge, R. B. (2023). Risk taking for potential losses but not gains increases with time of day. Scientific reports, 13(1), 5534, which also includes references to other similar studies in the discussion.

      We thank the Reviewer for pointing out that we failed to cite this relevant work. We have now included it in the Introduction as follows.

      Lines 97 – 98:

      “A circadian effect on decision-making under risk is reported, with the sensitivity to losses decreasing with time-of-day66.

      (66) Bedder, R. L., Vaghi, M. M., Dolan, R. J. & Rutledge, R. B. Risk taking for potential losses but not gains increases with time of day. Sci Rep 13, 5534 (2023).”

      (3) Minor corrections to the text and figures.

      None, clearly written and structured. Figures are high quality and significantly aid understanding.

      Reviewer #2 (Recommendations For The Authors):

      I did have a few more minor comments:

      - The manuscript doesn't clarify whether trials had time limits - so that participants might fail to earn points - or instead they did not and participants had to continue exerting effort until they were done. This is important to know since it impacts on decision-strategies and behavioral outcomes that might be analyzed. For example, if there is no time limit, it might be useful to examine the amount of time it took participants to complete their effort - and whether that had any relationship to choice patterns or symptomatology. Or, if they did, it might be interesting to test whether the relationship between choices and exerted effort depended on symptoms. For example, someone with depression might be less willing to choose effort, but just as, if not more likely to successfully complete a trial once it is selected.

      We thank the Reviewer for pointing out this important detail in the task design, which we should have made clearer. The trials did indeed have a time limit which was dependent on the effort level. To clarify this in the manuscript, we have made changes to Figure 2 and the Methods section. We agree it would be interesting to explore whether the exerted effort in the task related to symptoms. We explored this in our data by predicting the participant average proportion of accepted but failed trials by SHAPS score (controlling for age and gender). We found no relationship: M=0.01, 95% HDI=[-0.001,0.02]. However, it should be noted that the measure of proportion of failed trials may not be suitable here, as there are only few accepted but failed trials (M = 1.3% trials failed, SD = 3.50). This results from several task design characteristics aimed at preventing subjects from failing accepted trials, to avoid confounding of effort discounting with risk discounting. As an alternative measure, we explored the extent to which participants went “above and beyond” the target in accepted trials. Specifically, considering only accepted and succeeded trials, we computed the factor by which the required number of clicks was exceeded (i.e., if a subject clicked 15 times when 10 clicks were required the factor would be 1.5), averaging across effort and reward level. We then conducted a Bayesian GLM to test whether this subject-wise click-exceedance measure can be predicted by apathy or anhedonia, controlling for age and gender. We found neither the SHAPS (M=-0.14, 95% HDI=[-0.43,0.17]) nor the AES (M=0.07, 95% HDI=[-0.26,0.41]) had a predictive value for the amount to which subjects exert “extra effort”. We have now added this to the manuscript.

      In Figure 2, which explains the task design in the results section, we have added the following to the figure description.

      Lines 161 – 165:

      “Each trial consists of an offer with a reward (2,3,4, or 5 points) and an effort level (1,2,3, or 4, scaled to the required clicking speed and time the clicking must be sustained for) that subjects accept or reject. If accepted, a challenge at the respective effort level must be fulfilled for the required time to win the points.”

      In the Methods section, we have added the following.

      Lines 617 – 622:

      “We used four effort-levels, corresponding to a clicking speed at 30% of a participant’s maximal capacity for 8 seconds (level 1), 50% for 11 seconds (level 2), 70% for 14 seconds (level 3), and 90% for 17 seconds (level 4). Therefore, in each trial, participants had to fulfil a certain number of mouse clicks (dependent on their capacity and the effort level) in a specific time (dependent on the effort level).”
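The mapping from effort level to a concrete click target can be sketched as follows. The fractions and durations are taken from the passage above; everything else is an assumption made for illustration: capacity is expressed in clicks per second, the target is rounded up to a whole click, and the names `EFFORT_LEVELS` and `required_clicks` are hypothetical.

```python
import math

# Effort levels from the Methods excerpt:
# level -> (fraction of maximal capacity, duration in seconds)
EFFORT_LEVELS = {
    1: (0.30, 8),
    2: (0.50, 11),
    3: (0.70, 14),
    4: (0.90, 17),
}


def required_clicks(max_clicks_per_second, effort_level):
    """Return (clicks, seconds) needed to complete a challenge.

    Assumptions: capacity is measured in clicks per second and the
    target is rounded up to a whole click; neither detail is stated
    in the text.
    """
    fraction, seconds = EFFORT_LEVELS[effort_level]
    raw = fraction * max_clicks_per_second * seconds
    # Round first so floating-point noise (e.g. 98.00000000000001)
    # does not push the ceiling up by one click.
    return math.ceil(round(raw, 9)), seconds


if __name__ == "__main__":
    for level in EFFORT_LEVELS:
        clicks, seconds = required_clicks(10, level)
        print(f"level {level}: {clicks} clicks in {seconds} s")
```

Because both the required speed and the duration grow with effort level, the click target rises faster than either quantity alone in this sketch.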

      In the Supplement, we have added the additional analyses suggested by the Reviewer.

      Lines 195 – 213:

      “3.2 Proportion of accepted but failed trials

      For each participant, we computed the proportion of trials in which an offer was accepted, but the required effort was then not fulfilled (i.e., failed trials). There was no relationship between average proportion of accepted but failed trials and SHAPS score (controlling for age and gender): M=0.01, 95% HDI=[-0.001,0.02]. However, there are intentionally few accepted but failed trials (M = 1.3% trials failed, SD = 3.50). This results from several task design characteristics aimed at preventing subjects from failing accepted trials, to avoid confounding of effort discounting with risk discounting.”

      “3.3 Exertion of “extra effort”

      We also explored the extent to which participants went “above and beyond” the target in accepted trials. Specifically, considering only accepted and succeeded trials, we computed the factor by which the required number of clicks was exceeded (i.e., if a subject clicked 15 times when 10 clicks were required the factor would be 1.5), averaging across effort and reward level. We then conducted a Bayesian GLM to test whether this subject-wise click-exceedance measure can be predicted by apathy or anhedonia, controlling for age and gender. We found neither the SHAPS (M=-0.14, 95% HDI=[-0.43,0.17]) nor the AES (M=0.07, 95% HDI=[-0.26,0.41]) had a predictive value for the amount to which subjects exert “extra effort”.”
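A minimal sketch of the click-exceedance measure, assuming the factor is the ratio of clicks made to clicks required and that the subject-wise value is a plain mean over qualifying trials (the text says the measure is averaged across effort and reward levels, without giving the exact averaging scheme). The trial keys used here are hypothetical.

```python
def exceedance_factor(clicks_made, clicks_required):
    """Factor by which the required number of clicks was exceeded on one trial."""
    if clicks_required <= 0:
        raise ValueError("clicks_required must be positive")
    return clicks_made / clicks_required


def mean_exceedance(trials):
    """Average the factor over accepted-and-succeeded trials only.

    `trials` is a list of dicts with hypothetical keys 'accepted',
    'succeeded', 'clicks_made', and 'clicks_required'. Returns None
    if the participant has no qualifying trials.
    """
    factors = [
        exceedance_factor(t["clicks_made"], t["clicks_required"])
        for t in trials
        if t["accepted"] and t["succeeded"]
    ]
    return sum(factors) / len(factors) if factors else None


if __name__ == "__main__":
    example = [
        {"accepted": True, "succeeded": True, "clicks_made": 13, "clicks_required": 10},
        {"accepted": True, "succeeded": True, "clicks_made": 11, "clicks_required": 10},
        {"accepted": True, "succeeded": False, "clicks_made": 4, "clicks_required": 10},
        {"accepted": False, "succeeded": False, "clicks_made": 0, "clicks_required": 10},
    ]
    print(mean_exceedance(example))
```

Rejected and failed trials are excluded before averaging, so the measure reflects only effort exerted beyond the target on successful challenges.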

      - Perhaps relatedly, there is evidence that people with depression show less of an optimism bias in their predictions about future outcomes. As such, they show more "rational" choices in probabilistic decision tasks. I'm curious whether the Authors think that a weaker choice bias among those with stronger depression/anhedonia/apathy might be related. Also, are choices better matched with actual effort production among those with depression?

      We think this is a very interesting comment, but unfortunately feel our manuscript cannot properly speak to it: as in our response to the previous comment, our exploratory analysis linking the proportion of accepted but failed trials to anhedonia symptoms (i.e. less anhedonic people making more optimistic judgments of their likelihood of success) did not show a relationship between the two. However, this null finding may be the result of our task design, which is not laid out to capture such an effect (and in fact minimizes trials of this nature). We have added the following to the Discussion section.

      Lines 442 – 445:

      “It is possible that a higher motivational tendency reflects a more optimistic assessment of future task success, in line with work on the optimism bias95; however our task intentionally minimized unsuccessful trials by titrating effort and reward; future studies should explore this more directly.

      (95) Korn, C. W., Sharot, T., Walter, H., Heekeren, H. R. & Dolan, R. J. Depression is related to an absence of optimistically biased belief updating about future life events. Psychological Medicine 44, 579–592 (2014).”

      - The manuscript does not clarify: How did the Authors ensure that each subject received each effort-reward combination at least once if a given subject always accepted or always rejected offers?

      We have made the following edit to the Methods section to better explain this aspect of our task design.

      Lines 642 – 655:

      “For each subject, trial-by-trial presentation of effort-reward combinations was made semi-adaptively by 16 randomly interleaved staircases. Each of the 16 possible offers (4 effort-levels x 4 reward-levels) served as the starting point of one of the 16 staircases. Within each staircase, after a subject accepted a challenge, the next trial’s offer on that staircase was adjusted (by increasing effort or decreasing reward). After a subject rejected a challenge, the next offer on that staircase was adjusted by decreasing effort or increasing reward. This ensured subjects received each effort-reward combination at least once (as each participant completed all 16 staircases), while individualizing trial presentation to maximize the trials’ informative value. Therefore, in practice, even in the case of a subject rejecting all offers (and hence the staircasing procedures always adapting by decreasing effort or increasing reward), the full range of effort-reward combinations will be represented in the task across the starting points of all staircases (and therefore before adaptation takes place).”
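The staircase logic in this passage can be sketched in a few lines. Only the grid of offers and the direction of adjustment come from the text; the rest is assumed for illustration: one dimension is picked at random among those that can still move, the step size is one level, an offer at the edge of the grid stays unchanged, and `trials_per_staircase` is a hypothetical parameter.

```python
import random
from itertools import product

EFFORTS = (1, 2, 3, 4)
REWARDS = (2, 3, 4, 5)


def make_staircases():
    """One staircase per effort-reward combination, each starting at that offer."""
    return [{"effort": e, "reward": r} for e, r in product(EFFORTS, REWARDS)]


def update(offer, accepted, rng):
    """Return the next offer on a staircase.

    After an accept the offer gets less attractive (more effort or less
    reward); after a reject it gets more attractive. Assumption: one
    dimension is chosen at random among those still inside the grid.
    """
    step = 1 if accepted else -1
    candidates = []
    if min(EFFORTS) <= offer["effort"] + step <= max(EFFORTS):
        candidates.append(("effort", offer["effort"] + step))
    if min(REWARDS) <= offer["reward"] - step <= max(REWARDS):
        candidates.append(("reward", offer["reward"] - step))
    if not candidates:
        return dict(offer)  # already at the bound on both dimensions
    key, value = rng.choice(candidates)
    return {**offer, key: value}


def run_session(decide, trials_per_staircase=4, seed=0):
    """Present offers from randomly interleaved staircases.

    `decide(offer)` returns True to accept. Returns the list of
    (effort, reward) offers in presentation order.
    """
    rng = random.Random(seed)
    staircases = make_staircases()
    order = [i for i in range(len(staircases)) for _ in range(trials_per_staircase)]
    rng.shuffle(order)
    presented = []
    for i in order:
        offer = staircases[i]
        presented.append((offer["effort"], offer["reward"]))
        staircases[i] = update(offer, decide(offer), rng)
    return presented


if __name__ == "__main__":
    offers = run_session(lambda offer: False)  # a subject who rejects everything
    print(len(set(offers)), "distinct offers seen out of", len(EFFORTS) * len(REWARDS))
```

Since each staircase presents its starting offer before any adaptation, a subject who always rejects (or always accepts) still sees all 16 combinations at least once in this sketch.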

      - The word "metabolic" is misspelled in Table 1

      - Figure 2 is missing panel label "C"

      - The word "effort" is repeated on line 448.

      We thank the Reviewer for their attentive reading of our manuscript and have corrected the mistakes mentioned.

      Reviewer #3 (Recommendations For The Authors):

      It is a bit difficult to get a sense of people's discounting from the plots provided. Could the authors show a few example individuals and their fits (i.e., how steep was effort discounting on average and how much variance was there across individuals; maybe they could show the mean discount function or some examples etc)

      We appreciate very much the Reviewer's suggestion to visualise our parameter estimates within and across individuals. We have implemented this in Figure S2.

      It would be helpful if correlations between the various markers used as dependent variables (SHAPS, DARS, AES, chronotype etc) could be plotted as part of each related figure (e.g., next to the relevant effects shown).

      We agree with the Reviewer that a visual representation of the various correlations between dependent variables would be a better and more accessible communication than our current paragraph listing the correlations. We have implemented this by adding a new figure plotting all correlations in a heat map, with asterisks indicating significance.

      The authors use the term "meaningful relationship" - how is this defined? If undefined, maybe consider changing (do they mean significant?)

      We understand how our use of the term “(no) meaningful relationship” was confusing here. As we conducted most analyses in a Bayesian fashion, ‘meaningful’ has a formal definition: the 95% highest density interval does not span 0. However, we do not want this to be misunderstood as frequentist “significance” and agree clarity can be improved here. To avoid confusion, we have amended the manuscript where relevant (i.e., we now state “we found a (/no) relationship / effect” rather than “we found a meaningful relationship”).
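The decision rule described here can be made concrete with a small sketch that computes a highest density interval from posterior samples and checks whether it spans zero. This is a generic sample-based HDI (the narrowest window containing the requested share of sorted samples), not necessarily the routine the authors used; the function names are hypothetical.

```python
import math


def hdi(samples, mass=0.95):
    """Narrowest interval containing `mass` of the posterior samples."""
    ordered = sorted(samples)
    n = len(ordered)
    if n < 2:
        raise ValueError("need at least two samples")
    # Number of consecutive sorted samples the interval must cover.
    width = math.ceil(round(mass * n, 9))
    width = min(max(width, 2), n)
    best_lo, best_hi = ordered[0], ordered[width - 1]
    for start in range(1, n - width + 1):
        lo, hi = ordered[start], ordered[start + width - 1]
        if hi - lo < best_hi - best_lo:
            best_lo, best_hi = lo, hi
    return best_lo, best_hi


def excludes_zero(interval):
    """True if the interval lies entirely above or entirely below zero."""
    lo, hi = interval
    return lo > 0 or hi < 0


if __name__ == "__main__":
    toy_posterior = list(range(1, 101))  # stand-in for MCMC draws
    interval = hdi(toy_posterior)
    print(interval, excludes_zero(interval))
```

Under this rule an effect is reported when `excludes_zero` is true for the 95% interval, which differs from a frequentist significance test even though the two often agree in practice.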

      The authors do not include an inverse temperature parameter in their discounting models-can they motivate why? If a participant chose nearly randomly, which set of parameter values would they get assigned?

      Our decision to not include an inverse temperature parameter was made after an extensive simulation-based investigation of different models and task designs. A series of parameter recovery studies including models with an inverse temperature parameter revealed the inverse temperature parameter could not be distinguished from the reward sensitivity parameter. Specifically, inverse temperature seemed to capture the variance of the true underlying reward sensitivity parameter, leading to confounding between the two. Hence, including both reward sensitivity and inverse temperature would not have allowed us to reliably estimate either parameter. As our pre-registered hypotheses related to the reward sensitivity parameter, we opted to include models with the reward sensitivity parameter rather than the inverse temperature parameter in our model space. We have now added these simulations to our supplement.

      Nevertheless, we believe our models can capture random decision-making. The parameters of effort and reward sensitivity capture how sensitive one is to changes in effort/reward level. Hence, random decision-making can be interpreted as low effort and reward sensitivity, such that one’s decision-making is not guided by changes in effort and reward magnitude. With low effort/reward sensitivity, the motivational tendency parameter (previously “choice bias”) would capture to what extent this random decision-making is biased toward accepting or rejecting offers.

      The simulation results are now detailed in the Supplement.

      Lines 25 – 46:

      “1.2.1 Parameter recoveries including inverse temperature

      In the process of task and model space development, we also considered models incorporating an inverse temperature parameter. To this end, we conducted parameter recoveries for four models, defined in Table S3.

      Parameter recoveries indicated that parameters can be recovered reliably in model 1, which includes only effort sensitivity and inverse temperature as free parameters (on-diagonal correlations: .98 > r > .89, off-diagonal correlations: .04 > |r| > .004). However, as a reward sensitivity parameter is added to the model (model 2), parameter recovery seems to be compromised, as parameters are estimated less accurately (on-diagonal correlations: .80 > r > .68), and spurious correlations between parameters emerge (off-diagonal correlations: .40 > |r| > .17). This issue remains when motivational tendency is added to the model (model 4; on-diagonal correlations: .90 > r > .65; off-diagonal correlations: .28 > |r| > .03), but not when inverse temperature is modelled with effort sensitivity and motivational tendency, but not reward sensitivity (model 3; on-diagonal correlations: .96 > r > .73; off-diagonal correlations: .05 > |r| > .003).

      As our pre-registered hypotheses related to the reward sensitivity parameter, we opted to include models with the reward sensitivity parameter rather than the inverse temperature parameter in our model space.”
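The on-diagonal and off-diagonal correlations quoted above can be read as entries of a matrix that correlates each simulated parameter with each recovered one. The sketch below shows one way to compute and summarize such a matrix. It assumes the matrix is built from simulated versus recovered values per synthetic participant, which the excerpt does not spell out, and all function and parameter names are hypothetical.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def recovery_matrix(true_params, recovered_params):
    """Correlate each simulated ("true") parameter with each recovered one.

    Both arguments map parameter name -> list of values, one per
    simulated participant. Same-name pairs (on-diagonal) index recovery
    accuracy; different-name pairs (off-diagonal) index spurious
    trade-offs between parameters.
    """
    return {
        (t, r): pearson(true_params[t], recovered_params[r])
        for t in true_params
        for r in recovered_params
    }


def summarize(matrix):
    """Return (min, max) of on-diagonal r and (min, max) of off-diagonal |r|."""
    on = [v for (t, r), v in matrix.items() if t == r]
    off = [abs(v) for (t, r), v in matrix.items() if t != r]
    return (min(on), max(on)), (min(off), max(off))


if __name__ == "__main__":
    simulated = {"effort": [1, 2, 3, 4], "reward": [1, -1, -1, 1]}
    recovered = {"effort": [2, 4, 6, 8], "reward": [1, -1, -1, 1]}
    print(summarize(recovery_matrix(simulated, recovered)))
```

High on-diagonal values together with off-diagonal values near zero correspond to the pattern reported for models 1 and 3, whereas lower on-diagonal and larger off-diagonal values correspond to the compromised recovery reported for models 2 and 4.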

      And we now discuss random decision-making specifically in the Methods section.

      Lines 698 – 709:

      “First, a cost function transforms costs and rewards associated with an action into a subjective value (SV):

      with and for reward and effort sensitivity, and ℛ and 𝐸 for reward and effort. Higher effort and reward sensitivity mean the SV is more strongly influenced by changes in effort and reward, respectively (Fig. 3B-C). Hence, low effort and reward sensitivity mean the SV, and with that decision-making, is less guided by effort and reward offers, as would be in random decision-making.

      This SV is then transformed to an acceptance probability by a softmax function:

      with for the predicted acceptance probability and 𝛼 for the intercept representing motivational tendency. A high motivational tendency means a subject has a tendency, or bias, to accept rather than reject offers (Fig. 3D).”

      The pre-registration mentions effects of BMI and risk of metabolic disease-those are briefly reported in the factor loadings, but not discussed afterwards-although the authors stated hypotheses regarding these measures in their preregistration. Were those hypotheses supported?

      We reported these results (albeit only briefly) in the factor loadings resulting from our PLS regression and results from follow-up GLMs (see below). We have now amended the Discussion to enable further elaboration on whether they confirmed our hypotheses (this evidence was unclear, but we have subsequently followed up in a sample with type-2 diabetes, who also show reduced motivational tendency).

      Lines 258 – 261:

      “For the MEQ (95%HDI=[-0.09,0.06]), MCTQ (95%HDI=[-0.17,0.05]), BMI (95%HDI=[-0.19,0.01]), and FINDRISC (95%HDI=[-0.09,0.03]) no relationship with motivational tendency was found, consistent with the smaller magnitude of reported component loadings from the PLS regression.”

      We have added the following paragraph to our discussion.

      Lines 491 – 502:

      “To our surprise, we did not find statistical evidence for a relationship between effort-based decision-making and measures of metabolic health (BMI and risk for type-2 diabetes). Our analyses linking BMI to motivational tendency reveal a numeric effect in line with our hypothesis: a higher BMI relating to a lower motivational tendency. However, the 95% HDI for this effect narrowly included zero (95%HDI=[-0.19,0.01]). Possibly, our sample did not have sufficient variance in metabolic health to detect dimensional metabolic effects in a current general population sample. A recent study by our group investigates the same neurocomputational parameters of effort-based decision-making in participants with type-2 diabetes and non-diabetic controls matched by age, gender, and physical activity105. We report a group effect on the motivational tendency parameter, with type-2 diabetic patients showing a lower tendency to exert effort for reward.”

      “(105) Mehrhof, S. Z., Fleming, H. A. & Nord, C. A cognitive signature of metabolic health in effort-based decision-making. Preprint at https://doi.org/10.31234/osf.io/4bkm9 (2024).”

      R-values are indicated as a range (e.g., from 0.07-0.72 for the last one in 2.1 which is a large range). As mentioned above, the full correlation matrix should be reported in figures as heatmaps.

      We agree with the Reviewer that a heatmap is a better way of conveying this information – see Figure 1 in response to their previous comment.  

      The answer on whether data was already collected is missing on the second preregistration link. Maybe this is worth commenting on somewhere in the manuscript.

      This question appears missing because, as detailed in the manuscript, we felt that technically some data *was* already collected by the time our second pre-registration was posted. This is because the second pre-registration detailed an additional data collection, with the goal of extending data from the original dataset to include extreme chronotypes and increase precision of analyses. To avoid any confusion regarding the lack of reply to this question in the pre-registration, we have added the following disclaimer to the description of the second pre-registration:

      “Please note the lack of response to the question regarding already collected data. This is because the data collection in the current pre-registration extends data from the original dataset to increase the precision of analyses. While this original data is already collected, none of the data collection described here has taken place.”

      Some referencing is not reflective of the current state of the field (e.g., for effort discounting: Sugiwaka et al., 2004 is cited). There are multiple labs that have published on this since then including Philippe Tobler's and Sven Bestmann's groups (e.g., Hartmann et al., 2013; Klein-Flügge et al., Plos CB, 2015).

      We agree absolutely, and have added additional, more recent references on effort discounting.

      Lines 67 – 68:

      “Higher costs devalue associated rewards, an effect referred to as effort-discounting33–37.”

      (33) Sugiwaka, H. & Okouchi, H. Reformative self-control and discounting of reward value by delay or effort. Japanese Psychological Research 46, 1–9 (2004).

      (34) Hartmann, M. N., Hager, O. M., Tobler, P. N. & Kaiser, S. Parabolic discounting of monetary rewards by physical effort. Behavioural Processes 100, 192–196 (2013).

      (35) Klein-Flügge, M. C., Kennerley, S. W., Saraiva, A. C., Penny, W. D. & Bestmann, S. Behavioral Modeling of Human Choices Reveals Dissociable Effects of Physical Effort and Temporal Delay on Reward Devaluation. PLOS Computational Biology 11, e1004116 (2015).

      (36) Białaszek, W., Marcowski, P. & Ostaszewski, P. Physical and cognitive effort discounting across different reward magnitudes: Tests of discounting models. PLOS ONE 12, e0182353 (2017).

      (37) Ostaszewski, P., Bąbel, P. & Swebodziński, B. Physical and cognitive effort discounting of hypothetical monetary rewards. Japanese Psychological Research 55, 329–337 (2013).

      There are lots of typos throughout (e.g., Supplementary martial, Mornignness etc)

      We thank the Reviewer for their attentive reading of our manuscript and have corrected our mistakes.

      In Table 1, it is not clear what the numbers given in parentheses are. The figure note mentions SD, IQR, and those are explicitly specified for some rows, but not all.

      After reviewing Table 1 we understand the comment regarding the clarity of the number in parentheses. In our original manuscript, for some variables, numbers were given per category (e.g. for gender and ethnicity), rather than per row, in which case the parenthetical statistic was indicated in the header row only. However, we now see that the clarity of the table would have been improved by adding the reported statistic for each row—we have corrected this.

      In Figure 1C, it would be much more helpful if the different panels were combined into one single panel (using differently coloured dots/lines instead of bars).

      We agree visualizing the proportion of accepted trials across effort and reward levels in one single panel aids interpretability. We have implemented it in the following plot (now Figure 2C).

      In Sections 2.2.1 and 4.2.1, the authors mention "mixed-effects analysis of variance (ANOVA) of repeated measures" (same in the preregistration). It is not clear if this is a standard RM-ANOVA (aggregating data per participant per condition) or a mixed-effects model (analysing data on a trial-by-trial level). This model seems to only include within-subjects variable, so it isn't a "mixed ANOVA" mixing within and between subjects effects.

      We apologise that our use of the term "mixed-effects analysis of variance (ANOVA) of repeated measures" is indeed incorrectly applied here. We aggregate data per participant and effort-by-reward combination, meaning there are no between-subject effects tested. We have corrected this to “repeated measures ANOVA”.

      In Section 2.2.2, the authors write "R-hats>1.002" but probably mean "R-hats < 1.002". ESS is hard to evaluate unless the total number of samples is given.

      We thank the Reviewer for noticing this mistake and have corrected it in the manuscript.

      In Section 2.3, the inference criterion is unclear. The authors first report "factor loadings" and then perform a permutation test that is not further explained. Which of these factors are actually needed for predicting choice bias out of chance? The permutation test suggests that the null hypothesis is just "none of these measures contributes anything to predicting choice bias", which is already falsified if only one of them shows an association with choice bias. It would be relevant to know for which measures this is the case. Specifically, it would be relevant to know whether adding circadian measures into a model that already contains apathy/anhedonia improves predictive performance.

      We understand the Reviewer’s concerns regarding the detail of explanation we have provided for this part of our analysis, but we believe there may have been a misunderstanding regarding the partial least squares (PLS) regression. Rather than identifying a number of factors to predict the outcome variable, a PLS regression identifies a model with one or multiple components, with various factor loadings of differing magnitude. In our case, the PLS regression identified a model with one component to best predict our outcome variable (motivational tendency, which in our previous version we called choice bias). This one component had factor loadings of our questionnaire-based measures, with measures of apathy and anhedonia having the highest weights, followed by lesser-weighted factor loadings for measures of circadian rhythm and metabolic health. The permutation test tests whether this component (consisting of the combination of factor loadings) can predict the outcome variable out of sample.

      We hope we have improved clarity on this in the manuscript by making the following edits to the Results section.

      Lines 248 – 251:

      “Permutation testing indicated the predictive value of the resulting component (with factor loadings described above) was significant out-of-sample (root-mean-squared error [RMSE]=0.203, p=.001).”

      Further, we hope to provide a more in-depth explanation of these results in the Methods section.

      Lines 755 – 759:

      “Statistical significance of obtained effects (i.e., the predictive accuracy of the identified component and factor loadings) was assessed by permutation tests, probing the proportion of root-mean-squared errors (RMSEs) indicating stronger or equally strong predictive accuracy under the null hypothesis.”
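      The permutation logic described above can be sketched as follows. This is an illustration under stated assumptions, not the authors' pipeline: ordinary least squares stands in for the PLS regression, a single train/test split stands in for their out-of-sample scheme, and the p-value is the plain proportion of permuted-outcome RMSEs that are at least as good as the observed one. All function names and the toy data are hypothetical.

```python
import numpy as np

def rmse(y_true, y_pred):
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

def out_of_sample_rmse(X, y, n_train):
    """Fit on the first n_train rows (OLS as a stand-in for PLS), score on the rest."""
    X_train = np.column_stack([np.ones(n_train), X[:n_train]])
    X_test = np.column_stack([np.ones(len(y) - n_train), X[n_train:]])
    coef, *_ = np.linalg.lstsq(X_train, y[:n_train], rcond=None)
    return rmse(y[n_train:], X_test @ coef)

def permutation_p(X, y, n_train, n_perm=1000, seed=0):
    """Proportion of null RMSEs (shuffled outcome) that are <= the observed RMSE."""
    rng = np.random.default_rng(seed)
    observed = out_of_sample_rmse(X, y, n_train)
    null = np.array([out_of_sample_rmse(X, rng.permutation(y), n_train)
                     for _ in range(n_perm)])
    return observed, float(np.mean(null <= observed))

# Toy data with a genuine predictor-outcome relationship (hypothetical values)
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 3))
y = X @ np.array([0.5, -0.3, 0.2]) + rng.normal(scale=0.2, size=200)
observed_rmse, p_value = permutation_p(X, y, n_train=150, n_perm=200)
```

      A small p-value here means that shuffling the outcome rarely yields predictions as accurate as those from the intact data, which is the sense in which the component is said to predict out of sample.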

      In Section 2.5, the authors simply report "that chronotype showed effects of chronotype on reward sensitivity", but the direction of the effect (higher reward sensitivity in early vs. late chronotype) remains unclear.

      We thank the Reviewer for pointing this out. While we did report the direction of effect, this was only presented in the subsequent parentheticals and could have been made much clearer. To assist with this, we have made the following addition to the text.

      Lines 317 – 320:

      “Bayesian GLMs, controlling for age and gender, predicting task parameters by time-of-day and chronotype showed effects of chronotype on reward sensitivity (i.e. those with a late chronotype had a higher reward sensitivity; M=0.325, 95% HDI=[0.19,0.46])”

      In Section 4.2, the authors write that they "implemented a previously-described procedure using Prolific pre-screeners", but no reference to this previous description is given.

      We thank the Reviewer for bringing our attention to this missing reference, which has now been added to the manuscript.

      In Supplementary Table S2, only the "on-diagonal correlations" are given, but off-diagonal correlations (indicative of trade-offs between parameters) would also be informative.

      We agree with the Reviewer that off-diagonal correlations between underlying and recovered parameters are crucial to assess confounding between parameters during model estimation. We reported this in Figure S1D, where we present the full correlation matrix between underlying and recovered parameters in a heatmap. We have now noticed that this plot was missing axis labels, which have now been added.

      I found it somewhat difficult to follow the results section without having read the methods section beforehand. At the beginning of the Results section, could the authors briefly sketch the outline of their study? Also, given they have a pre-registration, could the authors introduce each section with a statement of what they expected to find, and close with whether the data confirmed their expectations? In the current version of the manuscript, many results are presented without much context of what they mean.

      We agree a brief outline of the study procedure before reporting the results would be beneficial to following the subsequent text and have added the following to the end of our Introduction.

      Lines 101 – 106:

      “Here, we tested the relationship between motivational decision-making and three key neuropsychiatric syndromes: anhedonia, apathy, and depression, taking both a transdiagnostic and categorical (diagnostic) approach. To do this, we validate a newly developed effort-expenditure task, designed for online testing, and gamified to increase engagement. Participants completed the effort-expenditure task online, followed by a series of self-report questionnaires.”

      We have added references to our pre-registered hypotheses at multiple points in our manuscript.

      Lines 185 – 187:

      “In line with our pre-registered hypotheses, we found significant main effects for effort (F(1,14367)=4961.07, p<.0001) and reward (F(1,14367)=3037.91, p<.001), and a significant interaction between the two (F(1,14367)=1703.24, p<.001).”

      Lines 215 – 221:

      “Model comparison by out-of-sample predictive accuracy identified the model implementing three parameters (motivational tendency a, reward sensitivity , and effort sensitivity ), with a parabolic cost function (subsequently referred to as the full parabolic model) as the winning model (leave-one-out information criterion [LOOIC; lower is better] = 29734.8; expected log posterior density [ELPD; higher is better] = -14867.4; Fig. 31ED). This was in line with our pre-registered hypotheses.”
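      The two fit indices quoted above are internally consistent under the usual deviance-scale convention, in which LOOIC is the ELPD multiplied by −2. A quick arithmetic check using the reported values (variable names are illustrative):

```python
# Reported expected log posterior density for the winning model
elpd = -14867.4
# Deviance-scale convention: LOOIC = -2 * ELPD
looic = -2 * elpd
```

      The result matches the reported LOOIC of 29734.8, so "lower LOOIC" and "higher ELPD" point to the same model ranking.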

      Lines 252 – 258:

      “Bayesian GLMs confirmed evidence for psychiatric questionnaire measures predicting motivational tendency (SHAPS: M=-0.109; 95% highest density interval (HDI)=[-0.17,-0.04]; AES: M=-0.096; 95%HDI=[-0.15,-0.03]; DARS: M=-0.061; 95%HDI=[-0.13,-0.01]; Fig. 4A). Post-hoc GLMs on DARS sub-scales showed an effect for the sensory subscale (M=-0.050; 95%HDI=[-0.10,-0.01]). This result of neuropsychiatric symptoms predicting a lower motivational tendency is in line with our pre-registered hypothesis.”

      Lines 258 – 263:

      “For the MEQ (95%HDI=[-0.09,0.06]), MCTQ (95%HDI=[-0.17,0.05]), BMI (95%HDI=[-0.19,0.01]), and FINDRISC (95%HDI=[-0.09,0.03]) no meaningful relationship with motivational tendency was found, consistent with the smaller magnitude of reported component loadings from the PLS regression. This null finding for dimensional measures of circadian rhythm and metabolic health was not in line with our pre-registered hypotheses.”

      Lines 268 – 270:

      “For reward sensitivity, the intercept-only model outperformed models incorporating questionnaire predictors based on RMSE. This result was not in line with our pre-registered expectations.”

      Lines 295 – 298:

      “As in our transdiagnostic analyses of continuous neuropsychiatric measures (Results 2.3), we found evidence for a lower motivational tendency parameter in the MDD group compared to HCs (M=-0.111, 95% HDI=[-0.20,-0.03]) (Fig. 4B). This result confirmed our pre-registered hypothesis.”

      Lines 344 – 355:

      “Late chronotypes showed a lower motivational tendency than early chronotypes (M=-0.11, 95% HDI=[-0.22,-0.02])—comparable to effects of transdiagnostic measures of apathy and anhedonia, as well as diagnostic criteria for depression. Crucially, we found motivational tendency was modulated by an interaction between chronotype and time-of-day (M=0.19, 95% HDI=[0.05,0.33]): post-hoc GLMs in each chronotype group showed this was driven by a time-of-day effect within late, rather than early, chronotype participants (M=0.12, 95% HDI=[0.02,0.22], such that late chronotype participants showed a lower motivational tendency in the morning testing sessions, and a higher motivational tendency in the evening testing sessions; early chronotype: 95% HDI=[-0.16,0.04]) (Fig. 5A). These results of a main effect and an interaction effect of chronotype on motivational tendency confirmed our pre-registered hypothesis.”

      Lines 390 – 393:

      “Participants with an early chronotype had a lower reward sensitivity parameter than those with a late chronotype (M=0.27, 95% HDI=[0.16,0.38]). We found no effect of time-of-day on reward sensitivity (95%HDI=[-0.09,0.11]) (Fig. 5B). These results were in line with our pre-registered hypotheses.”

    1. Author response:

      The following is the authors’ response to the current reviews.

      Reviewer #1 (Public Review):

      Comments on revisions:

      This revision addressed all my previous comments.

      Reviewer #3 (Public Review):

      Comments on revisions:

      The authors addressed my comments and it is ready for publication.

      We are grateful for the reviewers’ effort and are encouraged by their generally positive assessment of our manuscript.

      Reviewer #1 (Recommendations For The Authors):

      This revision addressed all my previous comments. The only new issue concerns the authors’ response to the following comment of reviewer 3:

      (2) Authors note “monovalent positive salt ions such as Na+ can be attracted, somewhat counterintuitively, into biomolecular condensates scaffolded by positively-charged polyelectrolytic IDRs in the presence of divalent counterions”. This may be due to the fact that the divalent negative counterions present in the dense phase (as seen in the ternary phase diagrams) also recruit a small amount of Na+.

      Author reply: The reviewer’s comment is valid, as a physical explanation for this prediction is called for. Accordingly, the following sentence is added to p. 10, lines 27-29: ...

      Here are my comments on this issue. Most IDPs with a net positive charge still have negatively charged residues, which in theory can bind cations. In fact, Caprin1 has 3 negatively charged residues (same as A1-LCD). All-atom simulations of MacAinsh et al (ref 72) have shown that these negatively charged residues bind Na+; I assume this effect can be captured by the coarse-grained models in the present study. Moreover, all-atom simulations showed that Na+ has a strong tendency to be coordinated by backbone carbonyls, which of course are present on all residues. Suggestions:

      (a) The authors may want to analyze the binding partners of Na+. Are they predominantly the 3 negatively charged residues, or divalent counterions, or both?

      (b) The authors may want to discuss the potential underestimation of Na+ inside Caprin1 condensates due to the lack of explicit backbone carbonyls that can coordinate Na+ in their models. A similar problem applies to backbone amides that can coordinate anions, but to a lesser extent (see Fig. 3A of ref 72).

      The reviewer’s comments are well taken. Regarding the statement in the revised manuscript “This phenomenon arises because the positively charged monovalent salt ions are attracted to the negatively charged divalent counterions in the protein-condensed phase.”, it should first be noted that the statement was inferred from the model observation that Na+ is depleted in condensed Caprin1 (Fig. 2a) when the counterion is monovalent (an observation that was stated almost immediately preceding the quoted statement). To make this logical connection clearer as well as to address the reviewer’s point about the presence of negatively charged residues in Caprin1, we have modified this statement in the Version of Record (VOR) as follows:

      “This phenomenon most likely arises from the attraction of the positively charged monovalent salt ions to the negatively charged divalent counterions in the protein-condensed phase because although the three negatively charged D residues in Caprin1 can attract Na+, it is notable that Na+ is depleted in condensed Caprin1 when the counterion is monovalent (Fig. 2a).”

      The reviewer’s suggestion (a) of collecting statistics of Na+ interactions in the Caprin1 condensate is valuable and should be attempted in future studies since it is beyond the scope of the present work. Thus far, our coarse-grained molecular dynamics has considered only monovalent Cl− counterions. We do not have simulation data for divalent counterions.

      Following the reviewer’s suggestion (b), we have now added the following sentence in Discussion under the subheading “Effects of salt on biomolecular LLPS”:

      “In this regard, it should be noted that positively and negatively charged salt ions can also coordinate with backbone carbonyls and amides, respectively, in addition to coordinating with charged amino acid sidechains (MacAinsh et al., eLife 2024). The impact of such effects, which are not considered in the present coarse-grained models, should be ascertained by further investigations using atomic simulations (MacAinsh et al., eLife 2024; Rauscher & Pomès, eLife 2017; Zheng et al., J Phys Chem B 2020).”

      Here we have added a reference to Rauscher & Pomès, eLife 2017 to more accurately reflect progress made in atomic simulations of biomolecular condensates.

      More generally, regarding the reviewer’s comments on the merits of coarse-grained versus atomic approaches, we re-emphasize, as stated in our paper, that these approaches are complementary. Atomic approaches undoubtedly afford structurally and energetically high-resolution information. However, as it stands, simulations of the assembly-disassembly process of biomolecular condensate are nonideal because of difficulties in achieving equilibration even for a small model system with < 10 protein chains (MacAinsh et al., eLife 2024) although well-equilibrated simulations are possible for a reasonably-sized system with ∼ 30 chains when the main focus is on the condensed phase (Rauscher & Pomès, eLife 2017). In this context, coarse-grained models are valuable for assessing the energetic role of salt ions in the thermodynamic stability of biomolecular condensates of physically reasonable sizes under equilibrium conditions.

      In addition to the above minor additions, we have also added citations in the VOR to two highly relevant recent papers: Posey et al., J Am Chem Soc 2024 for salt-dependent biomolecular condensation (mentioned in Discussion under subheadings “Tielines in protein-salt phase diagrams” and “Counterion valency” together with added references to Hribar et al., J Am Chem Soc 2002 and Nostro & Ninham, Chem Rev 2012 for the Hofmeister phenomena discussed by Posey et al.) and Zhu et al., J Mol Cell Biol 2024 for ATP-modulated reentrant behavior (mentioned in Introduction). We have also added back a reference to our previous work Lin et al., J Mol Liq 2017 to provide more background information for our formulation.

      Reviewer #2 (Recommendations For The Authors):

      The authors have done a great job addressing previous comments.

      We thank this reviewer for his/her effort and are encouraged by the positive assessment of our revised manuscript.

      ---

      The following is the authors’ response to the original reviews.

      Reviewer #1 (Public Review):

      Summary:

      The authors used multiple approaches to study salt effects in liquid-liquid phase separation (LLPS). Results on both wild-type Caprin1 and mutants and on different types of salts contribute to a comprehensive understanding.

      Strengths:

      The main strength of this work is the thoroughness of investigation. This aspect is highlighted by the multiple approaches used in the study, and reinforced by the multiple protein variants and different salts studied.

      We are encouraged by this positive overall assessment.

      Weaknesses: (1) The multiple computational approaches are a strength, but they’re cruder than explicit-solvent all-atom molecular dynamics (MD) simulations and may miss subtle effects of salts. In particular, all-atom MD simulations demonstrate that high salt strengthens pi-types of interactions (ref. 42 and MacAinsh et al, https://www.biorxiv.org/content/10.1101/2024.05.26.596000v3).

      The relative strengths and limitations of coarse-grained vs all-atom simulation are now more prominently discussed beginning at the bottom of p. 5 through the first 8 lines of p. 6 of the revised manuscript (page numbers throughout this letter refer to those in the submitted pdf file of the revised manuscript), with MacAinsh et al. included in this added discussion (cited as ref. 72 in the revised manuscript). The fact that coarse-grained simulation may not provide insights into more subtle structural and energetic effects afforded by all-atom simulations with regard to π-related interaction is now further emphasized on p. 11 (lines 23–30), with reference to MacAinsh et al. as well as original ref. 42 (Krainer et al., now ref. 50 in the revised manuscript).

      (2) The paper can be improved by distilling the various results into a simple set of conclusions. For example, based on salt effects revealed by all-atom MD simulations, MacAinsh et al. presented a sequence-based predictor for classes of salt dependence. Wild-type Caprin1 fits right into the “high net charge” class, with a high net charge and a high aromatic content, showing no LLPS at 0 NaCl and an increasing tendency of LLPS with increasing NaCl. In contrast, pY-Caprin1 belongs to the “screening” class, with a high level of charged residues and showing a decreasing tendency of LLPS.

      This is a helpful suggestion. We have now added a subsection with heading “Overview of key observations from complementary approaches” at the beginning of the “Results” section on p. 6 (lines 18–37) and the first line of p. 7. In the same vein, a few concise sentences to summarize our key results are added to the first paragraph of “Discussion” (p. 18, lines 23– 26). In particular, the relationship of Caprin1 and pY-Caprin1 with the recent classification by MacAinsh et al. (ref. 72) in terms of “high net charge” and “screening” classes is now also stated, as suggested by this reviewer, on p. 18 under “Discussion” (lines 26–30).

      (3) Mechanistic interpretations can be further simplified or clarified. (i) Reentrant salt effects (e.g., Fig. 4a) are reported but no simple explanation seems to have been provided. Fig. 4a,b look very similar to what has been reported as strong-attraction promoter and weak-attraction suppressor, respectively (ref. 50; see also PMC5928213 Fig. 2d,b). According to the latter two studies, the “reentrant” behavior of a strong-attraction promoter, Cl- in the present case, is due to Cl-mediated attraction at low to medium [NaCl] and repulsion between Cl- ions at high salt. Do the authors agree with this explanation? If not, could they provide another simple physical explanation? (ii) The authors attributed the promotional effect of Cl- to counterion-bridged interchain contacts, based on a single instance. There is another simple explanation, i.e., neutralization of the net charge on Caprin1. The authors should analyze their simulation results to distinguish net charge neutralization and interchain bridging; see MacAinsh et al.

      The relationship of Cl− in bridging and neutralizing configurations, respectively, with the classification of “strong-attraction promoter” and “weak-attraction suppressor” by Zhou and coworkers is now stated on p. 13 (lines 29–31), with reference to original ref. 50 by Ghosh, Mazarakos & Zhou (now ref. 59 in the revised manuscript) as well as the earlier patchy particle model study PMC5928213 by Nguemaha & Zhou, now cited as ref. 58 in the revised manuscript. After receiving this referee report, we have conducted an extensive survey of our coarse-grained MD data to provide a quantitative description of the prevalence of counterion (Cl−) bridging interactions linking positively charged arginines (Arg+s) on different Caprin1 chains in the condensed phase (using the [Na+] = 0 case as an example). The newly compiled data is reported under a new subsection heading “Explicit-ion MD offers insights into counterion-mediated interchain bridging interactions among condensed Caprin1 molecules” on p. 12 (last five lines)–p. 14 (first 10 lines) [∼1.5 additional pages] as well as a new Fig. 6 to depict the statistics of various Arg+–Cl−–Arg+ configurations, with the conclusion that a vast majority (at least 87%) of Cl− counterions in the Caprin1-condensed phase engage in favorable condensation-driving interchain bridging interactions.
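      The distinction between interchain bridging and single-chain neutralization can be illustrated with a simple distance-based classification. The sketch below is hypothetical and is not the authors' analysis: the cutoff value, the coordinates, and the function name `classify_counterions` are assumptions, and periodic boundary conditions are ignored for brevity.

```python
import numpy as np

def classify_counterions(cl_xyz, arg_xyz, arg_chain, cutoff):
    """Label each Cl- by how many distinct chains contribute an Arg+ within the cutoff."""
    labels = []
    for ion in cl_xyz:
        dist = np.linalg.norm(arg_xyz - ion, axis=1)
        chains = set(arg_chain[dist < cutoff].tolist())
        if len(chains) >= 2:
            labels.append("bridging")        # contacts Arg+ on two or more chains
        elif len(chains) == 1:
            labels.append("single-chain")    # neutralizes charge on one chain only
        else:
            labels.append("free")            # no Arg+ within the cutoff
    return labels

# Toy configuration (arbitrary length units): three Arg+ beads on two chains
arg_xyz = np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 0.0], [5.0, 0.0, 0.0]])
arg_chain = np.array([0, 1, 1])
cl_xyz = np.array([[0.5, 0.0, 0.0], [5.2, 0.0, 0.0], [20.0, 0.0, 0.0]])

labels = classify_counterions(cl_xyz, arg_xyz, arg_chain, cutoff=0.8)
bridging_fraction = labels.count("bridging") / len(labels)
```

      Averaging such labels over simulation frames would give the kind of prevalence statistic quoted above. The numerical outcome depends on the chosen cutoff and contact definition.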

      (4) The authors presented ATP-Mg both as a single ion and as two separate ions; there is no explanation of which of the two versions reflects reality. When presenting ATP-Mg as a single ion, it’s as though it forms a salt with Na+. I assume NaCl, ATP, and MgCl2 were used in the experiment. Why is Cl- not considered? Related to this point, it looks like ATP is just another salt ion studied and much of the Results section is on NaCl, so the emphasis on ATP (“Diverse Roles of ATP” in the title) is somewhat misleading.

      We model ATP and ATP-Mg both as single-bead ions (in rG-RPA) and also as structurally more realistic short multiple-bead polymers (in field-theoretic simulation, FTS). We have now added discussions to clarify our modeling rationale in using and comparing different models for ATP and ATP-Mg, as follows:

      p. 8 (lines 19–36):

      “The complementary nature of our multiple methodologies allows us to focus sharply on the electrostatic aspects of hydrolysis-independent role of ATP in biomolecular condensation by comparing ATP’s effects with those of simple salt. Here, Caprin1 and pY-Caprin1 are modeled minimally as heteropolymers of charged and neutral beads in rG-RPA and FTS. ATP and ATP-Mg are modeled as simple salts (single-bead ions) in rG-RPA whereas they are modeled with more structural complexity as short charged polymers (multiple-bead chains) in FTS, though the latter models are still highly coarse-grained. Despite this modeling difference, rG-RPA and FTS both rationalize experimentally observed ATP- and NaCl-modulated reentrant LLPS of Caprin1 and a lack of a similar reentrance for pY-Caprin1 as well as a prominent colocalization of ATP with the Caprin1 condensate. Consistently, the same contrasting trends in the effect of NaCl on Caprin1 and pY-Caprin1 are also seen in our coarse-grained MD simulations, though polymer field theories tend to overestimate LLPS propensity [99]. The robustness of the theoretical trends across different modeling platforms underscores electrostatics as a significant component in the diverse roles of ATP in the context of its well-documented ability to modulate biomolecular LLPS via hydrophobic and π-related effects [63, 65, 67].”

      Here, the last sentence quoted above addresses this reviewer’s question about our intended meaning in referring to “diverse roles of ATP” in the title of our paper. To make this point even clearer, we have also added the following sentence to the Abstract (p. 2, lines 12–13):

      “... The electrostatic nature of these features complements ATP’s involvement in π-related interactions and as an amphiphilic hydrotrope, ...”

      Moreover, to enhance readability, we have now added pointers in the rG-RPA part of our paper to anticipate the structurally more complex ATP and ATP-Mg models to be introduced subsequently in the FTS part, as follows:

      p. 9 (lines 13–15):

      “As mentioned above, in the present rG-RPA formulation, (ATP-Mg)2− and ATP4− are modeled minimally as a single-bead ion. They are represented by charged polymer models with more structural complexity in the FTS models below.”

      p. 11 (lines 8–11):

      “These observations from analytical theory will be corroborated by FTS below with the introduction of structurally more realistic models of (ATP-Mg)2−, ATP4− together with the possibility of simultaneous inclusion of Na+, Cl−, and Mg2+ in the FTS models of Caprin1/pY-Caprin1 LLPS systems.”

      Reviewer #2 (Public Review):

      Summary:

      In this paper, Lin and colleagues aim to understand the role of different salts on the phase behavior of a model protein of significant biological interest, Caprin1, and its phosphorylated variant, pY-Caprin1. To achieve this, the authors employed a variety of methods to complement experimental studies and obtain a molecular-level understanding of ion partitioning inside biomolecular condensates. A simple theory based on rG-RPA is shown to capture the different salt dependencies of Caprin1 and pY-Caprin1 phase separation, demonstrating excellent agreement with experimental results. The application of this theory to multivalent ions reveals many interesting features with the help of multicomponent phase diagrams. Additionally, the use of CG model-based MD simulations and FTS provides further clarity on how counterions can stabilize condensed phases.

      Strengths:

      The greatest strength of this study lies in the integration of various methods to obtain complementary information on thermodynamic phase diagrams and the molecular details of the phase separation process. The authors have also extended their previously proposed theoretical approaches, which should be of significant interest to other researchers. Some of the findings reported in this paper, such as bridging interactions, are likely to inspire new studies using higher-resolution atomistic MD simulations.

      Weaknesses:

      The paper does not have any major issues.

      We are very encouraged by this reviewer’s positive assessment of our work.

      Reviewer #3 (Public Review):

      Authors first use rG-RPA to reproduce two observed trends. Caprin1 does not phase separate at very low salt but then undergoes LLPS with added salt while further addition of salt reduces its propensity to LLPS. On the other hand pY-Caprin1 exhibits a monotonic trend where the propensity to phase separate decreases with the addition of salt. This distinction is captured by a two component model and also when salt ions are explicitly modeled as a separate species with a ternary phase diagram. The predicted ternary diagrams (when co and counter ions are explicitly accounted for) also predict the tendency of ions to co-condense or exclude proteins in the dense phase. Predicted trends are generally in line with the measurement for Cparin1 [sic]. Next, the authors seek to explain the observed difference in phase separation when Arginines are replaced by Lysines creating different variants. In the current rG-RPA type models both Arginine (R) and Lysine (K) are treated equally since non-electrostatic effects are only modeled in a meanfield manner that can be fitted but not predicted. For this reason, coarse grain MD simulation is suitable. Moreover, MD simulation affords structural features of the condensates. They used a force field that is capable of discriminating R and K. The MD predicted degrees of LLPS of these variants again is consistent with the measurement. One additional insight emerges from MD simulations that a negative ion can form a bridge between two positively charged residues on the chain. These insights are not possible to derive from rG-RPA. Both rG-RPA and MD simulation become cumbersome when considering multiple types of ions such as Na, Cl, [ATP] and [ATP-Mg] all present at the same time. FTS is well suited to handle this complexity. FTS also provides insights into the co-localization of ions and proteins that is consistent with NMR. 
By using different combinations of ions they confirm the robustness of the prediction that Caprin1 shows salt-dependent reentrant behavior, adding further support that the differential behavior of Caprin1, and pY-Caprin1 is likely to be mediated by charge-charge interactions.

      We are encouraged by this reviewer’s positive assessment of our manuscript.

      Reviewer #1 (Recommendations For The Authors):

      Analysis:

      Analyze the simulation results to distinguish net charge neutralization and interchain bridging; see MacAinsh et al.

      Please see response above to points (3) and (4) under “Weaknesses” in this reviewer’s public review. We have now added a 1.5-page subsection starting from the bottom of p. 12 to the top of p. 14 to discuss a new extensive analysis of Arg<sup>+</sup>–Cl<sup>−</sup>–Arg<sup>+</sup> configurations to identify bridging interactions, with key results reported in a new Fig. 6 (p. 42). Recent results from MacAinsh, Dey & Zhou (cited now as ref. 72) are included in the added discussion. Relevant advances made in MacAinsh et al., including clarification and classification of salt-mediated interactions in the phase separation of A1-LCD are now mentioned multiple times in the revised manuscript (p. 5, lines 19–20; p. 6, lines 2–5; p. 11, line 30; p. 14, line 10; p. 18, lines 28–29; and p. 20, line 4).

      Writing and presentation

      (1) Cite subtle effects that may be missed by the coarser approaches in this study

      Please see response above to point (1) under “Weaknesses” in this reviewer’s public review.

      (2) Try to distill the findings into a simple set of conclusions

      Please see response above to point (2) under “Weaknesses” in this reviewer’s public review.

      (3) Clarify and simplify physical interpretations

      Please see response above to point (2) under “Weaknesses” in this reviewer’s public review.

      (4) Explain the treatment of ATP-Mg as either a single ion or two separate ions; reconsider modifying the reference to ATP in the title

      Please see response above to point (4) under “Weaknesses” in this reviewer’s public review.

      (5) Minor points:

      p. 4, citation of ref 56: this work shows ATP is a driver of LLPS, not merely a regulator (promotor or suppressor)

      This citation to original ref. 56 (now ref. 63) on p. 4 is now corrected (bottom line of p. 4).

      p. 7 and throughout: “using bulk [Caprin1]” – I assume this is the initial overall Caprin1 concentration. It would avoid confusion to state such concentrations as “initial” or “initial overall”

      We have now added “initial overall concentration” in parentheses on p. 8 (line 4) to clarify the meaning of “bulk concentration”.

      p. 7 and throughout: both mM (also uM) and mg/ml have been used as units of protein concentration and that can cause confusion. Indeed, the authors seem to have confused themselves on p. 9, where 400 (750) mM is probably 400 (750) mg/ml. The same with the use of mM and M for salt concentrations (400 mM Mg2+ but 0.1 and 1.0 M Na+)

      Concentrations are now given in both molarity and mass density in Fig. 1 (p. 37), Fig. 2 (p. 38), Fig. 4 (p. 40), and Fig. 7 (p. 43), as noted in the text on p. 8 (lines 4–5). Inconsistencies and errors in quoting concentrations are now corrected (p. 10, line 18, and p. 11, line 2).

      p. 7, “LCST-like”: isn’t this more like a case of a closed coexistence curve that contains both UCST and LCST?

      The discussion on p. 8 around this observation from Fig. 1d is now expanded, including alluding to the theoretical possibility of a closed co-existence curve mentioned by this reviewer, as follows:

      “Interestingly, the decrease in some of the condensed-phase [pY-Caprin1]s with decreasing T (orange and green symbols for ≲ 20°C in Fig. 1d trending toward slightly lower [pY-Caprin1]) may suggest a hydrophobicity-driven lower critical solution temperature (LCST)-like reduction of LLPS propensity as temperature approaches ∼ 0°C as in cold denaturation of globular proteins [7,23] though the hypothetical LCST is below 0°C and therefore not experimentally accessible. If that is the case, the LLPS region would resemble those with both an UCST and a LCST [4]. As far as simple modeling is concerned, such a feature may be captured by a FH model wherein interchain contacts are favored by entropy at intermediate to low temperatures and by enthalpy at high temperatures, thus entailing a heat capacity contribution in χ(T), with [7,109,110] beyond the temperature-independent ϵ<sub>h</sub> and ϵ<sub>s</sub> used in Fig. 1c,d and Fig. 2. Alternatively, a reduction in overall condensed-phase concentration can also be caused by formation of heterogeneous locally organized structures with large voids at low temperatures even when interchain interactions are purely enthalpic (Fig. 4 of ref. [111]).”
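      The FH picture described in the passage above can be illustrated with a toy calculation. The following is a minimal sketch, not the fitted model of the paper: it assumes a contact enthalpy and a contact entropy that share one constant heat-capacity term, and the function names and all parameter values (`T0`, `DS0`, `DCP`, `N`) are hypothetical, chosen only so that the resulting LLPS window is easy to see.

```python
import math

# Illustrative (hypothetical) parameters in units of k_B; not fitted to Caprin1.
T0 = 300.0    # reference temperature (K) at which the contact enthalpy vanishes
DS0 = 0.8     # contact entropy gain at T0 (favorable, entropy-driven)
DCP = -40.0   # heat-capacity change on contact formation (negative)
N = 100       # polymer length used for the Flory-Huggins critical point


def chi(T):
    """Effective FH parameter chi(T) = -[dh(T) - T*ds(T)] / T.

    dh(T) = DCP*(T - T0) and ds(T) = DS0 + DCP*ln(T/T0), so contacts are
    favored by entropy below T0 and by enthalpy above T0.
    """
    dh = DCP * (T - T0)
    ds = DS0 + DCP * math.log(T / T0)
    return -(dh - T * ds) / T


def chi_critical(n):
    """Critical chi of a polymer of length n in a monomeric solvent."""
    return 0.5 * (1.0 + 1.0 / math.sqrt(n)) ** 2


def llps_window(t_min=200, t_max=400):
    """Integer temperatures (K) at which chi(T) exceeds the critical value."""
    chi_c = chi_critical(N)
    return [T for T in range(t_min, t_max + 1) if chi(T) > chi_c]


if __name__ == "__main__":
    window = llps_window()
    print(f"chi_c = {chi_critical(N):.3f}")
    print(f"LLPS predicted between about {window[0]} K and {window[-1]} K")
```

      In this sketch χ(T) peaks at `T0` and falls off on both sides, so phase separation is confined to an intermediate temperature window bounded by an LCST below and a UCST above. A temperature-independent pair ϵ<sub>h</sub>, ϵ<sub>s</sub> cannot produce such a window because χ(T) is then monotonic in T.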

      p. 8 “Caprin1 can undergo LLPS without the monovalent salt (Na+) ions (LLPS regions extend to [Na+] = 0 in Fig. 2e,f”: I don’t quite understand what’s going on here. Is the effect caused by a small amount of counterion (ATP-Mg) that’s calculated according to eq 1 (with z s set to 0)?

      The discussion of this result in Fig. 2e,f is now clarified as follows (p. 10, lines 8–14 in the revised manuscript):

      “The corresponding rG-RPA results (Fig. 2e–h) indicate that, in the presence of divalent counterions (needed for overall electric neutrality of the Caprin1 solution), Caprin1 can undergo LLPS without the monovalent salt (Na+) ions (LLPS regions extend to [Na+] = 0 in Fig. 2e,f; i.e., ρs = 0, ρc > 0 in Eq. (1)), because the configurational entropic cost of concentrating counterions in the Caprin1 condensed phase is lesser for divalent (zc = 2) than for monovalent (zc = 1) counterions as only half of the former are needed for approximate electric neutrality in the condensed phase.”
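      The counting argument in the quoted sentence can be made concrete with a short calculation. This is a rough sketch under an ideal-solution assumption, not the rG-RPA free energy itself; the function names, the net charge of 13 per chain, and the dense-to-dilute concentration ratio of 20 are hypothetical values used only for illustration.

```python
import math


def counterions_for_neutrality(net_charge, valence):
    """Number of counterions of the given valence that neutralize net_charge."""
    return net_charge / valence


def ideal_partition_cost(n_ions, conc_ratio):
    """Ideal-solution free energy (units of k_B*T) to concentrate n_ions
    by the factor conc_ratio = c_dense / c_dilute."""
    return n_ions * math.log(conc_ratio)


if __name__ == "__main__":
    net_charge = 13      # hypothetical net positive charge per chain
    conc_ratio = 20.0    # hypothetical dense/dilute concentration ratio
    for z in (1, 2):
        n = counterions_for_neutrality(net_charge, z)
        cost = ideal_partition_cost(n, conc_ratio)
        print(f"z_c = {z}: {n:.1f} counterions per chain, cost = {cost:.1f} kT")
```

      For the same concentration ratio, halving the number of counterions halves the ideal entropic cost, which is the sense in which divalent counterions are cheaper to concentrate in the condensed phase.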

      p. 9 “Despite the tendency for polymer field theories to overestimate LLPS propensity and condensed-phase concentrations”: these limitations should be mentioned earlier, along with the very high concentrations (e.g., 1200 mg/ml) in Fig. 2

      This sentence (now on p. 11, lines 11–18) is now modified to clarify the intended meaning as suggested by this reviewer:

      “Despite the tendency for polymer field theories to overestimate LLPS propensity and condensed-phase concentrations quantitatively because they do not account for ion condensation [99] (which can be severe for small ions with more than ±1 charge valencies, as in the case of condensed [Caprin1] ≳ 120 mM in Fig. 2i–l), our present rG-RPA-predicted semi-quantitative trends are consistent with experiments indicating ...”

      In addition, this limitation of polymer field theories is also mentioned earlier in the text on p. 6, lines 30–31.

      Reviewer #2 (Recommendations For The Authors):

      (1) The current version of the paper goes through many different methodologies, but how these methods complement or overlap in terms of their applicability to the problem at hand may not be so clear. This can be especially difficult for readers not well-versed in these methods. I suggest the authors summarize this somewhere in the paper.

      As mentioned above in response to Reviewer #1, we have now added a subsection with heading “Overview of key observations from complementary approaches” at the beginning of the “Results” section on p. 6 (lines 18–37) and the first line of p. 7 to make our paper more accessible to readers who might not be well-versed in the various theoretical and computational techniques. A few sentences to summarize our key results are added as well to the first paragraph of “Discussion” (p. 18, lines 23–26).

      (2) It wasn’t clear if the authors obtained LCST-type behavior in Figure 1d or if another phenomenon is responsible for the non-monotonic change in dense phase concentrations. At the very least, the authors should comment on the possibility of observing LCST behavior using the rG-RPA model and if modifications are needed to make the theory more appropriate for capturing LCST.

      As mentioned above in response to Reviewer #1, the discussion regarding possible LCST-type behavior in Fig. 1d is now expanded to include two possible physical origins: (i) hydrophobicity-like temperature-dependent effective interactions, and (ii) formation of heterogeneous, more open structures in the condensed phase at low temperatures. Three additional references [109, 110, 111] (from the Dill, Chan, and Panagiotopoulos groups, respectively) are now included to support the expanded discussion. Again, the modified discussion is as follows:

      “Interestingly, the decrease in some of the condensed-phase [pY-Caprin1]s with decreasing T (orange and green symbols for ≲ 20°C in Fig. 1d trending toward slightly lower [pY-Caprin1]) may suggest a hydrophobicity-driven lower critical solution temperature (LCST)-like reduction of LLPS propensity as temperature approaches ∼ 0°C as in cold denaturation of globular proteins [7,23] though the hypothetical LCST is below 0°C and therefore not experimentally accessible. If that is the case, the LLPS region would resemble those with both an UCST and a LCST [4]. As far as simple modeling is concerned, such a feature may be captured by a FH model wherein interchain contacts are favored by entropy at intermediate to low temperatures and by enthalpy at high temperatures, thus entailing a heat capacity contribution in χ(T), with [7,109,110] beyond the temperature-independent ϵ<sub>h</sub> and ϵ<sub>s</sub> used in Fig. 1c,d and Fig. 2. Alternatively, a reduction in overall condensed-phase concentration can also be caused by formation of heterogeneous locally organized structures with large voids at low temperatures even when interchain interactions are purely enthalpic (Fig. 4 of ref. [111]).”

      (3) In Figures 4c and 4d, ionic density profiles could be shown as a separate zoomed-in version to make it easier to see the results.

      This is an excellent suggestion. Two such panels are now added to Fig. 4 (p. 40) as parts (g) and (h).

      Reviewer #3 (Recommendations For The Authors):

      I would suggest authors make some minor edits as noted here.

      (1) Please note down the chi values that were used when fitting experimental phase diagrams with rG-RPA theory in Figure 2a,b. At present there aren’t too many such values available in the literature and reporting these would help to get an estimate of effective chi values when electrostatics is appropriately modeled using rG-RPA.

      The χ(T) values and their enthalpic and entropic components ϵh and ϵs used to fit the experimental data in Fig. 1c,d are now stated in the caption for Fig. 1 (p. 37). The same fitted χ(T) values are used in Fig. 2 (p. 38), as is now stated in the revised caption for Fig. 2. Please note that for clarity we have now changed the notation from ∆h and ∆s in our originally submitted manuscript to ϵh and ϵs in the revised text (p. 7, last line) as well as in the revised figure captions to conform to the notation in our previous works [18, 71].

      (2) Authors note “monovalent positive salt ions such as Na+ can be attracted, somewhat counterintuitively, into biomolecular condensates scaffolded by positively-charged polyelectrolytic IDRs in the presence of divalent counterions”. This may be due to the fact that the divalent negative counterions present in the dense phase (as seen in the ternary phase diagrams) also recruit a small amount of Na+.

      The reviewer’s comment is valid, as a physical explanation for this prediction is called for. Accordingly, the following sentence is added to p. 10, lines 27–29:

      “This phenomenon arises because the positively charged monovalent salt ions are attracted to the negatively charged divalent counterions in the protein-condensed phase.”

      (3) In the discussion where authors contrast the LLPS propensity of Caprin1 against FUS, TDP43, Brd4, etc, they correctly note majority of these other proteins have low net charge and possibly higher non-electrostatic interaction that can promote LLPS at room temperature even in the absence of salt. It is also worth noting if some of these proteins were forced to undergo LLPS with crowding which is sometimes typical. A quick literature search will make this clear.

      A careful reading of the work in question (Krainer et al., ref. 50) does not suggest that crowders were used to promote LLPS for the proteins the authors studied. Nonetheless, the reviewer’s point regarding the potential importance of crowder effects is well taken. Accordingly, crowder effects are now mentioned briefly in the Introduction (p. 4, line 13), with three additional references on the impact of crowding on LLPS added [30–32] (from the Spruijt, Mukherjee, and Rakshit groups, respectively). In this connection, to provide a broader historical context to the introductory discussion of electrostatic effects in biomolecular processes in general, two additional influential reviews (from the Honig and Zhou groups, respectively) are now cited as well [15, 16].

    1. Author response:

      The following is the authors’ response to the original reviews.

      Reviewer #1 (Public Review):

      Summary:

      The authors used structural and biophysical methods to provide insight into Parkin regulation. The breadth of data supporting their findings was impressive and generally well-orchestrated. Still, the impact of their results builds on recent structural studies and the stated impact is based on these prior works.

      Strengths:

      (1) After reading through the paper, the major findings are:

      - RING2 and pUbl compete for binding to RING0.

      - Parkin can dimerize.

      - ACT plays an important role in enzyme kinetics.

      (2) The use of molecular scissors in their construct represents a creative approach to examining inter-domain interactions.

      (3) From my assessment, the experiments are well-conceived and executed.

      We thank the reviewer for their positive remark and extremely helpful suggestions.

      Weaknesses:

      The manuscript, as written, is NOT for a general audience. Admittedly, I am not an expert on Parkin structure and function, but I had to do a lot of homework to try to understand the underlying rationale and impact. This reflects, I think, that the work generally represents an incremental advance on recent structural findings.

      To this point, it is hard to understand the impact of this work without more information highlighting the novelty. There are several structures of Parkin in various auto-inhibited states, and it was hard to delineate how this is different.

      For the sake of the general audience, we have included all the details of Parkin structures and conformations seen (Extended Fig. 1). The structures in the present study are to validate the biophysical/biochemical experiments, highlighting key findings. For example, we solved the phospho-Parkin (complex with pUb) structure after treatment with 3C protease (Fig. 2C), which washes off the pUbl-linker, as shown in Fig 2B. The structure of the pUbl-linker depleted phospho-Parkin-pUb complex showed that RING2 returned to the closed state (Fig. 2C), which is confirmation of the SEC assay in Fig. 2B. Similarly, the structure of the pUbl-linker depleted phospho-Parkin R163D/K211N-pUb complex (Fig. 3C), was done to validate the SEC data showing displacement of pUbl-linker is independent of pUbl interaction with the basic patch on RING0 (Fig. 3B). In addition, the latter structure also revealed a new donor ubiquitin binding pocket in the linker (connecting REP and RING2) region of Parkin (Fig. 9). Similarly, trans-complex structure of phospho-Parkin (Fig. 4D) was done to validate the biophysical data (Fig. 4A-C, Fig. 5A-D) showing trans-complex between phospho-Parkin and native Parkin. The latter also confirmed that the trans-complex was mediated by interactions between pUbl and the basic patch on RING0 (Fig. 4D). Furthermore, we noticed that the ACT region was disordered in the trans-complex between phospho-Parkin (1-140 + 141-382 + pUb) (Fig. 8A) which had ACT from the trans molecule, indicating ACT might be present in the cis molecule. The latter was validated from the structure of trans-complex between phospho-Parkin with cis ACT (1-76 + 77-382 + pUb) (Fig. 8C), showing the ordered ACT region. The structural finding was further validated by biochemical assays (Fig. 8 D-F, Extended Data Fig. 9C-E).

      The structure of TEV-treated R0RBR (TEV) (Extended Data Fig. 4C) was done to ensure that the inclusion of TEV and treatment with TEV protease did not perturb Parkin folding, an important control for our biophysical experiments.

      As noted, I appreciated the use of protease sites in the fusion protein construct. It is unclear how the loop region might affect the protein structure and function. The authors worked to demonstrate that this did not introduce artifacts, but the biological context is missing.

      We thank the reviewer for appreciating the use of protease sites in the fusion protein construct. Protease sites were used to overcome the competing mode of binding that makes interactions very transient and beyond the detection limit of methods such as ITC or SEC. While these interactions are quite transient in nature, they could still be useful for the activation of various Parkin isoforms that lack either the Ubl domain or RING2 domain (Extended Data Fig. 6, Fig. 10). Our Parkin localization assays also suggest an important role of these interactions in the recruitment of Parkin molecules to the damaged mitochondria (Fig. 6).

      While it is likely that the binding is competitive between the Ubl and RING2 domains, the data is not quantitative. Is it known whether the folding of the distinct domains is independent? Or are there interactions that alter folding? It seems plausible that conformational rearrangements may invoke an orientation of domains that would be incompatible. The biological context for the importance of this interaction was not clear to me.

      This is a great point. In the revised manuscript, we have included quantitative data between phospho-Parkin and untethered ∆Ubl-Parkin (TEV) (Fig. 5B) showing similar interactions using phospho-Parkin K211N and untethered ∆Ubl-Parkin (TEV) (Fig. 4B). Folding of the Ubl domain or various combinations of RING domains lacking Ubl seems okay. Also, folding of the RING2 domain on its own appears to be fine. However, human Parkin lacking the RING2 domain seems to have some folding issues, mainly due to exposure of the hydrophobic pocket on RING0, as also suggested by previous efforts (Gladkova et al., ref. 24; Sauve et al., ref. 29). The latter could be overcome by co-expression of the RING2-lacking Parkin construct with PINK1 (Sauve et al., ref. 29), as phospho-Ubl binds on the same hydrophobic pocket on RING0 where RING2 binds. A drastic reduction in the melting temperature of phospho-Parkin (Gladkova et al., ref. 24), very likely due to exposure of the hydrophobic surface between RING0 and RING2, correlates with the folding issues of RING0-exposed human Parkin constructs.

      From the biological context, the competing nature between phospho-Ubl and RING2 domains could block the non-specific interaction of phosphorylated-ubiquitin-like proteins (phospho-Ub or phospho-NEDD8) with RING0 (Lenka et al. ref. 33), during Parkin activation. 

      (5) What is the rationale for mutating Lys211 to Asn? Were other mutations tried? Glu? Ala? Just missing the rationale. I think this may have been identified previously in the field, but not clear what this mutation represents biologically.

      Lys211Asn is a Parkinson’s disease mutation; therefore, we decided to use the same mutation for biophysical studies.  

      I was confused about how the phospho-proteins were generated. After looking through the methods, there appear to be phosphorylation experiments, but it is unclear what the efficiency was for each protein (i.e. what % gets modified). In the text, the authors refer to phospho-Parkin (T270R, C431A), but not clear how these mutations might influence this process. I gather that these are catalytically inactive, but it is unclear to me how this is catalyzing the ubiquitination in the assay.

      This is an excellent question. Because different phosphorylation statuses would affect the analysis, we ensured complete phosphorylation status using Phos-Tag SDS-PAGE, as shown below.

      Author response image 1.

      Our biophysical experiments in Fig. 5C show that trans complex formation is mediated by interactions between the basic patch (comprising K161, R163, K211) on RING0 and phospho-Ubl domain in trans. These interactions result in the displacement of RING2 (Fig. 5C). Parkin activation is mediated by displacement of RING2 and exposure of catalytic C431 on RING2. While phospho-Parkin T270R/C431A is catalytically dead, the phospho-Ubl domain of phospho-Parkin T270R/C431A would bind to the basic patch on RING0 of WT-Parkin, resulting in activation of WT-Parkin as shown in Fig. 5E. A schematic figure is shown below to explain the same.

      Author response image 2.

      (7) The authors note that "ACT can be complemented in trans; however, it is more efficient in cis", but it is unclear whether both would be important or if the favored interaction is dominant in a biological context.

      First, this is an excellent question about the biological context of ACT and needs further exploration. Because ACT is flexible, it can be complemented both in cis and in trans. We can only speculate that cis interactions between ACT and RING0 are more relevant in the biological context: during protein synthesis and folding, ACT would be translated before RING2, and thus ACT would occupy the small hydrophobic patch on RING0 in cis. Unpublished data shows the replacement of the ACT region by Biogen compounds to activate Parkin (https://doi.org/10.21203/rs.3.rs-4119143/v1). The latter finding further suggests the flexibility in this region.

      (8) The authors repeatedly note that this study could aid in the development of small-molecule regulators against Parkin to treat PD, but this is a long way off. And it is not clear from their manuscript how this would be achieved. As stated, this is conjecture.

      As suggested by this reviewer, we have removed this point in the revised manuscript.

      Reviewer #2 (Public Review):

      This manuscript uses biochemistry and X-ray crystallography to further probe the molecular mechanism of Parkin regulation and activation. Using a construct that incorporates cleavage sites between different Parkin domains to increase the local concentration of specific domains (i.e., molecular scissors), the authors suggest that competitive binding between the p-Ubl and RING2 domains for the RING0 domain regulates Parkin activity. Further, they demonstrate that this competition can occur in trans, with a p-Ubl domain of one Parkin molecule binding the RING0 domain of a second monomer, thus activating the catalytic RING1 domain. In addition, they suggest that the ACT domain can similarly bind and activate Parkin in trans, albeit at a lower efficiency than that observed for p-Ubl. The authors also suggest from crystal structure analysis and some biochemical experiments that the linker region between RING2 and repressor elements interacts with the donor ubiquitin to enhance Parkin activity.

      Ultimately this manuscript challenges previous work suggesting that the p-Ubl domain does not bind to the Parkin core in the mechanism of Parkin activation. The use of the 'molecular scissors' approach to probe these effects is an interesting approach to probe this type of competitive binding. However, there are issues with the experimental approach manuscript that detract from the overall quality and potential impact of the work.

      We thank the reviewer for their positive remark and constructive suggestions.

      The competitive binding between p-Ubl and RING2 domains for the Parkin core could have been better defined using biophysical and biochemical approaches that explicitly define the relative affinities that dictate these interactions. A better understanding of these affinities could provide more insight into the relative bindings of these domains, especially as it relates to the in trans interactions.

      This is an excellent point regarding the relative affinities of pUbl and RING2 for the Parkin core (lacking Ubl and RING2). While we could purify p-Ubl, we failed to purify human Parkin (lacking RING2 and phospho-Ubl). The latter folding issues were likely due to the exposure of a highly hydrophobic surface on RING0 (as shown below) in the absence of pUbl and RING2 in the R0RB construct. Also, RING2 with an exposed hydrophobic surface would be prone to folding issues, which is not suitable for affinity measurements. A drastic reduction in the melting temperature of phospho-Parkin (Gladkova et al., ref. 24) also highlights the importance of a hydrophobic surface between RING0 and RING2 on Parkin folding/stability. A separate study would be required to try these Parkin constructs from different species and ensure proper folding before using them for affinity measurements.

      Author response image 3.

      I also have concerns about the results of using molecular scissors to 'increase local concentrations' and allow for binding to be observed. These experiments are done primarily using proteolytic cleavage of different domains followed by size exclusion chromatography. ITC experiments suggest that the binding constants for these interactions are in the µM range, although these experiments are problematic as the authors indicate in the text that protein precipitation was observed during these experiments. This type of binding could easily be measured in other assays. My issue relates to the ability of a protein complex (comprising the core and cleaved domains) with a Kd of 1 µM to be maintained in an SEC experiment. The off-rates for these complexes must be exceedingly slow, which doesn't really correspond to the low µM binding constants discussed in the text. How do the authors explain this? What is driving the Koff to levels sufficiently slow to prevent dissociation by SEC? Considering that the authors are challenging previous work describing the lack of binding between the p-Ubl domain and the core, these issues should be better resolved in this current manuscript. Further, it's important to have a more detailed understanding of relative affinities when considering the functional implications of this competition in the context of full-length Parkin. Similar comments could be made about the ACT experiments described in the text.

      This is a great point. In the revised manuscript, we repeated ITC measurements in a different buffer system, which gave nice ITC data. In the revised manuscript, we have also performed ITC measurements using native phospho-Parkin. Phospho-Parkin and untethered ∆Ubl-Parkin (TEV) (Fig. 5B) show similar affinities as seen between phospho-Parkin K211N and untethered ∆Ubl-Parkin (TEV) (Fig. 4B). However, Kd values were consistent in the range of 1.0 ± 0.4 µM which could not address the reviewer’s point regarding slow off-rate. The crystal structure of the trans-complex of phospho-Parkin shows several hydrophobic and ionic interactions between p-Ubl and Parkin core, suggesting a strong interaction and, thus, justifying the co-elution on SEC. Additionally, ITC measurements between E2-Ub and P-Parkin-pUb show similar affinity (Kd = 0.9 ± 0.2 µM) (Kumar et al., 2015, EMBO J.), and yet they co-elute on SEC (Kumar et al., 2015, EMBO J.).
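      The relation between the measured K<sub>d</sub> and the off-rate raised in this exchange can be put in rough numbers. The sketch below is illustrative only: it assumes simple 1:1 binding with k<sub>off</sub> = K<sub>d</sub> × k<sub>on</sub>, and the association rate constants and the 50 µM loading concentrations are assumed values, not measurements from this study.

```python
import math


def dissociation_half_life(kd_molar, kon_per_molar_per_s):
    """Half-life (s) of a 1:1 complex, using k_off = K_d * k_on."""
    koff = kd_molar * kon_per_molar_per_s
    return math.log(2) / koff


def fraction_bound(total_a, total_b, kd):
    """Equilibrium fraction of A bound to B for 1:1 binding (same units)."""
    s = total_a + total_b + kd
    complex_conc = (s - math.sqrt(s * s - 4.0 * total_a * total_b)) / 2.0
    return complex_conc / total_a


if __name__ == "__main__":
    kd = 1.0e-6  # 1 uM, the order of magnitude quoted in the response
    for kon in (1.0e5, 1.0e6, 1.0e7):  # assumed association rate constants
        t_half = dissociation_half_life(kd, kon)
        print(f"k_on = {kon:.0e} /M/s -> half-life = {t_half:.2f} s")
    # Hypothetical loading concentrations of 50 uM for each partner.
    print(f"fraction bound at 50 uM each: {fraction_bound(50e-6, 50e-6, kd):.2f}")
```

      Under these assumptions a K<sub>d</sub> near 1 µM corresponds to complex half-lives of seconds or less, so co-elution on SEC would more plausibly reflect rebinding at concentrations well above K<sub>d</sub> than slow dissociation alone.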

      Ultimately, this work does suggest additional insights into the mechanism of Parkin activation that could contribute to the field. There is a lot of information included in this manuscript, giving it breadth, albeit at the cost of depth for the study of specific interactions. Further, I felt that the authors oversold some of their data in the text, and I'd recommend being a bit more careful when claiming an experiment 'confirms' a specific model. In many cases, there are other models that could explain similar results. For example, in Figure 1C, the authors state that their crystal structure 'confirms' that "RING2 is transiently displaced from the RING0 domain and returns to its original position after washing off the p-Ubl linker". However, it isn't clear to me that RING2 ever dissociated when prepared this way. While there are issues with the work that I feel should be further addressed with additional experiments, there are interesting mechanistic details suggested by this work that could improve our understanding of Parkin activation. However, the full impact of this work won't be fully appreciated until there is a more thorough understanding of the regulation and competitive binding between p-Ubl and RING2 to R0RB both in cis and in trans.

      We thank the reviewer for their positive comment. In the revised manuscript, we have included the reviewer’s suggestion. The conformational changes in phospho-Parkin were established from the SEC assay (Fig. 2A and Fig. 2B), which show displacement/association of phospho-Ubl or RING2 after treatment of phospho-Parkin with 3C and TEV, respectively. For crystallization, we first phosphorylated Parkin, where RING2 is displaced due to phospho-Ubl (as shown in SEC), followed by treatment with 3C protease, which led to pUbl wash-off. The Parkin core separated from phospho-Ubl on SEC was used for crystallization and structure determination in Fig. 2C, where RING2 returned to the RING0 pocket, which confirms SEC data (Fig. 2B).

      Reviewer #3 (Public Review):

      Summary:

      In their manuscript "Additional feedforward mechanism of Parkin activation via binding of phospho-UBL and RING0 in trans", Lenka et al present data that could suggest an "in trans" model of Parkin ubiquitination activity. Parkin is an intensely studied E3 ligase implicated in mitophagy, whereby missense mutations to the PARK2 gene are known to cause autosomal recessive juvenile parkinsonism. From a mechanistic point of view, Parkin is extremely complex. Its activity is tightly controlled by several modes of auto-inhibition that must be released by cues of mitochondrial damage. While the general overview of Parkin activation has been mapped out in recent years, several details have remained murky. In particular, whether Parkin dimerizes as part of its feed-forward signaling mechanism, and whether said dimerization can facilitate ligase activation, has remained unclear. Here, Lenka et al. use various truncation mutants of Parkin in an attempt to understand the likelihood of dimerization (in support of an "in trans" model for catalysis).

      Strengths:

      The results are bolstered by several distinct approaches including analytical SEC with cleavable Parkin constructs, ITC interaction studies, ubiquitination assays, protein crystallography, and cellular localization studies.

      We thank the reviewer for their positive remark.

      Weaknesses:

      As presented, however, the storyline is very confusing to follow and several lines of experimentation felt like distractions from the primary message. Furthermore, many experiments could only indirectly support the authors' conclusions, and therefore the final picture of what new features can be firmly added to the model of Parkin activation and function is unclear.

      We thank the reviewer for their constructive criticism, which has helped us to improve the quality of this manuscript.

      Major concerns:

      (1) This manuscript solves numerous crystal structures of various Parkin components to help support their idea of in trans transfer. As presented, these structures more closely resemble models, and it is unclear from the figures that these are new complexes solved in this work, or what new insights can be gleaned from them.

      The structures in the present study are to validate the biophysical/biochemical experiments highlighting key findings. For example, we solved the phospho-Parkin (complex with pUb) structure after treatment with 3C protease (Fig. 2C), which washes off the pUbl-linker, as shown in Fig. 2B. The structure of pUbl-linker depleted phospho-Parkin-pUb complex showed that RING2 returned to the closed state (Fig. 2C), which is confirmation of the SEC assay in Fig. 2B. Similarly, the structure of the pUbl-linker depleted phospho-Parkin R163D/K211N-pUb complex (Fig. 3C), was done to validate the SEC data showing displacement of pUbl-linker is independent of pUbl interaction with the basic patch on RING0 (Fig. 3B). In addition, the latter structure also revealed a new donor ubiquitin binding pocket in the linker (connecting REP and RING2) region of Parkin (Fig. 9). Similarly, trans-complex structure of phospho-Parkin (Fig. 4D) was done to validate the biophysical data (Fig. 4A-C, Fig. 5A-D) showing trans-complex between phospho-Parkin and native Parkin. The latter also confirmed that the trans-complex was mediated by interactions between pUbl and the basic patch on RING0 (Fig. 4D). Furthermore, we noticed that the ACT region was disordered in the trans-complex between phospho-Parkin (1-140 + 141-382 + pUb) (Fig. 8A) which had ACT from the trans molecule, indicating ACT might be present in the cis molecule. The latter was validated from the structure of trans-complex between phospho-Parkin with cis ACT (1-76 + 77-382 + pUb) (Fig. 8C), showing the ordered ACT region. The structural finding was further validated by biochemical assays (Fig. 8 D-F, Extended Data Fig. 9C-E).

      The structure of TEV-treated R0RBR (TEV) (Extended Data Fig. 4C) was done to ensure that the inclusion of TEV and treatment with TEV protease did not perturb Parkin folding, an important control for our biophysical experiments.

      (2) There are no experiments that definitively show the in trans activation of Parkin. The binding experiments and size exclusion chromatography are a good start, but the way these experiments are performed, they'd be better suited as support for a stronger experiment showing Parkin dimerization. In addition, the rationale for an in trans activation model is not convincingly explained until the concept of Parkin isoforms is introduced in the Discussion. The authors should consider expanding this concept into other parts of the manuscript.

      We thank the reviewer for appreciating the Parkin dimerization. Our biophysical data in Fig. 5C show that Parkin dimerization is mediated by interactions between phospho-Ubl and RING0 in trans, leading to the displacement of RING2. However, the Parkin K211N (on RING0) mutation perturbs interaction with phospho-Parkin and leads to loss of Parkin dimerization and loss of RING2 displacement (Fig. 5C). The interaction between pUbl and the K211 pocket on RING0 leads to the displacement of RING2, resulting in Parkin activation as catalytic residue C431 on RING2 is exposed for catalysis. The biophysical experiment is further confirmed by a biochemical experiment where the addition of catalytically inactive phospho-Parkin T270R/C431A activates autoinhibited WT-Parkin in trans using the mechanism as discussed (a schematic representation is also shown in Author response image 2).

      We thank the reviewer for the suggestion regarding Parkin isoforms. In the revised manuscript, we have included Parkin isoforms in the results section, too.

      (2a) For the in trans activation experiment using wt Parkin and pParkin (T270R/C431A) (Figure 3D), there needs to be a large excess of pParkin to stimulate the catalytic activity of wt Parkin. This experiment has low cellular relevance as these point mutations are unlikely to occur together to create this nonfunctional pParkin protein. In the case of pParkin activating wt Parkin (regardless of artificial point mutations inserted to study specifically the in trans activation), if there needs to be much more pParkin around to fully activate wt Parkin, isn't it just more likely that the pParkin would activate in cis?

      To test phospho-Parkin as an activator of Parkin in trans, we wanted to use the catalytically inactive version of phospho-Parkin to avoid the background activity of p-Parkin. While it is true that a large excess of pParkin (T270R/C431A) is required to activate WT-Parkin in the in vitro set-up, it is not very surprising as in WT-Parkin, the unphosphorylated Ubl domain would block the E2 binding site on RING1. Also, due to interactions between pParkin (T270R/C431A) molecules, the net concentration of pParkin (T270R/C431A) as an activator would be much lower. However, the Ubl blocking E2 binding site on RING1 won’t be an issue between phospho-Parkin molecules or between Parkin isoforms (lacking Ubl domain or RING2).

      (2ai) Another underlying issue with this experiment is that the authors do not consider the possibility that the increased activity observed is a result of increased "substrate" for auto-ubiquitination, as opposed to any role in catalytic activation. Have the authors considered looking at Miro as a substrate in order to control for this?

      This is quite an interesting point. However, this will be only possible if Parkin is ubiquitinated in trans, as auto-ubiquitination is possible with active Parkin and not with catalytically dead (phospho-Parkin T270R, C431A) or autoinhibited (WT-Parkin). Also, in the previous version of the manuscript, where we used only phospho-Ubl as an activator of Parkin in trans, we tested Miro1 ubiquitination and auto-ubiquitination, and the results were the same (Author response image 4).

      Author response image 4.

      (2b) The authors mention a "higher net concentration" of the "fused domains" with RING0, and use this to justify artificially cleaving the Ubl or RING2 domains from the Parkin core. This fact should be moot. In cells, it is expected there will only be a 1:1 ratio of the Parkin core with the Ubl or RING2 domains. To date, there is no evidence suggesting multiple pUbls or multiple RING2s can bind the RING0 binding site. In fact, the authors here even show that either the RING2 or pUbl needs to be displaced to permit the binding of the other domain. That being said, there would be no "higher net concentration" because there would always be the same molar equivalents of Ubl, RING2, and the Parkin core.

      We apologize for the confusion. “Higher net concentration” is with respect to fused domains versus the domain provided in trans. Due to the competing nature of the interactions between pUbl/RING2 and RING0, the interactions are too transient and beyond the detection limit of the biophysical techniques. While the domains are fused (for example, RING0-RING2 in the same polypeptide) in a polypeptide, their effective concentrations are much higher than those (for example, pUbl) provided in trans; thus, biophysical methods fail to detect the interaction. Treatment with protease solves the above issue due to the higher net concentration of the fused domain, and trans interactions can be measured using biophysical techniques. However, the nature of these interactions and conformational changes is very transient, which is also suggested by the data. Therefore, Parkin molecules will never remain associated; rather, Parkin will transiently interact and activate Parkin molecules in trans.

      (2c) A larger issue remaining in terms of Parkin activation is the lack of clarity surrounding the role of the linker (77-140); particularly whether its primary role is to tether the Ubl to the cis Parkin molecule versus a role in permitting distal interactions to a trans molecule. The way the authors have conducted the experiments presented in Figure 2 limits the possible interactions that the activated pUbl could have by (a) ablating the binding site in the cis molecule with the K211N mutation; (b) further blocking the binding site in the cis molecule by keeping the RING2 domain intact. These restrictions to the cis parkin molecule effectively force the pUbl to bind in trans. A competition experiment to demonstrate the likelihood of cis or trans activation in direct comparison with each other would provide stronger evidence for trans activation.

      This is an excellent point. In the revised manuscript, we have performed experiments using native phospho-Parkin (Revised Figure 5), and the results are consistent with those in Figure 2 (Revised Figure 4), where we used the K211N mutation.

      (3) A major limitation of this study is that the authors interpret structural flexibility from experiments that do not report directly on flexibility. The analytical SEC experiments report on binding affinity and more specifically off-rates. By removing the interdomain linkages, the accompanying on-rate would be drastically impacted, and thus the observations are disconnected from a native scenario. Likewise, observations from protein crystallography can be consistent with flexibility, but certainly should not be directly interpreted in this manner. Rigorous determination of linker and/or domain flexibility would require alternative methods that measure this directly.

      We agree with the reviewer that these methods do not directly capture structural flexibility, and that rigorous determination of linker flexibility would require alternative methods that measure this directly. However, due to the complex nature of interactions and technical limitations, breaking the interdomain linkages was the best possible way to capture interactions in trans. Interestingly, all previous methods that report cis interactions between pUbl and RING0 also used a similar approach (Gladkova et al., ref. 24; Sauve et al., ref. 29).

      (4) The analysis of the ACT element comes across as incomplete. The authors make a point of a competing interaction with Lys48 of the Ubl domain, but the significance of this is unclear. It is possible that this observation could be an overinterpretation of the crystal structures. Additionally, the rationale for why the ACT element should or shouldn't contribute to in trans activation of different Parkin constructs is not clear. Lastly, the conclusion that this work explains the evolutionary nature of this element in chordates is highly overstated.

      We agree with the reviewer that the significance of Lys48 is unclear. We have presented this just as one of the observations from the crystal structure. As the reviewer suggested, we have removed the sentence about the evolutionary nature of this element from the revised manuscript.

      (5) The analysis of the REP linker element also seems incomplete. The authors identify contacts to a neighboring pUb molecule in their crystal structure, but the connection between this interface (which could be a crystallization artifact) and their biochemical activity data is not straightforward. The analysis of flexibility within this region using crystallographic and AlphaFold modeling observations is very indirect. The authors also draw parallels with linker regions in other RBR ligases that are involved in recognizing the E2-loaded Ub. Firstly, it is not clear from the text or figures whether the "conserved" hydrophobic within the linker region is involved in these alternative Ub interfaces. And secondly, the authors appear to jump to the conclusion that the Parkin linker region also binds an E2-loaded Ub, even though their original observation from the crystal structure seems inconsistent with this. The entire analysis feels very preliminary and also comes across as tangential to the primary storyline of in trans Parkin activation.

      We agree with the reviewer that crystal structure data and biochemical data are not directly linked. In the revised manuscript, we have also highlighted the conserved hydrophobic in the linker region at the ubiquitin interface (Fig. 9C and Extended Data Fig. 11A), which was somehow missed in the original manuscript. We want to add that a very similar analysis and supporting experiments identified donor ubiquitin-binding sites on the IBR and helix connecting RING1-IBR (Kumar et al., Nature Str. and Mol. Biol., 2017), which several other groups later confirmed. In the mentioned study, the Ubl domain of Parkin from the symmetry mate Parkin molecule was identified as a mimic of “donor ubiquitin” on IBR and helix connecting RING1-IBR.

      In the present study, a neighboring pUb molecule in the crystal structure is identified as a donor ubiquitin mimic (Fig. 9C) by supporting biophysical/biochemical experiments. First, we show that mutation of I411A in the REP linker of Parkin perturbs Parkin interaction with E2~Ub (donor) (Fig. 9F). Another supporting experiment was performed using a Ubiquitin-VS probe assay, which is independent of E2. Assays using Ubiquitin-VS show that I411A mutation in the REP-RING2 linker perturbs Parkin charging with Ubiquitin-VS (Extended Data Fig. 11 B). Furthermore, the biophysical data showing loss of Parkin interaction with donor ubiquitin is further supported by ubiquitination assays. Mutations in the REP-RING2 linker perturb the Parkin activity (Fig. 9E), confirming biophysical data. This is further confirmed by mutations (L71A or L73A) on ubiquitin (Extended Data Fig. 11C), resulting in loss of Parkin activity. The above experiments nicely establish the role of the REP-RING2 linker in interaction with donor ubiquitin, which is consistent with other RBRs (Extended Data Fig. 11A).

      While we agree with the reviewer that this appears tangential to the primary storyline in trans-Parkin activation, we decided to include this data because it could be of interest to the field.

      Recommendations for the authors:

      Reviewer #1 (Recommendations For The Authors):

      (1) For clarity, a schematic of the domain architecture of Parkin would be helpful at the outset in the main figures. This will help with the introduction to better understand the protein organization. This is lost in the Extended Figure in my opinion.

      We thank the reviewer for suggesting this, which we have included in Figure 1 of the revised manuscript.

      (2) Related to the competition between the Ubl and RING2 domains, can competition be shown through another method? SPR, ITC, etc? ITC was used in other experiments, but only in the context of mutations (Lys211Asn)? Can this be done with WT sequence?

      This is an excellent suggestion. In the revised Figure 5, we have performed an ITC experiment using WT Parkin, and the results are consistent with what we observed using Lys211Asn Parkin.

      (3) The authors also note that "the AlphaFold model shows a helical structure in the linker region of Parkin (Extended Data Figure 10C), further confirming the flexible nature of this region"... but the secondary structure would not be inherently flexible. This is confusing.

      The flexibility is in terms of the conformation of this linker region observed under the open or closed state of Parkin. In the revised manuscript, we have explained this point more clearly.

      (4) The manuscript needs extensive revision to improve its readability. Minor grammatical mistakes were prevalent throughout.

      We thank the reviewer for pointing this out; we have corrected these mistakes in the revised manuscript.

      (5) The confocal images are nice, but inset panels may help highlight the regions of interest (ROIs).

      This is corrected in the revised manuscript.

      (6) Trans is misspelled ("tans") towards the end of the second paragraph on page 16.

      This is corrected in the revised manuscript.

      (7) The schematics are helpful, but some of the lettering in Figure 2 is very small.

      This is corrected in the revised manuscript.

      Reviewer #3 (Recommendations For The Authors):

      (1) A significant portion of the results section refers to the supplement, making the overall readability very difficult.

      We acknowledge this issue: a lot of relevant data could not be added to the main figures and thus ended up in the supplement. In the revised manuscript, we have moved some of the supplementary figures to the main figures.

      (2) Interpretation of the experiments utilizing many different Parkin constructs and cleavage scenarios (particularly the SEC and crystallography experiments) is extremely difficult. The work would benefit from a layout of the Parkin model system, highlighting cleavage sites, key domain terminology, and mutations used in the study, presented together and early on in the manuscript. Using this to identify a simpler system of referencing Parkin constructs would also be a large improvement.

      This is a great suggestion. We have included these points in the revised manuscript, which has improved the readability.

      (3) Lines 81-83; the authors say they "demonstrate the conformational changes in Parkin during the activation process", but fail to show any actual conformational changes. Further, much of what is demonstrated in this work (in terms of crystal structures) corroborates existing literature. The authors should use caution not to overstate their original conclusions in light of the large body of work in this area.

      We thank the reviewer for pointing this out. We have corrected the above statement in the revised manuscript to indicate that we meant it in the context of trans conformational changes.

      (4) Line 446 and 434; there is a discrepancy about which amino acid is present at residue 409. Is this a K408 typo? The authors also present mutational work on K416, but this residue is not shown in the structure panel.

      We thank the reviewer for pointing this out. In the revised manuscript, we have corrected these typos.

    1. Author response:

      The following is the authors’ response to the current reviews.

      Reviewer 1 (Public Review):

      I want to reiterate my comment from the first round of reviews: that I am insufficiently familiar with the intricacies of Maxwell’s equations to assess the validity of the assumptions and the equations being used by WETCOW. The work ideally needs assessing by someone more versed in that area, especially given the potential impact of this method if valid.

      We appreciate the reviewer’s candor. Unfortunately, familiarity with Maxwell’s equations is an essential prerequisite for assessing the veracity of our approach and our claims.

      Effort has been made in these revisions to improve explanations of the proposed approach (a lot of new text has been added) and to add new simulations. However, the authors have still not compared their method on real data with existing standard approaches for reconstructing data from sensor to physical space. Refusing to do so because existing approaches are deemed inappropriate (i.e. they “are solving a different problem”) is illogical.

      Without understanding the importance of our model for brain wave activity (cited in the paper) derived from Maxwell’s equations in inhomogeneous and anisotropic brain tissue, it is not possible to critically evaluate the fundamental difference between our method and the standard so-called “source localization” method which the Reviewer feels it is important to compare our results with. Our method is not “source localization” which is a class of techniques based on an inappropriate model for static brain activity (static dipoles sprinkled sparsely in user-defined areas of interest). Just because a method is “standard” does not make it correct. Rather, we are reconstructing a whole brain, time dependent electric field potential based upon a model for brain wave activity derived from first principles. It is comparing two methods that are “solving different problems” that is, by definition, illogical.

      Similarly, refusing to compare their method with existing standard approaches for spatio-temporally describing brain activity, just because existing approaches are deemed inappropriate, is illogical.

      Contrary to the Reviewer’s assertion, we do compare our results with three existing methods for describing spatiotemporal variations of brain activity.

      First, Figures 1, 2, and 6 compare the spatiotemporal variations in brain activity between our method and fMRI, the recognized standard for spatiotemporal localization of brain activity. The statistical comparison in Fig 3 is a quantitative demonstration of the similarity of the activation patterns. It is important to note that these data are simultaneous EEG/fMRI in order to eliminate a variety of potential confounds related to differences in experimental conditions.

      Second, Fig 4 (A-D) compares our method with the most reasonable “standard” spatiotemporal localization method for EEG: mapping of fields in the outer cortical regions of the brain detected at the surface electrodes to the surface of the skull. The consistency of both the location and sign of the activity changes detected by both methods in a “standard” attention paradigm is clearly evident. Further confirmation is provided by comparison of our results with simultaneous EEG/fMRI spatial reconstructions (E-F) where the consistency of our reconstructions between subjects is shown in Fig 5.

      Third, measurements from intra-cranial electrodes, the most direct method for validation, are compared with spatiotemporal estimates derived from surface electrodes and shown to be highly correlated.

      For example, the authors say that “it’s not even clear what one would compare [between the new method and standard approaches]”. How about:

      (1) Qualitatively: compare EEG activation maps. I.e. compare what you would report to a researcher about the brain activity found in a standard experimental task dataset (e.g. their gambling task). People simply want to be able to judge, at least qualitatively on the same data, what the most equivalent output would be from the two approaches. Note, both approaches do not need to be done at the same spatial resolution if there are constraints on this for the comparison to be useful.

      (2) Quantitatively: compare the correlation scores between EEG activation maps and fMRI activation maps

      These comparisons were performed and are already in the paper.

      (1) Fig 4 compares the results with a standard attention paradigm (data and interpretation from Co-author Dr Martinez, who is an expert in both EEG and attention). Additionally, Fig 12 shows detected regions of increased activity in a well-known brain circuit from an experimental task (’reward’) with data provided by Co-author Dr Krigolson, an expert in reward circuitry.

      (2) Correlation scores between EEG and fMRI are shown in Fig 3.

      (3) Very high correlation between the directly measured field from intra-cranial electrodes in an epilepsy patient and those estimated from only the surface electrodes is shown in Fig 9.

      There are an awful lot of typos in the new text in the paper. I would expect a paper to have been proof read before submitting.

      We have cleaned up the typos.

      The abstract claims that there is a “direct comparison with standard state-of-the-art EEG analysis in a well-established attention paradigm”, but no actual comparison appears to have been completed in the paper.

      On the contrary, as mentioned above, Fig 4 compares the results of our method with the state-of-the-art surface spatial mapping analysis, with the state-of-the-art time-frequency analysis, and with the state-of-the-art fMRI analysis.

      Reviewer 2 (Public Review):

      This is a major rewrite of the paper. The authors have improved the discourse vastly.

      There is now a lot of didactics included but they are not always relevant to the paper.

      The technique described in the paper does in fact leverage several novel methods we have developed over the years for analyzing multimodal space-time imaging data. Each of these techniques has been described in detail in separate publications cited in the current paper. However, the Reviewers’ criticisms stated that the methods were non-standard and they were unfamiliar with them. In lieu of the Reviewers’ reading the original publications, we added a significant amount of text indeed intended to be didactic. However, we can assure the Reviewer that nothing presented was irrelevant to the paper. We certainly had no desire to make the paper any longer than it needed to be.

      The section on Maxwell’s equation does a disservice to the literature in prior work in bioelectromagnetism and does not even address the issues raised in classic text books by Plonsey et al. There is no logical “backwardness” in the literature. They are based on the relative values of constants in biological tissues.

      This criticism highlights the crux of our paper. Contrary to the assertion that we have ignored the work of Plonsey, we have referenced it in the new additional text detailing how we have constructed Maxwell’s Equations appropriate for brain tissue, based on the model suggested by Plonsey that allows the magnetic field temporal variations to be ignored but not the time-dependent electric fields.

      However, the assumption ubiquitous in the vast prior literature of bioelectricity in the brain that the electric field dynamics can be “based on the relative values of constants in biological tissues”, as the Reviewer correctly summarizes, is precisely the problem. Using relative average tissue properties does not take into account the tissue anisotropy necessary to properly account for correct expressions for the electric fields. As our prior publications have demonstrated in detail, taking into account the inhomogeneity and anisotropy of brain tissue in the solution to Maxwell’s Equations is necessary for properly characterizing brain electrical fields, and serves as the foundation of our brain wave theory. This led to the discovery of a new class of brain waves (weakly evanescent transverse cortical waves, WETCOW).

      It is this brain wave model that is used to estimate the dynamic electric field potential from the measurements made by the EEG electrode array. The standard model that ignores these tissue details leads to the ubiquitous “quasi-static approximation” that leads to the conclusion that the EEG signal cannot be spatially reconstructed. It is indeed this critical gap in the existing literature that is the central new idea in the paper.

      There are reinventions of many standard ideas in terms of physics discourses, like Bayesian theory or PCA etc.

      The discussion of Bayesian theory and PCA is in response to the Reviewer complaint that they were unfamiliar with our entropy field decomposition (EFD) method and the request that we compare it with other “standard” methods. Again, we have published extensively on this method (as referenced in the manuscript) and therefore felt that extensive elaboration was unnecessary. Having been asked to provide such elaboration and then being pilloried for it therefore feels somewhat inappropriate in our view. This is particularly disappointing as the Reviewer claims we are presenting “standard” ideas when in fact the EFD is a new general framework we developed to overcome the deficiencies in standard “statistical” and probabilistic data analysis methods that are insufficient for characterizing non-linear, nonperiodic, interacting fields that are the rule, rather than the exception, in complex dynamical systems, such as brain electric fields (or weather, or oceans, or ....).

      The EFD is indeed a Bayesian framework, as this is the fundamental starting point for probability theory, but it is developed in a unique and more general fashion than previous data analysis methods. (Again, this is detailed in several references in the paper’s bibliography. The Reviewer requested that an explanation be included in the present paper, however, so we did so). First, Bayes Theorem is expressed in terms of a field theory that allows an arbitrary number of field orders and coupling terms. This generality comes with a penalty, which is that it’s unclear how to assess the significance of the essentially infinite number of terms. The second feature is the introduction of a method by which to determine the significant number of terms automatically from the data itself, via our theory of entropy spectrum pathways (ESP), which is also detailed in a cited publication, and which produces ranked spatiotemporal modes from the data. Rather than being “reinventions of many standard ideas” these are novel theoretical and computational methods that are central to the EEG reconstruction method presented in the paper.

      I think that the paper remains quite opaque and many of the original criticisms remain, especially as they relate to multimodal datasets. The overall algorithm still remains poorly described.

      It’s not clear how to assess the criticisms that the algorithm is poorly described yet there is too much detail provided that is mistakenly assessed as “standard”. Certainly the central wave equations that are estimated from the data are precisely described, so it’s not clear exactly what the Reviewer is referring to.

      The comparisons to benchmark remain unaddressed and the authors state that they couldn’t get Loreta to work and so aborted that. The figures are largely unaltered, although they have added a few more, and do not clearly depict the ideas. Again, no benchmark comparisons are provided to evaluate the results and the performance in comparison to other benchmarks.

      As we have tried to emphasize in the paper, and in the Response to Reviewers, the standard so-called “source localization” methods are NOT a benchmark, as they are solving an inappropriate model for brain activity. Once again, static dipole “sources” arbitrarily sprinkled on pre-defined regions of interest bear little resemblance to observed brain waves, nor to the dynamic electric field wave equations produced by our brain wave theory derived from a proper solution to Maxwell’s equations in the anisotropic and inhomogeneous complex morphology of the brain.

      The comparison with Loreta was not abandoned because we couldn’t get it to work, but because we could not get it to run under conditions that were remotely similar to whole brain activity described by our theory, or, more importantly, by a rational theory of dynamic brain activity that might reproduce the exceedingly complex electric field activity observed in numerous neuroscience experiments.

      We take issue with the rather dismissive mention of “a few more” figures that “do not clearly depict the idea” when in fact the figures that have been added have demonstrated additional quantitative validation of the method.


      The following is the authors’ response to the original reviews.

      Public Reviews:

      Reviewer 1 (Public Review):

      The paper proposes a new source reconstruction method for electroencephalography (EEG) data and claims that it can provide far superior spatial resolution than existing approaches and also superior spatial resolution to fMRI. This primarily stems from abandoning the established quasi-static approximation to Maxwell’s equations.

      The proposed method brings together some very interesting ideas, and the potential impact is high. However, the work does not provide the evaluations expected when validating a new source reconstruction approach. I cannot judge the success or impact of the approach based on the current set of results. This is very important to rectify, especially given that the work is challenging some long-standing and fundamental assumptions made in the field.

      We appreciate the Reviewer’s efforts in reviewing this paper and have included a significant amount of new text to address their concerns.

      I also find that the clarity of the description of the methods, and how they link to what is shown in the main results hard to follow.

      We have added significantly more detail on the methods, including more accessible explanations of the technical details, and schematic diagrams to visualize the key processing components.

      I am insufficiently familiar with the intricacies of Maxwell’s equations to assess the validity of the assumptions and the equations being used by WETCOW. The work therefore needs assessing by someone more versed in that area. That said, how do we know that the new terms in Maxwell’s equations, i.e. the time-dependent terms that are normally missing from established quasi-static-based approaches, are large enough to need to be considered? Where is the evidence for this?

      The fact that the time-dependent terms are large enough to be considered is essentially the entire focus of the original papers [7,8]. Time-dependent terms in Maxwell’s equations are generally not important for brain electrodynamics at physiological frequencies for homogeneous tissues, but this is not true for areas with strong inhomogeneity and anisotropy.

      I have not come across EFD, and I am not sure many in the EEG field will have. To require the reader to appreciate the contributions of WETCOW only through the lens of the unfamiliar (and far from trivial) approach of EFD is frustrating. In particular, what impact do the assumptions of WETCOW make compared to the assumptions of EFD on the overall performance of SPECTRE?

      We have added an entire new section in the Appendix that provides a very basic introduction to EFD and relates it to more commonly known methods, such as Fourier and Independent Components Analyses.

      The paper needs to provide results showing the improvements obtained when WETCOW or EFD are combined with more established and familiar approaches. For example, EFD can be replaced by a first-order vector autoregressive (VAR) model, i.e. y<sub>t</sub> = Ay<sub>t−1</sub> + e<sub>t</sub> (where y<sub>t</sub> is [num<sub>gridpoints</sub> ∗ 1] and A is [num<sub>gridpoints</sub> ∗ num<sub>gridpoints</sub>] of autoregressive parameters).
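
      For readers unfamiliar with the first-order VAR model named in this comment, the following is a minimal sketch of how the matrix A in y<sub>t</sub> = Ay<sub>t−1</sub> + e<sub>t</sub> could be estimated by ordinary least squares. It is an illustration only and is not part of SPECTRE or of any analysis in the paper; the function names `fit_var1` and `simulate_var1`, the two-dimensional toy system, and the noise level are all assumptions made here.

```python
import numpy as np


def simulate_var1(A, n_steps, noise_std=0.1, seed=0):
    """Simulate y_t = A @ y_{t-1} + e_t with Gaussian noise (toy data only)."""
    rng = np.random.default_rng(seed)
    n = A.shape[0]
    y = np.zeros((n_steps, n))
    for t in range(1, n_steps):
        y[t] = A @ y[t - 1] + noise_std * rng.standard_normal(n)
    return y


def fit_var1(y):
    """Least-squares estimate of A from a time series y of shape (T, n)."""
    past, present = y[:-1], y[1:]
    # Row form of the model: present ≈ past @ A.T, so lstsq returns A.T.
    coeffs, *_ = np.linalg.lstsq(past, present, rcond=None)
    return coeffs.T


# Hypothetical 2-variable system; in the reviewer's setting n would be
# the number of grid points.
A_true = np.array([[0.5, 0.1],
                   [-0.2, 0.3]])
y = simulate_var1(A_true, n_steps=5000)
A_hat = fit_var1(y)
```

      Note that A has one entry per pair of grid points, so a dense fit of this kind grows quadratically with the size of the spatial grid.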

      The development of EFD, which is independent of WETCOW, stemmed from the necessity of developing a general method for the probabilistic analysis of finitely sampled non-linear interacting fields, which are ubiquitous in measurements of physical systems, of which functional neuroimaging data (fMRI, EEG) are excellent examples. Standard methods (such as VAR) are inadequate in such cases, as discussed in great detail in our EFD publications (e.g., [12,37]). The new appendix on EFD reviews these arguments. It does not make sense to compare EFD with methods which are inappropriate for the data.

      The authors’ decision not to include any comparisons with established source reconstruction approaches does not make sense to me. They attempt to justify this by saying that the spatial resolution of LORETA would need to be very low compared to the resolution being used in SPECTRE, to avoid compute problems. But how does this stop them from using a spatial resolution typically used by the field that has no compute problems, and comparing with that? This would be very informative. There are also more computationally efficient methods than LORETA that are very popular, such as beamforming or minimum norm.

      The primary reason for not comparing with ’source reconstruction’ (SR) methods is that we are not doing source reconstruction. Our view of brain activity is that it involves continuous dynamical non-linear interacting fields throughout the entire brain. Formulating EEG analysis in terms of reconstructing sources is, in our view, like asking ’what are the point sources of a sea of ocean waves’. It’s just not an appropriate physical model. A pre-chosen limited distribution of static dipoles is just a very bad model for brain activity, so much so that it’s not even clear what one would compare. In our view, as manifest in our computational implementation, one needs to have a very high density of computational locations throughout the entire brain, including white matter, and the reconstructed modes are waves whose extent can be across the entire brain. Our comments about the low resolution of computational methods for SR techniques are really expressing the more overarching concern that they are not capable of, or even designed for, detecting time-dependent fields of non-linear interacting waves that exist everywhere throughout the brain. Moreover, the SR methods always give some answer, but in our view the initial conditions upon which those methods are based (pre-selected regions of activity with a pre-selected number of ’sources’) are a highly influential but artificial set of strong computational constraints that will almost always provide an answer consistent with (i.e., biased toward) the expectations of the person formulating the problem, and is therefore potentially misleading.

      In short, something like the following methods needs to be compared:

      (1) Full SPECTRE (EFD plus WETCOW)

      (2) WETCOW + VAR or standard (“simple regression”) techniques

      (3) Beamformer/min norm plus EFD

      (4) Beamformer/min norm plus VAR or standard (“simple regression”) techniques

      The reason that no one has previously been able to solve the EEG inverse problem is the ubiquitous use of methods that are too ’simple’, i.e., that are poor physical models of brain activity. We have spent a decade carefully elucidating the details of this statement in numerous highly technical and careful publications. It therefore serves no purpose to return to the use of these ’simple’ methods for comparison. We do agree, however, that a clearer overview of the advantages of our methods is warranted and have added significant additional text in this revision towards that purpose.

      This would also allow for more illuminating and quantitative comparisons of the real data. For example, a metric of similarity between EEG maps and fMRI can be computed to compare the performance of these methods. At the moment, the fMRI-EEG analysis amounts to just showing fairly similar maps.

      We disagree with this assessment. The correlation coefficient between the spatially localized activation maps is a conservative sufficient statistic for the measure of statistically significant similarity. These numbers were/are reported in the caption to Figure 5, and have now also been moved to, and highlighted in, the main text.
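
      As a point of reference for the similarity measure discussed in this exchange, a Pearson correlation coefficient between two activation maps on a common voxel grid can be sketched as follows. This is a generic illustration under stated assumptions, not the authors’ pipeline: the function name `map_correlation`, the optional `mask` argument (standing in for a region of interest such as an atlas subregion), and the synthetic maps are all hypothetical.

```python
import numpy as np


def map_correlation(map_a, map_b, mask=None):
    """Pearson correlation between two maps on the same grid.

    mask: optional boolean array selecting the voxels to compare
    (hypothetical stand-in for an atlas region).
    """
    a = np.asarray(map_a, dtype=float)
    b = np.asarray(map_b, dtype=float)
    if a.shape != b.shape:
        raise ValueError("maps must be sampled on the same grid")
    if mask is None:
        mask = np.ones(a.shape, dtype=bool)
    mask = np.asarray(mask, dtype=bool)
    # Flatten the selected voxels and correlate them.
    return float(np.corrcoef(a[mask], b[mask])[0, 1])


# Synthetic toy maps: the second is an exact linear function of the first,
# so the correlation is 1 up to floating-point error.
rng = np.random.default_rng(0)
eeg_map = rng.standard_normal((8, 8, 8))
fmri_map = 2.0 * eeg_map + 1.0
r = map_correlation(eeg_map, fmri_map)
```

      Because the coefficient is invariant to scaling and offset, it compares spatial patterns rather than absolute amplitudes, which matters when the two modalities measure different physical quantities.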

      There are no results provided on simulated data. Simulations are needed to provide quantitative comparisons of the different methods, to show face validity, and to demonstrate unequivocally the new information that SPECTRE can ’potentially’ provide on real data compared to established methods. The paper ideally needs at least 3 types of simulations, where one thing is changed at a time, e.g.:

      (1) Data simulated using WETCOW plus EFD assumptions

      (2) Data simulated using WETCOW plus e.g. VAR assumptions

      (3) Data simulated using standard lead fields (based on the quasi-static Maxwell solutions) plus e.g. VAR assumptions

      These should be assessed with the multiple methods specified earlier. Crucially the assessment should be quantitative showing the ability to recover the ground truth over multiple realisations of realistic noise. This type of assessment of a new source reconstruction method is the expected standard.

      We have now provided results on simulated data, along with a discussion on what entails a meaningful simulation comparison. In short, our original paper on the WETCOW theory included a significant number of simulations of predicted results on several spatial and temporal scales. The most relevant simulation data to compare with the SPECTRE imaging results are the cortical wave loop predicted by WETCOW theory and demonstrated via numerical simulation in a realistic brain model derived from high resolution anatomical (HRA) MRI data. The most relevant data with which to compare these simulations are the SPECTRE reconstruction from the data that provides the closest approximation to a “Gold Standard” - reconstructions from intra-cranial EEG (iEEG). We have now included results (new Fig 8) that demonstrate the ability of SPECTRE to reconstruct dynamically evolving cortical wave loops in iEEG data acquired in an epilepsy patient that match the loop predicted theoretically by WETCOW and demonstrated in realistic numerical simulations.

      The suggested comparison with simple regression techniques serves no purpose, as stated above, since that class of analysis techniques was not designed for non-linear, non-Gaussian, coupled interacting fields predicted by the WETCOW model. The explication of this statement is provided in great detail in our publications on the EFD approach and in the new appendix material provided in this revision. The suggested simulation of the dipole (i.e., quasi-static) model of brain activity also serves no purpose, as our WETCOW papers have demonstrated in great detail that it is not a reasonable model for dynamic brain activity.

      Reviewer 2 (Public Review):

      Strengths:

      If true and convincing, the proposed theoretical framework and reconstruction algorithm can revolutionize the use of EEG source reconstructions.

      Weaknesses:

      There is very little actual information in the paper about either the forward model or the novel method of reconstruction. Only prior work by the authors is cited, with absolutely no benchmark comparisons, making the manuscript difficult to read and interpret in isolation from their prior body of work.

      We have now added a significant amount of material detailing the forward model, our solution to the inverse problem, and the method of reconstruction, in order to remedy this deficit in the previous version of the paper.

      Recommendations for the authors:

      Reviewer 1 (Recommendations):

      It is not at all clear from the main text (section 3.1) and the caption, what is being shown in the activity patterns in Figures 1 and 2. What frequency bands and time points etc? How are the values shown in the figures calculated from the equations in the methods?

      We have added detailed information on the frequency bands reconstructed and the activity pattern generation and meaning. Additional information on the simultaneous EEG/fMRI acquisition details has been added to the Appendix.

      How have the activity maps been thresholded? Where are the color bars in Figures 1 and 2?

      We have now included that information in new versions of the figures. In addition, the quantitative comparison between fMRI and EEG is now presented in a new Figure 2 (now Figure 3).

      P30 “This term is ignored in the current paper”. Why is this term ignored, but other (time-dependent) terms are not?

      These terms are ignored because they represent higher order terms that complicate the processing (and interpretation) but do not substantially change the main results. A note to this effect has been added to the text.

      The concepts and equations in the EFD section are not very accessible (e.g. to someone unfamiliar with IFT).

      We have added a lengthy general and more accessible description of the EFD method in the Appendix.

      Variables in equation 1, and the following equation, are not always defined in a clear, accessible manner. What is ?

      We have added additional information on how Eqn 1 (now Eqn 3) is derived, and the variables therein.

      In the EFD section, what do you mean conceptually by α, i.e. “the coupled parameters α”?

      This sentence has been eliminated, as it was superfluous and confusing.

      How are the EFD and WETCOW sections linked mathematically? What is ψ (in eqn 2) linked to in the WETCOW section (presumably ϕ<sub>ω</sub>?) ?

      We have added more introductory detail at the beginning of the Results to describe the WETCOW theory and how this is related to the inverse problem for EEG.

      What is the difference between data d and signal s in section 6.1.3? How are they related?

      We have added a much more detailed Appendix A where this (and other) details are provided.

      What assumptions have been made to get the form for the information Hamiltonian in eqn3?

      Eq 3 (now Eqn A.5) is actually very general. The approximations come in when constructing the interaction Hamiltonian H<sub>i</sub>.

      P33 “using coupling between different spatio-temporal points that is available from the data itself” I do not understand what is meant by this.

      This was a poorly worded sentence, but this section has now been replaced by Appendix A, which now contains the sentence that prior information “is contained within the data itself”. This refers to the fact that the prior information consists of correlations in the data, rather than some other measurements independent of the original data. This point is emphasized because in many Bayesian applications, prior information consists of knowledge of some quantity that was acquired independently from the data at hand (e.g., mean values from previous experiments).

      Reviewer 2 (Recommendations):

      Abstract

      The first part presents validation from simultaneous EEG/fMRI data, iEEG data, and comparisons with standard EEG analyses of an attention paradigm. Exactly what constitutes adequate validation or what metrics were used to assess performance is surprisingly absent.

      Subsequently, the manuscript examines a large cohort of subjects performing a gambling task and engaging in reward circuits. The claim is that this method offers an alternative to fMRI.

      Introduction

      Provocative statements require strong backing and evidence. In the first paragraph, the “quasi-static” assumption which is dominant in the field of EEG and MEG imaging is questioned with some classic citations that support this assumption. Instead of delving into why exactly the assumption cannot be relaxed, the authors claim that because the assumption was proved with average tissue properties rather than exact, it is wrong. This does not make sense. Citations to the WETCOW papers are insufficient to question the quasi-static assumption.

      The introduction purports to validate a novel theory and inverse modeling method but poorly outlines the exact foundations of both the theory (WETCOW) and the inverse modeling (SPECTRE) work.

      We have added a new introductory subsection (“A physical theory of brain waves”) to the Results section that provides a brief overview of the foundations of the WETCOW theory and an explicit description of why the quasi-static approximation can be abandoned. We have expanded the subsequent subsection (“Solution to the inverse EEG problem”) to more clearly detail the inverse modeling (SPECTRE) method.

      Section 3.2 Validation with fMRI

      Figure 1 supposedly is a validation of this promising novel theoretical approach that defies the existing body of literature in this field. Shockingly, a single subject data is shown in a qualitative manner with absolutely no quantitative comparison anywhere to be found in the manuscript. While there are similarities, there are also differences in reconstructions. What to make out of these discrepancies? Are there distortions that may occur with SPECTRE reconstructions? What are its tradeoffs? How does it deal with noise in the data?

      It is certainly not the case that there are no quantitative comparisons. Correlation coefficients, which are the sufficient statistics for comparison of activation regions, are given in Figure 5 for very specific activation regions. Figure 9 (now Figure 11) shows a t-statistic demonstrating the very high significance of the comparison between multiple subjects. And we have now added a new Figure 7 demonstrating the strongly correlated estimates for full vs surface intra-cranial EEG reconstructions. To make this more clear, we have added a new section “Statistical Significance of the Results”.

      We note that a discussion of the discrepancies between fMRI and EEG was already presented in the Supplementary Material. Therein we discuss the main point that fMRI and EEG are measuring different physical quantities and so should not be expected to be identical. We also highlight the fact that fMRI is prone to significant geometrical distortions from magnetic field inhomogeneities, and to physiological noise. To provide more visibility for this important issue, we have moved this text into the Discussion section.

      We do note that geometric distortions in fMRI data due to suboptimal acquisitions and corrections are all too common. This, coupled with the paucity of open source simultaneous fMRI-EEG data, made it difficult to find good data for comparison. The data on which we performed the quantitative statistical comparison between fMRI and EEG (Fig 5) was collected by co-author Dr Martinez, and was of the highest quality and therefore sufficient for comparison. The data used in Fig 1 and 2 was a well publicized open source dataset but had significant fMRI distortions that made quantitative comparison (i.e., correlation coefficients between subregions in the Harvard-Oxford atlas) suboptimal. Nevertheless, we wanted to demonstrate the method in more than one source, and feel that visual similarity is a reasonable measure for this data.

      Figure 2 Are the sample slices being shown? How to address discrepancies? How to assume that these are validations when there are such a level of discrepancies?

      It’s not clear what “sample slices” means. The issue of discrepancies is addressed in the response to the previous query.

      Figure 3 Similar arguments can be made for Figure 3. Here too, a comparison with source localization benchmarks is warranted because many papers have examined similar attention data.

      Regarding the fMRI/EEG comparison, these data are compared quantitatively in the text and in Figure 5.

      Regarding the suggestion to perform standard ’source localization’ analysis, see responses to Reviewer 1.

      Figure 4 While there is consistency across 5 subjects, there are also subtle and not-so-subtle differences.

      What to make out of them?

      Discrepancies in activation patterns between individuals are a complex neuroscience question that we feel is well beyond the scope of this paper.

      Figures 5 & 6 Figure 5 is also a qualitative figure from two subjects with no appropriate quantification of results across subjects. The same is true for Figure 6.

      On the contrary, Figure 5 contains a quantitative comparison, which is now also described in the text. A quantitative comparison for the epilepsy data in Fig 6 (and C.4-C.6) is now shown in Fig 7.

      Given the absence of appropriate “validation” of the proposed model and method, it is unclear how much one can trust results in Section 4.

      We believe that the quantitative comparisons extant in the original text (and apparently missed by the Reviewer) along with the additional quantitative comparisons are sufficient to merit trust in Section 4.

      What are the thresholds used in maps for Figure 7? Was correction for multiple comparisons performed? The final arguments at the end of section 4 do not make sense. Is the claim that all results of reconstructions from SPECTRE shown here are significant with no reason for multiple comparison corrections to control for false positives? Why so?

      We agree that the last line in Section 4 is misleading and have removed it.

      Discussion is woefully inadequate in addition to the inconclusive findings presented here.

      We have added a significant amount of text to the Discussion to address the points brought up by the Reviewer. And, contrary to the comments of this Reviewer, we believe the statistically significant results presented are not “inconclusive”.

      Supplementary Materials

      This reviewer had an incredibly difficult time understanding the inverse model solution. Even though this has been described in a prior publication by the authors, it is important and imperative that all details be provided here to make the current manuscript complete. The notation itself is so nonstandard. What is Σ<sup>ij</sup>, δ<sup>ij</sup>? Where is the reference for equation (1)? What about the equation for <sup>ˆ</sup>(R)? There are very few details provided on the exact implementation details for the Fourier-space pseudo-spectral approach. What are the dimensions of the problem involved? How were different tissue compartments etc. handled? Equation 1 holds for the entire volume but the measurements are only made on the surface. How was this handled? What is the WETCOW brain wave model? I don’t see any entropy term defined anywhere - where is it?

      We have added more detail on the theoretical and numerical aspects of the inverse problem in two new subsections “Theory” and “Numerical Implementation” in the new section “Solution to the inverse EEG problem”.

      So, how can one understand even at a high conceptual level what is being done with SPECTRE?

      We have added a new subsection “Summary of SPECTRE” that provides a high conceptual level overview of the SPECTRE method outlined in the preceding sections.

      In order to understand what was being presented here, it required the reader to go on a tour of the many publications by the authors where the difficulty in understanding what they actually did in terms of inverse modeling remains highly obscure and presents a huge problem for replicability or reproducibility of the current work.

      We have now included more basic material from our previous papers, and simplified the presentation to be more accessible. In particular, we have now moved the key aspects of the theoretic and numerical methods, in a more readable form, from the Supplementary Material to the main text, and added a new Appendix that provides a more intuitive and accessible overview of our estimation procedures.

      How were conductivity values for different tissue types assigned? Is there an assumption that the conductivity tensor is the same as the diffusion tensor? What does it mean that “in the present study only HRA data were used in the estimation procedure?” Does that mean that diffusion MRI data was not used? What is SYMREG? If this refers to the MRM paper from the authors in 2018, that paper does not include EEG data at all. So, things are unclear here.

      The conductivity tensor is not exactly the same as the diffusion tensor in brain tissues, but they are closely related. While both tensors describe transport properties in brain tissue, they represent different physical processes. The conductivity tensor is often assumed to share the same eigenvectors as the diffusion tensor. There is a strong linear relationship between the conductivity and diffusion tensor eigenvalues, as supported by theoretical models and experimental measurements. For the current study we only used the anatomical data for estimation and assignment of different tissue types, and no diffusion MRI data was used. To register between different modalities, including MNI, HRA, functional MRI, etc., and to transform the tissue assignment into an appropriate space we used the SYMREG registration method. A comment to this effect has been added to the text.
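
      The relationship described above (shared eigenvectors, linearly related eigenvalues) can be illustrated with a short sketch. This is only a conceptual illustration and was not used in the present study, which relied on anatomical data alone; the function name `conductivity_from_diffusion`, the `scale` factor, and the example tensor are hypothetical placeholders rather than values from the paper or the literature.

```python
import numpy as np


def conductivity_from_diffusion(D, scale=1.0):
    """Build a conductivity-like tensor that shares eigenvectors with D.

    Each eigenvalue of the symmetric diffusion tensor D is multiplied by
    `scale` (a hypothetical proportionality constant).
    """
    D = np.asarray(D, dtype=float)
    evals, evecs = np.linalg.eigh(D)      # eigendecomposition of symmetric D
    sigma_evals = scale * evals           # linear eigenvalue relationship
    return evecs @ np.diag(sigma_evals) @ evecs.T


# Arbitrary symmetric, positive-definite example tensor (illustrative only).
D = np.array([[1.7, 0.2, 0.0],
              [0.2, 0.4, 0.0],
              [0.0, 0.0, 0.3]])
sigma = conductivity_from_diffusion(D, scale=0.5)
```

      With a purely proportional eigenvalue mapping the result reduces to `scale * D`; the explicit eigendecomposition is kept so that a different eigenvalue mapping could be substituted.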

      How can reconstructed volumetric time-series of potential be thought of as the EM equivalent of an fMRI dataset? This sentence doesn’t make sense.

      This sentence indeed did not make sense and has been removed.

      Typical Bayesian inference does not include entropy terms, and entropy estimation doesn’t always lend to computing full posterior distributions. What is an “entropy spectrum pathway”? What is µ∗? Why can’t things be made clear to the reader, instead of incredible jargon used here? How does section 6.1.2 relate back to the previous section?

      It is correct that Bayesian inference typically does not include entropy terms. We believe that their introduction via the theory of entropy spectrum pathways (ESP) is a significant advance in Bayesian estimation, as it provides highly relevant prior information from within the data itself (and therefore always available in spatiotemporal data) that facilitates a practical methodology for the analysis of complex non-linear dynamical systems, as contained in the entropy field decomposition (EFD).

      Section 6.1.3 has now been replaced by a new Appendix A that discusses ESP in a much more intuitive and conceptual manner.

      Section 6.1.3 describes entropy field decomposition in very general terms. What is “non-period”? This section is incomprehensible. Without reference to exactly where in the process this procedure is deployed it is extremely difficult to follow. There seems to be an abuse of notation of using ϕ for eigenvectors in equation (5) and potentials earlier. How do equations 9-11 relate back to the original problem being solved in section 6.1.1? What are multiple modalities being described here that require JESTER?

      Section 6.1.3 has now been replaced by a new Appendix A that covers this material in a much more intuitive and conceptual manner.

      Section 6.3 discusses source localization methods. While most forward lead-field models assume quasistatic approximations to Maxwell’s equations, these are perfectly valid for the frequency content of brain activity being measured with EEG or MEG. Even with quasi-static lead fields, the solutions can have frequency dependence due to the data having frequency dependence. Solutions do not have to be insensitive to detailed spatially variable electrical properties of the tissues. For instance, if a FEM model was used to compute the forward model, this model will indeed be sensitive to the spatially variable and anisotropic electrical properties. This issue is not even acknowledged.

      The frequency dependence of the tissue properties is not the issue. Our theoretical work demonstrates that taking into account the anisotropy and inhomogeneity of the tissue is necessary in order to derive the existence of the weakly evanescent transverse cortical waves (WETCOW) that SPECTRE is detecting. We have added more details about the WETCOW model in the new Section “A physical theory of brain waves” to emphasize this point.

      Arguments to disambiguate deep vs shallow sources can be achieved with some but not all source localization algorithms and do not require a non-quasi-static formulation. LORETA is not even the main standard algorithm for comparison. It is disappointing that there are no comparisons to source localization and that this is dismissed away due to some coding issues.

      Again, we are not doing ’source localization’. The concept of localized dipole sources is anathema to our brain wave model, and so in our view comparing SPECTRE to such methods only propagates the misleading idea that they are doing the same thing. So they are definitely not dismissed due to coding issues. However, because of repeated requests to compare SPECTRE with such methods, we attempted to run a standard source localization method with parameters that would at least provide the closest approximation to what we were doing. This attempt highlighted a serious computational issue in source localization methods that is a direct consequence of the fact that they are not attempting to do what SPECTRE is doing - describing a time-varying wave field, in the technical definition of a ’field’ as an object that has a value at every point in space-time.

    1. Author response:

      The following is the authors’ response to the original reviews.

      Public Reviews: 

      Reviewer #1 (Public Review): 

      Summary: 

      Bennion and colleagues present a careful examination of how an earlier set of memories can either interfere with or facilitate memories formed later. This impressive work is a companion piece to an earlier paper by Antony and colleagues (2022) in which a similar experimental design was used to examine how a later set of memories can either interfere with or facilitate memories formed earlier. This study makes contact with an experimental literature spanning 100 years, which is concerned with the nature of forgetting, and the ways in which memories for particular experiences can interact with other memories. These ideas are fundamental to modern theories of human memory; for example, paired-associate studies like this one are central to the theoretical idea that interference between memories is a much bigger contributor to forgetting than any sort of passive decay.

      Strengths: 

      At the heart of the current investigation is a proposal made by Osgood in the 1940s regarding how paired associates are learned and remembered. In these experiments, one learns a pair of items, A-B (cue-target), and then later learns another pair that is related in some way, either A'-B (changing the cue, delta-cue), or A-B' (changing the target, delta-target), or A'-B' (changing both, delta-both), where the prime indicates that item has been modified, and may be semantically related to the original item. The authors refer to the critical to-be-remembered pairs as base pairs. Osgood proposed that when the changed item is very different from the original item there will be interference, and when the changed item is similar to the original item there will be facilitation. Osgood proposed a graphical depiction of his theory in which performance was summarized as a surface, with one axis indicating changes to the cue item of a pair and the other indicating changes to the target item, and the surface itself necessary to visualize the consequences of changing both. 

      In the decades since Osgood's proposal, there have been many studies examining slivers of the proposal, e.g., just changing targets in one experiment, just changing cues in another experiment. Because any pair of experiments uses different methods, this has made it difficult to draw clear conclusions about the effects of particular manipulations. 

      The current paper is a potential landmark, in that the authors manipulate multiple fundamental experimental characteristics using the same general experimental design. Importantly, they manipulate the semantic relatedness of the changed item to the original item, the delay between the study experience and the test, and which aspect of the pair is changed. Furthermore, they include both a positive control condition (where the exact same pair is studied twice), and a negative control condition (where a pair is only studied once, in the same phase as the critical base pairs). This allows them to determine when the prior learning exhibits an interfering effect relative to the negative control condition and also allows them to determine how close any facilitative effects come to matching the positive control. 

      The results are interpreted in terms of a set of existing theories, most prominently the memory-for-change framework, which proposes a mechanism (recursive reminding) potentially responsible for the facilitative effects examined here. One of the central results is the finding that a stronger semantic relationship between a base pair and an earlier pair has a facilitative effect on both the rate of learning of the base pair and the durability of the memory for the base pair. This is consistent with the memory-for-change framework, which proposes that this semantic relationship prompts retrieval of the earlier pair, and the two pairs are integrated into a common memory structure that contains information about which pair was studied in which phase of the experiment. When semantic relatedness is lower, they more often show interference effects, with the idea being that competition between the stored memories makes it more difficult to remember the base pair. 

      This work represents a major methodological and empirical advance for our understanding of paired-associates learning, and it sets a laudably high bar for future work seeking to extend this knowledge further. By manipulating so many factors within one set of experiments, it fills a gap in the prior literature regarding the cognitive validity of an 80-year-old proposal by Osgood. The reader can see where the observed results match Osgood's theory and where they are inconclusive. This gives us insight, for example, into the necessity of including a long delay in one's experiment, to observe potential facilitative effects. This point is theoretically interesting, but it is also a boon for future methodological development, in that it establishes the experimental conditions necessary for examining one or another of these facilitation or interference effects more closely. 

      We thank the reviewer for their thorough and positive comments -- thank you so much!

      Weaknesses: 

      One minor weakness of the work is that the overarching theoretical framing does not necessarily specify the expected result for each and every one of the many effects examined. For example, with a narrower set of semantic associations being considered (all of which are relatively high associations) and a long delay, varying the semantic relatedness of the target item did not reliably affect the memorability of that pair. However, the same analysis showed a significant effect when the wider set of semantic associations was used. The positive result is consistent with the memory-for-change framework, but the null result isn't clearly informative to the theory. I call this a minor weakness because I think the value of this work will grow with time, as memory researchers and theorists use it as a benchmark for new theory development. For example, the data from these experiments will undoubtedly be used to develop and constrain a new generation of computational models of paired-associates learning. 

      We thank the reviewer for this constructive critique. We agree that the experiments with a narrower set of semantic associations are less informative; in fact, we thought about removing these experiments from the current study, but given that we found results in the ΔBoth condition in Antony et al. (2022) using these stimuli that we did NOT find in the wider set, we thought it was worth including for a thorough comparison. We hope that the analyses combining the two experiment sets (Fig 6-Supp 1) are informative for contextualizing the results in the ‘narrower’ experiments and, as the reviewer notes, for informing future researchers.

      Reviewer #2 (Public Review): 

      Summary: 

      The study focuses on how relatedness with existing memories affects the formation and retention of new memories. Of core interest were the conditions that determine when prior memories facilitate new learning or interfere with it. Across a set of experiments that varied the degree of relatedness across memories as well as retention interval, the study compellingly shows that relatedness typically leads to proactive facilitation of new learning, with interference only observed under specific conditions and immediate test and being thus an exception rather than a rule. 

      Strengths: 

      The study uses a well-established word-pair learning paradigm to study interference and facilitation of overlapping memories. However, it goes more in-depth than a typical interference study in the systematic variation of several factors: (1) which elements of an association are overlapping and which are altered (change target, change cue, change both, change neither); (2) how much the changed element differs from the original (word relatedness, with two ranges of relatedness considered); (3) retention period (immediate test, 2-day delay). Furthermore, each experiment has a large sample size, so both significant and null effects are robust and informative. 

      The results show the benefits of relatedness, but also replicate interference effects in the "change target" condition when the new target is not related to the old target and when the test is immediate. This provides a reconciliation of some existing seemingly contradictory results on the effect of overlap on memory. Here, the whole range of conditions is mapped to convincingly show how the direction of the effect can flip across the surface of relatedness values. 

      Additional strength comes from supporting analyses, such as analyses of learning data, demonstrating that relatedness leads to both better final memory and also faster initial learning. 

      More broadly, the study informs our understanding of memory integration, demonstrating how the interdependence of memory for related information increases with relatedness. Together with a prior study of retroactive interference and facilitation, the results provide new insights into the role of reminding in memory formation. 

      In summary, this is a highly rigorous body of work that sets a great model for future studies and improves our understanding of memory organization. 

      We thank the reviewer for their thorough summary and very supportive words!

      Weaknesses: 

      The evidence for the proactive facilitation driven by relatedness is very convincing. However, in the finer scale results, the continuous relationship between the degree of relatedness and the degree of proactive facilitation/interference is less clear. This could be improved with some additional analyses and/or context and discussion. In the narrower range, the measure used was AS, with values ranging from 0.03-0.98, where even 0.03 still denotes clearly related words (pious - holy). Within this range from "related" to "related a lot", no relationship to the degree of facilitation was found. The wider range results are reported using a different scale, GloVe, with values from -0.14 to 0.95, where the lower end includes unrelated words (sap - laugh). It is possible that any results of facilitation/interference observed in the wider range may be better understood as a somewhat binary effect of relatedness (yes or no) rather than the degree of relatedness, given the results from the narrower condition. These two options could be more explicitly discussed. The report would benefit from providing clearer information about these measures and their range and how they relate to each other (e.g., not a linear transformation). It would be also helpful to know how the values reported on the AS scale would end up if expressed in the GloVe scale (and potentially vice-versa) and how that affects the results. Currently, it is difficult to assess whether the relationship between relatedness and memory is qualitative or quantitative. This is less of a problem with interdependence analyses where the results converge across a narrow and wider range. 

      We thank the reviewer for this point. While other analyses do show differences across the range of AS values we used, we agree that, in the case of the memorability analysis in the narrower stimulus set, 48-hr experiment (or combining across the narrower and wider stimulus sets), there could be a stronger influence of binary (yes/no) relatedness. We have now made this point explicitly (p. 26):

      “Altogether, these results show that PI can still occur with low relatedness, like in other studies finding PI in ΔTarget (A-B, A-D) paradigms (for a review, see Anderson & Neely, 1996), but PF occurs with higher relatedness. In fact, the absence of low relatedness pairs in the narrower stimulus set likely led to the strong overall PF in this condition across all pairs (positive y-intercept in the upper right of Fig 3A). In this particular instance, there may have been a stronger influence of a binary factor (whether they are related or not), though this remains speculative and is not the case for other analyses in our paper.”

      Additionally, we have emphasized that the two relatedness metrics are not linear transforms of each other. Finally, to address both your comment and Reviewer #3’s comment below, we now graph relatedness values under a common GloVe metric in Fig 1-Supp 1C (p. 9):

      “Please note that GloVe is an entirely different relatedness metric and is not a linear transformation of AS (see Fig 1-Supp 1C for how the two stimulus sets compare using the common GloVe metric).”

      A smaller weakness is generalizability beyond the word set used here. Using a carefully crafted stimulus set and repeating the same word pairings across participants and conditions was important for memorability calculations and some of the other analyses. However, highlighting the inherently noisy item-by-item results, especially in the Osgood-style surface figures, makes it challenging to imagine how the results would generalize to new stimuli, even within the same relatedness ranges as the current stimulus sets. 

      We thank the reviewer for this critique. We have added this caveat in the limitations to suggest that future studies should replicate these general findings with different stimulus sets (p. 28):

      “Finally, future studies could ensure these effects are not limited to these stimuli and generalize to other word stimuli in addition to testing other domains (Baek & Papaj, 2024; Holding, 1976).”

      Reviewer #3 (Public Review): 

      Summary: 

      Bennion et al. investigate how semantic relatedness proactively benefits the learning of new word pairs. The authors draw predictions from Osgood (1949), which posits that the degree of proactive interference (PI) and proactive facilitation (PF) of previously learned items on to-be-learned items depends on the semantic relationships between the old and new information. In the current study, participants learn a set of word pairs ("supplemental pairs"), followed by a second set of pairs ("base pairs"), in which the cue, target, or both words are changed, or the pair is identical. Pairs were drawn from either a narrower or wider stimulus set and were tested after either a 5-minute or 48-hour delay. The results show that semantic relatedness overwhelmingly produces PF and greater memory interdependence between base and supplemental pairs, except in the case of unrelated pairs in a wider stimulus set after a short delay, which produced PI. In their final analyses, the authors compare their current results to previous work from their group studying the analogous retroactive effects of semantic relatedness on memory. These comparisons show generally similar, if slightly weaker, patterns of results. The authors interpret their results in the framework of recursive reminders (Hintzman, 2011), which posits that the semantic relationships between new and old word pairs promote reminders of the old information during the learning of the new to-be-learned information. These reminders help to integrate the old and new information and result in additional retrieval practice opportunities that in turn improve later recall. 

      Strengths: 

      Overall, I thought that the analyses were thorough and well-thought-out and the results were incredibly well-situated in the literature. In particular, I found that the large sample size, inclusion of a wide range of semantic relatedness across the two stimulus sets, variable delays, and the ability to directly compare the current results to their prior results on the retroactive effects of semantic relatedness were particular strengths of the authors' approach and make this an impressive contribution to the existing literature. I thought that their interpretations and conclusions were mostly reasonable and included appropriate caveats (where applicable). 

      We thank the reviewer for this kind, effective summary and highlight of the paper’s strengths!

      Weaknesses: 

      Although I found that the paper was very strong overall, I have three main questions and concerns about the analyses. 

      My first concern lies in the use of the narrow versus wider stimulus sets. I understand why the initial narrow stimulus set was defined using associative similarity (especially in the context of their previous paper on the retroactive effects of semantic similarity), and I also understand their rationale for including an additional wider stimulus set. What I am less clear on, however, is the theoretical justification for separating the datasets. The authors include a section combining them and show in a control analysis that there were no directional effects in the narrow stimulus set. The authors seem to imply in the Discussion that they believe there are global effects of the lower average relatedness on differing patterns of PI vs PF across stimulus sets (lines 549-553), but I wonder if an alternative explanation for some of their conflicting results could be that PI only occurs with pairs of low semantic relatedness between the supplemental and base pair and that because the narrower stimulus set does not include the truly semantically unrelated pairs, there was no evidence of PI. 

      We agree with the reviewer’s interpretation here, and we have now directly stated this in the discussion section (p. 26):

      “Altogether, these results show that PI can still occur with low relatedness, like in other studies finding PI in ΔTarget (A-B, A-D) paradigms (for a review, see Anderson & Neely, 1996), but PF occurs with higher relatedness. In fact, the absence of low relatedness pairs in the narrower stimulus set likely led to the strong overall PF in this condition across all pairs (positive y-intercept in the upper right of Fig 3A).”

      As for the remainder of this concern, please see our response to your elaboration on the critique below.

      My next concern comes from the additive change in both measures (change in Cue + change in Target). This measure is simply a measure of overall change, in which a pair where the cue changes a great deal but the target doesn't change is treated equivalently to a pair where the target changes a lot, but the cue does not change at all, which in turn are treated equivalently to a pair where the cue and target both change moderate amounts. Given that the authors speculate that there are different processes occurring with the changes in cue and target and the lack of relationship between cue+target relatedness and memorability, it might be important to tease apart the relative impact of the changes to the different aspects of the pair. 

      We thank the reviewer for this great point. First, we should clarify that we only added cue and target similarity values in the ΔBoth condition, which means that all instances of equivalence relate to non-zero values for both cue and target similarity. However, it is certainly possible cue and target similarity separately influence memorability or interdependence. We have now run this analysis separately for cue and target similarity (but within the ΔBoth condition). For memorability, neither cue nor target similarity independently predicted memorability within the ΔBoth condition in any of the four main experiments (all p > 0.23). Conversely, there were some relationships with interdependence. In the narrower stimulus set, 48-hr delay experiment, both cue and target similarity significantly or marginally predicted base-secondary pair interdependence (Cue: r = 0.30, p = 0.04; Target: r = 0.29, p = 0.054). Notably, both survived partial correlation analyses partialing out the other factor (Cue: r = 0.33, p = 0.03; Target: r = 0.32, p = 0.04). In the wider stimulus set, 48-hr delay experiment, only target similarity predicted interdependence (Cue: r = 0.09, p = 0.55; Target: r = 0.34, p = 0.02), and target similarity also predicted interdependence after partialing out cue similarity (r = 0.34, p = 0.02). Similarly, in the narrower stimulus set, 5-min delay experiment, only target similarity predicted interdependence (Cue: r = 0.01, p = 0.93; Target: r = 0.41, p = 0.005), and target similarity also predicted interdependence after partialing out cue similarity (r = 0.42, p = 0.005). Neither predicted interdependence in the wider stimulus set, 5-min delay experiment (Cue: r = -0.14, p = 0.36; Target: r = 0.09, p = 0.54). We have opted to leave this out of the paper for now, but we could include it if the reviewer believes it is worthwhile.

      Note that we address the multiple regression point raised by the reviewer in the critique below.
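      The partial correlation analysis described in this response (relating cue similarity to interdependence while partialing out target similarity, and vice versa) can be illustrated with a minimal sketch. It assumes the standard first-order partial correlation formula; the variable names and the per-pair values are hypothetical and are not data from the study.

```python
import math

def pearson_r(x, y):
    """Pearson correlation between two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)

def partial_r(x, y, z):
    """First-order partial correlation of x and y, controlling for z."""
    r_xy = pearson_r(x, y)
    r_xz = pearson_r(x, z)
    r_yz = pearson_r(y, z)
    return (r_xy - r_xz * r_yz) / math.sqrt((1 - r_xz ** 2) * (1 - r_yz ** 2))

# Hypothetical per-pair values for a delta-both condition (NOT study data).
cue_sim = [0.10, 0.35, 0.20, 0.60, 0.45, 0.80, 0.55, 0.30]
target_sim = [0.50, 0.15, 0.40, 0.25, 0.70, 0.35, 0.90, 0.60]
interdependence = [0.12, 0.10, 0.14, 0.18, 0.25, 0.24, 0.33, 0.20]

# Zero-order correlations, then each predictor with the other partialed out.
r_cue = pearson_r(cue_sim, interdependence)
r_target = pearson_r(target_sim, interdependence)
r_cue_given_target = partial_r(cue_sim, interdependence, target_sim)
r_target_given_cue = partial_r(target_sim, interdependence, cue_sim)
```

      A predictor whose partial correlation stays close to its zero-order correlation contributes to interdependence largely independently of the other predictor, which is the pattern reported above for the narrower stimulus set at the 48-hr delay.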

      Finally, it is unclear to me whether there was any online spell-checking that occurred during the free recall in the learning phase. If there wasn't, I could imagine a case where words might have accidentally received additional retrieval opportunities during learning - take for example, a case where a participant misspelled "razor" as "razer." In this example, they likely still successfully learned the word pair but if there was no spell-checking that occurred during the learning phase, this would not be considered correct, and the participant would have had an additional learning opportunity for that pair. 

      We did not use online spell checking. We agree that misspellings would be considered successful instances of learning (meaning that for those words, they would essentially have successful retrieval more than once). However, we do not have a reason to think that this would meaningfully differ across conditions, so the main learning results would still hold. We have included this in the Methods (p. 29-30):

      “We did not use spell checking during learning, meaning that in some cases pairs could have been essentially retrieved more than once. However, we do not believe this would differ across conditions to affect learning results.”

      Recommendations for the authors:

      Reviewer #1 (Recommendations For The Authors): 

      In terms of the framing of the paper, I think the paper would benefit from a clearer explication of the different theories at play in the introductory section. There are a few theories being examined. Memory-for-change is described in most detail in the discussion; it would help to describe it more deliberately in the intro. The authors refer to a PI account, and this is contrasted with the memory-for-change account, but it seems to me that these theories are not mutually exclusive. In the discussion, several theories are mentioned in passing without being named, e.g., I believe the authors are referring to the fan effect when they mention the difference between delta-cue and delta-target conditions. Perhaps this could be addressed with a more detailed account of the theory underlying Osgood's predictions, which I believe arise from an associative account of paired-associates memory. Osgood's work took place when there was a big debate between unlearning and interference. The current work isn't designed to speak directly to that old debate. But it may be possible to develop the theory a bit more in the intro, which would go a long way towards scaffolding the many results for the reader, by giving them a better sense up front of the theoretical implications. 

      We thank the reviewer for this comment and the nudge to clarify these points. First, we have now made the memory-for-change and remindings accounts more explicit in the introduction, as well as the fact that we are combining the two in forming predictions for the current study (p. 3):

      “Conversely, in favor of the PF account, we consider two main, related theories. The first is the importance of “remindings” in memory, which involve reinstating representations from an earlier study phase during later learning (Hintzman, 2011). This idea centers study-phase retrieval, which involves being able to mentally recall prior information and is usually applied to exact repetitions of the same material (Benjamin & Tullis, 2010; Hintzman et al., 1975; Siegel & Kahana, 2014; Thios & D’Agostino, 1976; Zou et al., 2023). However, remindings can occur upon the presentation of related (but not identical) material and can result in better memory for both prior and new information when memory for the linked events becomes more interdependent (Hintzman, 2011; Hintzman et al., 1975; McKinley et al., 2019; McKinley & Benjamin, 2020; Schlichting & Preston, 2017; Tullis et al., 2014; Wahlheim & Zacks, 2019). The second is the memory-for-change framework, which builds upon these ideas and argues that humans often retrieve prior experiences during new learning, either spontaneously by noticing changes from what was learned previously or by instruction (Jacoby et al., 2015; Jacoby & Wahlheim, 2013). The key advance of this framework is that recollecting changes is necessary for PF, whereas PI occurs without recollection. This framework has been applied to paradigms including stimulus changes, including common paired associate paradigms (e.g., A-B, A-D) that we cover extensively later. Because humans may be more likely to notice and recall prior information when it is more related to new information, these two accounts would predict that semantic relatedness instead promotes successful remindings, which would create PF and interdependence among the traces.”

      Second, as the reviewer suggests, we were referring to the fan effect in the discussion, and we have now made that more explicit (p. 26):

      “We believe these effects arise from the competing processes of impairments between competing responses at retrieval that have not been integrated versus retrieval benefits when that integration has occurred (which occurs especially often with high target relatedness). These types of competing processes appear operative in various associative learning paradigms such as retrieval-induced forgetting (Anderson & McCulloch, 1999; Carroll et al., 2007), and the fan effect (Moeser, 1979; Reder & Anderson, 1980).”

      Finally, our reading of Osgood’s proposal is as an attempt to summarize the qualitative effects of the scattered literature (as of 1949) and did not discuss many theories. For this reason, we generally focus on the directional predictions relating to Osgood’s surface, but we couch it in theories proposed since then.

      It strikes me that the advantage seen for items in the retroactive study compared to the proactive study is consistent with classic findings examining spontaneous recovery. These classic studies found that first-learned materials tended to recover to a level above second-learned materials as time passed. This could be consistent with the memory-for-change proposal presented in the text. The memory-for-change proposal provides a potential cognitive mechanism for the effect; here I'm just suggesting a connection that could be made with the spontaneous recovery literature. 

      We thank the reviewer for this suggestion. Indeed, we agree there is a meaningful point of connection here. We have added the following to the Discussion (p. 27):

      “Additionally, these effects partially resemble those on spontaneous recovery, whereby original associations tend to face interference after new, conflicting learning, but slowly recover over time (either absolutely or relative to the new learning) and often eventually eclipse memory for the new information (Barnes & Underwood, 1959; Postman et al., 1969; Wheeler, 1995). In both cases, original associations appear more robust to change over time, though it is unclear whether these similar outcomes stem from similar mechanisms.”

      Minor recommendations 

      Line 89: relative existing -> relative to existing. 

      Line 132: "line from an unrelated and identical target" -> from an unrelated to identical target (take a look, just needs rephrasing). 

      Line 340: (e.g. peace-shaverazor) I wasn't clear whether this was a typographical error, or whether the intent was to typographically indicate a unified representation. 

      Line 383: effects on relatedness -> effects of relatedness. 

      We thank the reviewer for catching these errors. We have fixed them, and for the third comment, we have clarified that we indeed meant to indicate a unified representation (p. 12):

      “[e.g., peace-shaverazor (written jointly to emphasize the unification)]”

      Page 24: Figure 8. I think the statistical tests in this figure are just being done between the pairs of the same color? Like in the top left panel, delta-cue pro and delta-target retro are adjacent and look equivalent, but there is no n.s. marking for this pair. Could consider keeping the connecting line between the linked conditions and removing the connecting lines that span different conditions. 

      Indeed, we were only comparing conditions with the same color. We have changed the connecting lines to reflect this.

      Page 26 line 612: I think this is the first mention that the remindings account is referred to as the memory-for-change framework, consider mentioning this in the introduction. 

      Thank you – we have now mentioned this in the introduction.

      Lines 627-630. Is this sentence referring to the fan effect? If so it could help the reader to name it explicitly. 

      We have now named this explicitly.

      Reviewer #2 (Recommendations For The Authors): 

      This is a matter of personal preference, but I would prefer PI and PF spelled out instead of the abbreviations. This was also true for RI and RF which are defined early but then not used for 20 pages before being re-used again. In contrast, the naming of the within-subject conditions was very intuitive. 

      We appreciate this perspective. However, we prefer to keep the terms PI and PF for the sake of brevity. We now re-introduce terms that do not return until later in the manuscript.

      Osgood surface in Figure 1A could be easier to read if slightly reformatted. For example, target and cue relatedness sides are very disproportional and I kept wondering if that was intentional. The z-axis could be slightly more exaggerated so it's easier to see the critical messages in that figure (e.g., flip from + to - effect along the one dimension). The example word pairs were extremely helpful. 

      Figures 1C and 1D were also very helpful. It would be great if they could be a little bigger as the current version is hard to read. 

      Figure 1B took a while to decipher and could use a little more anticipation in the body of the text. Any reason to plot the x-axis from high to low on this figure? It is confusing (and not done in the actual results figures). I believe the supplemental GloVe equivalent in the supplement also has a confusing x-axis. 

      We thank the reviewer for this feedback. We have modified Figure 1A to reduce the disproportionality and accentuate the z-axis changes. We have also made the text in C and D larger. Finally, we have flipped around the x-axis in B and in the supplement.

      The description of relatedness values was rather confusing. It is not intuitive to accept that AS values from 0.03-0.96 are "narrow", as that seems to cover almost the whole theoretical range. I do understand that 0.03 is still a value showing relatedness, but more explanation would be helpful. It is also not clear how the GloVe values compare to the AS values. If I am understanding the measures and ranges correctly, the "narrow" condition could also be called "related only" while the "wide" condition could be called "related and unrelated". This is somewhat verbalized but could be clearer. In general, please provide a straightforward way for a reader to explicitly or implicitly compare those conditions, or even plot the "narrow" condition using both AS values and GloVe values so one can really compare narrow and wider conditions comparing apples with apples. 

      We thank the reviewer for this critique. First, we have now sought to clarify this in the Introduction (p. 11-12):

      “Across the first four experiments, we manipulated two factors: range of relatedness among the pairs and retention interval before the final test. The narrower range of relatedness used direct AS between pairs using free association norms, such that all pairs had between 0.03-0.96 association strength. Though this encompasses what appears to be a full range of relatedness values, pairs with even low AS are still related in the context of all possible associations (e.g., pious-holy has AS = 0.03 but would generally be considered related) (Fig 1B). The stimuli using a wider range of relatedness spanned the full range of global vector similarity (Pennington et al., 2014) that included many associations that would truly be considered unrelated (Fig 1-Supp 1A). One can see the range of the wider relatedness values in Fig 1-Supp 1B and comparisons between narrower and wider relatedness values in Fig 1-Supp 1C.”

      Additionally, as noted in the text above, we have added a new subfigure to Fig 1-Supp 1 that compares the relatedness values in the narrower and wider stimulus sets using the common GloVe metric.

      Considering a relationship other than linear may also be beneficial (e.g., the difference between AS of 0.03 and 0.13 may not be equal to AS of .83 and .93; same with GloVe). I am assuming that AS and GloVe are not linear transforms of each other. Thus, it is not clear whether one should expect a linear (rather than curvilinear or another monotonic) relationship with both of them. It could be as simple as considering rank-order correlation rather than linear correlation, but just wanted to put this out for consideration. The linear approach is still clearly fruitful (e.g., interdependence), but limits further the utility of having both narrow and wide conditions without a straightforward way to compare them. 

      We thank the reviewer for this point. Indeed, AS and GloVe are not linear transforms of each other, but metrics derived from different sources (AS comes from human free associations; GloVe comes from a learned vector space language model). (We noted this in the text and in our response to your above comment.) However, we do have the ability to put all the word pairs into the GloVe metric, which we do in the Results section, “Re-assessing proactive memory and interdependence effects using a common metric”. In this analysis, we used a linear correlation that combined data sets with a similar retention interval and replicated our main findings earlier in the paper (p. 5):

      “In the 48-hr delay experiment, correlations between memorability and cue relatedness in the ΔCue condition [r2(44) > 0.29, p < 0.001] and target relatedness in the ΔTarget condition [r2(44) = 0.2, p < 0.001] were significant, whereas cue+target relatedness in the ΔBoth condition was not [r2(44) = 0.01, p = 0.58]. In all three conditions, interdependence increased with relatedness [all r2(44) > 0.16, p < 0.001].”

      Following the reviewer's suggestion to test things out using rank order, we also re-created the combined analysis using rank order based on GloVe values rather than the raw GloVe values. The ranks now span 1-90 (because there were 45 pairs in each of the narrower and wider stimulus sets). All results qualitatively held.

      Author response image 1.

      Rank order results.

      Author response image 2.

      And the raw results in Fig 6-Supp 1 (as a reference).
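      The rank-order transformation described above can be sketched as follows. This is a minimal illustration rather than the analysis code used for the study: the function names are hypothetical, ties are given their average rank (an assumption), and only the GloVe predictor is converted to ranks while memorability is left on its original scale.

```python
from statistics import mean


def ranks(values):
    """Return 1-based ranks; tied values share their average rank (assumed convention)."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = avg
        i = j + 1
    return result


def pearson_r(x, y):
    """Plain Pearson correlation between two equal-length sequences."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


def rank_order_r(glove, memorability):
    """Correlate memorability with the rank of GloVe similarity instead of its raw value."""
    return pearson_r(ranks(glove), memorability)
```

      With the 90 combined pairs, `ranks` would return values spanning 1 to 90, so any monotonic but non-linear spacing in the raw GloVe values no longer influences the correlation.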

      Reviewer #3 (Recommendations For The Authors):

      In regards to my first concern, the authors could potentially test whether the stimulus sets are different by specifically looking at pairs from the wider stimulus set that overlap with the range of relatedness from the narrow set and see if they replicate the results from the narrow stimulus set. If the results do not differ, the authors could simplify their results section by collapsing across stimulus sets (as they did in the analyses presented in Figure 6 - Supplementary Figure 1). If the authors opt to keep the stimulus sets separate, it would be helpful to include a version of Figure 1b/Figure 1 - Supplementary Figure 1 where the coverage of the two stimulus sets are plotted on the same figure using GloVe similarity so it is easier to interpret the results. 

      We have conducted this analysis in two ways, though we note that we ultimately settled on keeping the stimulus sets separate. First, we examined memorability between the data sets by removing one pair at a time from the wider stimulus set until there was no significant difference (p > 0.05). We did this at the long delay because that was more informative for most of our analyses. Even after reducing the wider stimulus set, the narrow stimulus set still had significantly or marginally higher memorability in all three conditions (p < 0.001 for ΔCue; p < 0.001 for ΔTarget; p = 0.08 for ΔBoth). We reasoned that this was likely because the AS values still differed (all p < 0.001), which would present a clear way for participants to associate words that may not be as strongly similar in vector space (perhaps due to polysemy for individual words). When we ran the analysis a different way that equated AS, we no longer found significant memorability differences (p = 0.13 for ΔCue; p = 0.50 for ΔTarget; p = 0.18 for ΔBoth). However, equating the two data sets in this analysis required us to drop so many pairs from the wider stimulus set (because only a few had a direct AS connection; there were 3, 5, and 1 pairs kept in the ΔCue, ΔTarget, and ΔBoth conditions) that we would prefer not to report this result.

      Additionally, we now plot the two stimulus sets on the same plot (Reviewer 2 also suggested this).

      In regards to my second concern, one potential way the authors could disambiguate the effects of change in cue vs change in target might be to run a multiple linear regression with change in Cue, change in Target, and the change in Cue*change in Target interaction (potentially with random effects of subject identity and word pair identity to combine experiments and control for pair memorability/counterbalancing), which has the additional bonus of potentially allowing the authors to include all word pairs in a single model and better describe the Osgood-style spaces in Figure 6.

      This is a very interesting idea. We set this analysis up as the reviewer suggested, using fixed effects for ΔCue, ΔTarget, and ΔCue*ΔTarget, and random effects for subject and word ID. Because we had a binary outcome variable, we used mixed effects logistic regression. For a given pair, if it had the same cue or target, the corresponding change column received a 0, and if it had a different cue or target, it received a graded value (1 - GloVe value between the new and old cue or target). Because we designed this analysis to indicate a treatment away from a repeat (as in the No Δ condition, which had no change for either cues or targets), we omitted control items. For items in the ΔBoth condition, we initially used positive values in both the Cue and Target columns too, with the multiplied ΔCue*ΔTarget value in its own column. We focused these analyses on the 48-hr delay experiments. In both experiments, running it this way resulted in highly significant negative effects of ΔCue and ΔTarget (both p < 0.001), but positive effects of ΔCue*ΔTarget (p < 0.001), presumably because after accounting for the negative independent predictions of both ΔCue and ΔTarget, ΔCue*ΔTarget values actually were better than expected.

      We thought that those results were a little strange given that generally there did not appear to be interactions with ΔCue*ΔTarget values, and the positive result was simply due to the other predictors in the model. To show that this is the case, we changed the predictors so that items in the ΔBoth condition had 0 in ΔCue and ΔTarget columns alongside their ΔCue*ΔTarget value. In this case, all three factors negatively predicted memory (all p < 0.001).
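      The two distance-based coding schemes described above can be made concrete with a short sketch. It only builds the predictor columns for one word pair and does not fit the mixed effects logistic regression; the function name, the condition labels, and the `zero_main_in_both` flag are hypothetical names introduced for illustration.

```python
def code_predictors(condition, cue_glove=None, target_glove=None,
                    zero_main_in_both=False):
    """Return (d_cue, d_target, interaction) for one word pair.

    condition: "no_change", "cue", "target", or "both" (hypothetical labels).
    cue_glove / target_glove: GloVe similarity between the old and new item.
    zero_main_in_both: if True, pairs in the "both" condition get 0 in the
        cue and target columns and carry only the interaction value
        (the second coding scheme described in the text).
    """
    # An unchanged element is coded 0; a changed element is coded 1 - similarity.
    d_cue = 1.0 - cue_glove if condition in ("cue", "both") else 0.0
    d_target = 1.0 - target_glove if condition in ("target", "both") else 0.0
    # The interaction column holds the product of the two change values.
    interaction = d_cue * d_target
    if condition == "both" and zero_main_in_both:
        d_cue, d_target = 0.0, 0.0
    return d_cue, d_target, interaction
```

      Under the first scheme a ΔBoth pair contributes to all three columns, so the interaction coefficient is estimated relative to the two main effects; under the second it contributes only to the interaction column, which is why the sign of that coefficient can differ between the two fits.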

      We don't necessarily see this second approach as better, partly because it seems clear to us that any direction you go from identity is just hurting memory, and we felt the need to drop the control condition. We next flipped around the analysis to more closely resemble how we ran the other analyses, using similarity instead of distance. Here, identity along any dimension indicated a 1, a change in any part of the pair involved using that pair’s GloVe value (rather than the 1 – the GloVe value from above), and the control condition simply had zeros in all the columns. In this case, if we code the cue and target similarity values as themselves in the ΔBoth condition, in both 48-hr experiments, cue and target similarity significantly positively predicted memory (narrower set: cue similarity had p = 0.006, target similarity had p < 0.001; wider set: both p < 0.001) and the interaction term negatively predicted memory (p < 0.001 in both). If we code cue and target similarity values as 0s in the ΔBoth condition, all three factors tend to be positive (narrower, Cue: p = 0.11, Target and Interaction: p < 0.001; wider, Cue and Target p < 0.001; Interaction: p = 0.07).

      Ultimately, we would prefer to leave this out of the manuscript in the interest of simplicity and because we largely find that these analyses support our prior conclusions. However, we could include them if the reviewer prefers.

    1. Author response:

      The following is the authors’ response to the original reviews.

      Public Reviews: 

      Reviewer #1 (Public Review):

      In this study, Alejandro Rosell et al. uncovers the immunoregulation functions of RAS-p110α pathway in macrophages, including the extravasation of monocytes from the bloodstream and subsequent lysosomal digestion. Disrupting RAS-p110α pathway by mouse genetic tools or by pharmacological intervention, hampers the inflammatory response, leading to delayed resolution and more severe acute inflammatory reactions. The authors proposed that activating p110α using small molecules could be a promising approach for treating chronic inflammation. This study provides insights into the roles and mechanisms of p110α on macrophage function and the inflammatory response, while some conclusions are still questionable because of several issues described below. 

      (1) Fig. 1B showed that disruption of RAS-p110α causes the decrease in the activation of NF-κB, which is a crucial transcription factor that regulates the expression of proinflammatory genes. However, the authors observed that disruption of RAS-p110α interaction results in an exacerbated inflammatory state in vivo, in both localized paw inflammation and systemic inflammatory mediator levels. Also, the authors introduced that "this disruption leads to a change in macrophage polarization, favoring a more proinflammatory M1 state" in introduction according to reference 12. The conclusions drew from the signaling and the models seemed contradictory and puzzling. Besides, it is not clear why the protein level of p65 was decreased at 10' and 30'. Was it attributed to the degradation of p65 or experimental variation? 

      We thank the reviewer for this insightful comment and apologize for not previously explaining the implications of the observed decrease in NF-κB activation. We found a decrease in NF-κB activation in response to LPS + IFN-γ stimulation in macrophages lacking RAS-PI3K interaction. As the reviewer pointed out, NF-κB is a key transcription factor that regulates the expression of various proinflammatory genes. To better characterize whether the decrease in p-p65 would lead to a reduction in the expression of specific cytokines, we performed a cytokine array using unstimulated and LPS + IFN-γ stimulated macrophages. The results indicated a small number of cytokines with altered expression, validating that RAS-p110α activation of p-p65 regulates the expression of some inflammatory cytokines. These results have been added to the manuscript and to Figure 1 (panels C and D). In brief, the data suggest an impairment in recruitment factors and inflammatory regulators following the disruption of RAS-p110α signaling in macrophages, which aligns with the observed in vivo phenotype. 

      Our findings indicate that the disruption of RAS-p110α signaling has a complex and multifaceted role in BMDMs. Specifically, monocytes lacking RAS-PI3K are unable to reach the inflamed area due to an impaired ability to extravasate, caused by altered actin cytoskeleton dynamics. Consequently, inflammation is sustained over time, continuously releasing inflammatory mediators. Moreover, we have shown that macrophages deficient in RAS-p110α interaction fail to mount a full inflammatory response due to decreased activation of p-p65, leading to reduced production of a set of inflammatory regulators. Additionally, these macrophages are unable to effectively process phagocytosed material and activate the resolutive phase of inflammation. As a result of these defects, an exacerbated and sustained inflammatory response occurs. 

      Our in vivo data, showing an increase in systemic inflammatory mediators, might be a consequence of the accumulation of monocytes produced by bone marrow progenitors in response to sensed inflammatory stimuli, but unable to extravasate.

      Regarding the sentence in the introduction: "this disruption leads to a change in macrophage polarization, favoring a more proinflammatory M1 state" (reference 12), this was observed in an oncogenic context, which might differ from the role of RAS-p110α in a non-oncogenic situation, as analyzed in this work. We introduced these results as an example to establish the role of RAS-p110α in macrophages, demonstrating its participation in macrophage-dependent responses. Together with our study, these findings clearly indicate that p110α signaling is critical when analyzing full immune responses. Previously, little was known about the role of this PI3K isoform in immune responses. Our data, along with those presented by Murillo et al. (ref. 12), demonstrate that p110α plays a significant role in macrophage function in both oncogenic and inflammatory contexts. Additionally, our results suggest that this role is complex and multifaceted, warranting further investigation to fully understand the complexity of p110α signaling in macrophages.

      Regarding the decreased levels of p65 at 10’ and 30’ in RBD cells, we are still uncertain about the molecular mechanism underlying the observed decrease. No changes in p65 mRNA levels were observed after 30 minutes of LPS + IFN-γ treatment, as shown in Author response image 1.

      Author response image 1.

      Preliminary data not shown here suggest that treating macrophages with BYL exhibits a similar effect, indicating a potential pathway for investigation. Considering that the decrease in protein levels is not due to lower mRNA expression, we may infer that post-translational mechanisms are leading to early protein degradation in RAS-p110α deficient macrophages. This could explain the observed decrease in protein activation. However, the specific molecular mechanism responsible for this degradation remains unclear, and further research is necessary to elucidate it. 

      (2) In Fig 3, the authors used bone-marrow derived macrophages (BMDMs) instead of isolated monocytes to evaluate the ability of monocyte transendothelial migration, which is not sufficiently convincing. In Fig. 3B, the authors evaluated the migration in Pik3caWT/- BMDMs, and Pik3caWT/WT BMDMs treated with BYL-719'. Given that the dose effect of gene expression, the best control is Pik3caWT/- BMDMs treated with BYL-719. 

      We thank the reviewer for this comment. We agree that using BMDMs is not the most conventional approach for studying monocyte migration, but there were several reasons why we still considered them a valid model. Although isolated monocytes are the initial cell type involved in transendothelial migration, bone marrow-derived macrophages (BMDMs) provide a relevant and practical model for studying this process. BMDMs are differentiated from the same bone marrow precursors as monocytes and retain the ability to respond to chemotactic signals, adhere to endothelial cells, and migrate through the endothelium. This makes them a suitable tool for examining the cellular and molecular mechanisms underlying monocyte migration and subsequent macrophage infiltration into tissues. Additionally, BMDMs offer experimental consistency and are easier to manipulate in vitro, enabling more controlled and reproducible studies. 

      In response to the comment regarding Fig. 3B, we appreciate the suggestion to use Pik3ca WT/- BMDMs treated with BYL-719 as a control. However, our rationale for using Pik3ca WT/WT BMDMs treated with BYL-719 was based on a conceptual approach rather than a purely experimental control. The BYL-719 treatment in Pik3ca WT/WT cells was intended to simulate the inhibition of p110α in a fully functional, wild-type context. This allows us to directly assess the impact of p110α inhibition under normal physiological conditions, which is more representative of what would occur in an organism where the full dose of Pik3ca is present. Using Pik3ca WT/- BMDMs treated with BYL-719 as a control may not accurately reflect the in vivo scenario, where any therapeutic intervention would likely occur in the context of a fully functional, wild-type background. Our approach aims to provide a clearer understanding of how p110α inhibition affects cell functionality in a wild-type setting, which is relevant for potential therapeutic applications. Therefore, we considered the use of Pik3ca WT/WT BMDMs with BYL-719 treatment to be a more appropriate control for testing the effects of p110α inhibition in normal conditions.

      (3) In Fig. 4E-4G, the authors observed that elevated levels of serine 3 phosphorylated Cofilin in Pik3caRBD/- BMDMs both in unstimulated and in proinflammatory conditions, and phosphorylation of Cofilin at Ser3 increase actin stabilization, it is not clear why disruption of RAS-p110α binding caused a decrease in the F-actin pool in unstimulated BMDMs? 

      We thank the reviewer for this insightful comment. During the review process, we have carefully quantified all the Western blots conducted. While we did observe an increase in phospho-Cofilin (Ser3) levels in RBD BMDMs, this increase did not reach statistical significance. As a result, we cannot confidently attribute the observed change in F-actin to this proposed mechanism. We apologize for any confusion this may have caused. Consequently, we have removed these data from Figure 4G and the associated discussion.

      Unfortunately, we have not yet identified the underlying mechanism responsible for this phenotype. Future experiments will focus on exploring potential alterations in other actin-nucleating, regulating, and stabilizing proteins that could account for the observed changes in F-actin levels.

      Reviewer #2 (Public Review): 

      Summary: 

      Cell intrinsic signaling pathways controlling the function of macrophages in inflammatory processes, including in response to infection, injury or in the resolution of inflammation are incompletely understood. In this study, Rosell et al. investigate the contribution of RAS-p110α signaling to macrophage activity. p110α is a ubiquitously expressed catalytic subunit of PI3K with previously described roles in multiple biological processes including in epithelial cell growth and survival, and carcinogenesis. While previous studies have already suggested a role for RAS-p110α signaling in macrophages function, the cell intrinsic impact of disrupting the interaction between RAS and p110α in this central myeloid cell subset is not known. 

      Strengths: 

      Exploiting a sound previously described genetically mouse model that allows tamoxifen-inducible disruption of the RAS-p110α pathway and using different readouts of macrophage activity in vitro and in vivo, the authors provide data consistent with their conclusion that alteration in RAS-p110α signaling impairs the function of macrophages in a cell intrinsic manner. The study is well designed, clearly written with overall high-quality figures. 

      Weaknesses: 

      My main concern is that for many of the readouts, the difference between wild-type and mutant macrophages in vitro or between wild-type and Pik3caRBD mice in vivo is rather modest, even if statistically significant (e.g. Figure 1A, 1C, 2A, 2F, 3B, 4B, 4C). In other cases, such as for the analysis of the H&E images (Figure 1D-E, S1E), the images are not quantified, and it is hard to appreciate what the phenotype in samples from Pik3caRBD mice is or whether this is consistently observed across different animals. Also, the authors claim there is a 'notable decrease' in Akt activation but 'no discernible chance' in ERK activation based on the western blot data presented in Figure 1A. I do not think the data shown supports this conclusion. 

      We appreciate the reviewer's careful examination of our data and their observation regarding the modest differences between wild-type and mutant macrophages in vitro, as well as between wild-type and Pik3caRBD mice in vivo. Although the differences observed in Figures 1A, 1C, 2A, 2F, 3B, 4B, and 4C are modest, they are statistically significant, and we consider them biologically relevant when interpreted within the specific nature of our model. Our study focuses on the disruption of the RAS-p110α interaction, but it should be noted that alternative pathways for p110α activation, independent of RAS, remain functional in this model. Additionally, the model retains the expression of other p110 isoforms, such as p110β, p110γ, and p110δ, which are known to have significant roles in immune responses. Given the overlapping functions of these p110 isoforms, and the fact that our model involves a subtle modification that specifically affects the RAS-p110α interaction without completely abrogating p110α activity, it is understandable that only modest effects are observed in some readouts. The redundancy and compensation by other p110 isoforms likely mitigate the impact of disrupting RAS-mediated p110α activation.

      However, despite these modest in vitro differences, it is crucial to highlight that the in vivo effects on inflammation are both clear and consistent. The persistence of inflammation in our model suggests that the RAS-p110α interaction plays a specific, non-redundant role in resolving inflammation, which cannot be fully compensated by other signaling pathways or p110 isoforms. These findings underscore the importance of RAS-p110α signaling in immune homeostasis and suggest that even subtle disruptions in this pathway can lead to significant physiological consequences over time, particularly in the context of inflammation. The modest differences observed may represent early or subtle alterations that could lead to more pronounced phenotypes under specific stress or stimulation conditions. This can be seen across the figures mentioned. For instance, in Fig. 1A, the Western blot for AKT has been quantified, demonstrating a significant decrease in AKT levels; in Fig. 1C, although the difference in paw inflammation was only a few millimeters in thickness, considering the size of a mouse paw, those millimeters were very noticeable by eye. Furthermore, pathological examination of the tissue consistently showed an increase in inflammation in RBD mice. In addition, the consistency of the observed differences across different readouts and experimental setups reinforces the reliability and robustness of our findings. Even modest changes that are consistently observed across different assays and conditions are indicative of genuine biological effects. The statistical significance of the differences indicates that they are unlikely to be due to random variation. This statistical rigor supports the conclusion that the observed effects, albeit modest, are real and warrant further exploration.

      Regarding the analysis of H&E images, we have now quantified the changes with the assistance of the pathologist, Mª Carmen García Macías, who has been added to the author list. We removed the colored arrows from the images and instead quantified fibrin and chromatin remnants as markers of inflammation staging. Loose chromatin, which increases as a consequence of cell death, is higher in the early phases of inflammation and decreases as macrophages phagocytose cell debris to initiate tissue healing. Chromatin content was scored on a scale from 1 to 3, where 1 represents the lowest amount and 3 the highest. The scoring was based on the area within the acute inflammatory abscess where chromatin could be found: 3 for less than 30%, 2 for 30-60%, and 1 for over 60%. Graphs corresponding to this quantification have now been added to Figure 1 and an explanation of the scale has been added to Material and Methods. 

      To further substantiate the extent of macrophage function alteration upon disruption of RAS-p110α signaling, the manuscript would benefit from testing macrophage activity in vitro and in vivo across other key macrophage activities such as bacteria phagocytosis, cytokine/chemokine production in response to titrating amounts of different PAMPs, inflammasome function, etc. This would be generally important overall but also useful to determine whether the defects in monocyte motility or macrophage lysosomal function are selectively controlled downstream of RAS-p110α signaling.  

      We thank reviewer #2 for this comment. In order to better address the role of RAS-PI3K in macrophage function, we have performed some additional experiments, some of which have been added to the revised version of the manuscript. 

      (1) We have performed cytokine microarrays of RAS-p110α deficient macrophages, both unstimulated and stimulated with LPS + IFN-γ. Results have been added to the manuscript and to Supplementary Figure S1E and S1F. In brief, the data obtained suggest an impairment in recruitment factors, as well as in inflammatory regulators, after disruption of RAS-p110α signaling in macrophages, which aligns with the phenotype observed in vivo. 

      (2) We also conducted phagocytosis assays to analyze the ability of RAS-p110α deficient macrophages to phagocytose 1 µm Sepharose beads, Borrelia burgdorferi, and apoptotic cells. The data reveal varied behavior of RAS-p110α deficient bone marrow-derived macrophages (BMDMs) depending on the target: 

      • Engulfment of Non-biological Particles: RAS-p110α deficient macrophages showed a decreased ability to engulf 1 µm Sepharose beads. This suggests that RAS-p110α signaling is important for the effective phagocytosis of non-biological particles. These findings have now been added to the text and figures have been added to supplementary Fig. S4A.

      • Response to Bacterial Pathogens: When exposed to Borrelia burgdorferi, RAS-p110α deficient macrophages did not exhibit a change in bacterial uptake. This indicates that RAS-p110α may not play a critical role in the initial phagocytosis of this bacterial pathogen. The observed increase in the phagocytic index, although not statistically significant, might imply a compensatory mechanism or a more complex interaction that warrants further investigation. These findings have now been added to the text and figures have been added to supplementary Fig. S4B. These experiments were performed in collaboration with Dr. Anguita, from CIC bioGUNE (Bilbao, Spain) and, as a consequence, he has been added as an author of the paper. 

      • Phagocytosis of Apoptotic Cells: There were no differences in the phagocytosis rate of apoptotic cells between RAS-p110α deficient and control macrophages at early time points. However, the accumulation of engulfed material at later time points suggests a possible delay in the processing and degradation of apoptotic cells in the absence of RAS-p110α signaling.

      These findings highlight the complexity of RAS-p110α's involvement in phagocytic processes and suggest that its role may vary with different types of phagocytic targets. 

      Furthermore, given the key role of other myeloid cells besides macrophages in inflammation and immunity it remains unclear whether the phenotype observed in vivo can be attributed to impaired macrophage function. Is the function of neutrophils, dendritic cells or other key innate immune cells not affected? 

      Thank you for this insightful comment. We understand the key role of other myeloid cells in inflammation and immunity. However, our study specifically focuses on the role of macrophages. Our data show that disruption of RAS-PI3K leads to a clear defect in macrophage extravasation, and our in vitro data demonstrate issues in macrophage cytoskeleton and phagocytosis, aligning with the in vivo phenotype.

      Experiments investigating the role of RAS-PI3K in neutrophils, dendritic cells, or other innate immune cells are beyond the scope of this study. Understanding these interactions would indeed require separate, comprehensive studies and the generation of new mouse models to disrupt RAS-PI3K exclusively in specific cell types.

      Furthermore, during paw inflammation experiments, polymorphonuclear cells were present from the initial phases of the inflammatory response. What caught our attention was the prolonged presence of these cells. In conversation with our in-house pathologist, she mentioned the lack of macrophages to remove dead polymorphonuclear cells in our RAS-PI3K mutant mice. Specific staining for macrophages confirmed the absence of macrophages in the inflamed node of mutant mice.

      We acknowledge that further research is necessary to elucidate the effects on other myeloid cells. However, our current findings provide clear evidence of a decrease in inflammatory monocytes and defective macrophage responses to inflammation, both in vivo and in vitro. We believe these results significantly contribute to understanding the role of RAS-PI3K in macrophage function during inflammation.

      Compelling proof of concept data that targeting RAS-p110α signalling constitutes indeed a putative approach for modulation of chronic inflammation is lacking. Addressing this further would increase the conceptual advance of the manuscript and provide extra support to the authors' suggestion that p110α inhibition or activation constitute promising approaches to manage inflammation. 

      We thank Reviewer #2 for this insightful comment. In our manuscript, we have demonstrated through multiple experiments that the inhibition of p110α, either by disrupting RAS-p110α signaling or through the use of Alpelisib (BYL-719), has a modulatory effect on inflammatory responses. However, we acknowledge that we have not activated the pathway due to the unavailability of a suitable p110α activator until the concluding phase of our study.

      We recognize the importance of this point and are eager to investigate both the inhibition and activation of p110α as potential approaches to managing inflammation in well-established inflammatory disease models. We believe that such comprehensive studies would significantly enhance the conceptual advance and translational relevance of our findings.

      However, it is essential to note that the primary aim of our current work was to demonstrate the role of RAS-p110α in the inflammatory responses of macrophages. We have successfully shown that RAS-p110α influences macrophage behavior and inflammatory signaling. Expanding the scope to include disease models and pathway activation studies would be an extensive project that goes beyond the current objectives of this manuscript. While our present study establishes the foundational role of RAS-p110α in macrophage-mediated inflammatory responses, we agree that further investigation into both p110α inhibition and activation in disease models is crucial. We are keen to pursue this line of research in future studies, which we believe will provide robust evidence supporting the therapeutic potential of targeting RAS-p110α signaling in chronic inflammation.

      Finally, the analysis by FACS should also include information about the total number of cells, not just the percentage, which is affected by the relative change in other populations. On this point, Figure S2B shows a substantial, albeit not significant (with less number of mice analysed), increase in the percentage of CD3+ cells. Is there an increase in the absolute number of T cells or does this apparent relative increase reflect a reduction in myeloid cells? 

      We thank the reviewer for this comment, which we have addressed in the revised version of the manuscript. Regarding the total number of cells analyzed, we have added to the Materials and Methods section that in all our studies, a total of 50,000 cells were analyzed (line 749). The percentages of cells are related to these 50,000 events. Additionally, we have increased the number of mice analyzed by including new mice for CD3+ cell analysis. Despite this, the results remain not significant.
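      The reviewer's concern about percentages versus absolute numbers can be illustrated with a small arithmetic sketch. All cell counts below are hypothetical and chosen only to show the effect; they are not taken from our data.

```python
def percent_of_events(counts):
    """Convert absolute counts per population into percentages of all events."""
    total = sum(counts.values())
    return {name: 100.0 * n / total for name, n in counts.items()}


# Hypothetical absolute counts: the CD3+ count is identical in both samples,
# but the second sample contains fewer myeloid cells.
baseline = {"CD3+": 10_000, "myeloid": 30_000, "other": 10_000}
fewer_myeloid = {"CD3+": 10_000, "myeloid": 20_000, "other": 10_000}

baseline_pct = percent_of_events(baseline)
fewer_myeloid_pct = percent_of_events(fewer_myeloid)
```

      In this toy case the CD3+ percentage rises from 20% to 25% although the absolute number of CD3+ cells is unchanged, which is why a percentage of a fixed number of acquired events cannot by itself distinguish a gain in T cells from a loss of myeloid cells.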

      Recommendations for the authors:  

      Reviewer #1 (Recommendations For The Authors):   

      (1) It is recommended to provide a graphical abstract to summarize the multiple functions of RAS-p110α pathway in monocyte/macrophages that the authors proposed 

      We thank the reviewer for this useful recommendation. A graphical abstract has now been added to the study. 

      (2) Western blots in this paper need quantification and a measure of reproducibility 

      We have now added a graph with the quantification of the western blots performed in this work as a measure of reproducibility. 

      (3) Representative flow data and gating strategy should be included

      We have now added a description of the gating strategy to the Materials and Methods section.

    1. Author response:

      The following is the authors’ response to the original reviews.

      Reviewer 1:

      (1) Peptides were synthesized with fluorescein isothiocyanate (FITC) and Tat tag, and then PEGylated with methoxy PEG Succinimidyl Succinate.

      I have two concerns about the peptide design. First, FITC was intended "for monitoring" (line 129), but was never used in the manuscript. Second, PEGylation targets the two lysine sidechains on the Tat, which would alter its penetration property.

      We conducted an analysis of the cellular trafficking of FITC-tagged peptides following their permeabilization into cells.

      Author response image 1.

      However, we did not include it in the main text because it is a basic result.

      (2) As can be seen in the figure above, after pegylation and permeabilization, the cells were stained with FITC. It appears that this does not affect the ability to penetrate into the cells.

      (2) "Superdex 200 increase 10/300 GL column" (line 437) was used to isolate mono/di PEGylated PDZ and separate them from the residual PEG and PDZ peptide. "m-PEG-succinimidyl succinate with an average molecular weight of 5000 Da" (lines 133 and 134).

      To my knowledge, the Superdex 200 increase 10/300 GL column is not suitable and is unlikely to produce traces shown in Figure 1B.

      As Superdex 200 increase 10/300 GL features a fractionation range of 10,000 to 600,000 Da, we used it to fractionate PEGylated products including DiPEGylated PDZ (approx. 15 kDa) and MonoPEGylated PDZ (approx. 10 kDa) from residuals (PDZ and PEG), demonstrating successful isolation of PEGylated products (Figure 1C). Considering the molecular weights of PDZ and PEG are approximately 4.1 kDa and 5.0 kDa, respectively, the late eluting peaks from SEC were likely to represent a mixed absorbance of PDZ and PEG at 215 nm.

      However, as the reviewer pointed out, it could be unreasonable to annotate peaks representing PDZ and PEG, respectively, from mixed absorbance detected in a region (11-12 min) beyond the fractionation range.

      In our revised manuscript, therefore, multiple peaks in the late eluting volume (11-12 min) were labeled as 'Residuals' all together. As a reference, the revised figure 1B includes a chromatogram of pure PDZ-WT under the same analytic condition.

      Therefore, we replaced Fig. 1B with the new results as follows:

      (3) "the in vivo survival effect of LPS and PDZ co-administration was examined in mice. The pretreatment with WT PDZ peptide significantly increased survival and rescued compared to LPS only; these effects were not observed with the mut PDZ peptide (Figure 2a)." (lines 159-160).

      Fig 2a is the weight curve only. The data is missing in the manuscript.

      We added the survival curve to Fig. 2A as follows:

      (4) Table 1, peptide treatment on ALT and AST appears minor.

      In mice treated with LPS, levels of ALT and AST in the blood are elevated, but these levels decrease upon treatment with WT PDZ. However, the use of mut PDZ does not result in significant changes. Figure 3A shows inflammatory cells within the central vein, yet no substantial hepatotoxicity is observed during the 5-day treatment with LPS. Normally, the ranges of ALT and AST in C57BL6 mice are 16 ~ 200 U/L and 46 ~ 221 U/L, respectively, according to UCLA Diagnostic Labs. Therefore, the values in all experiments fall within these normal ranges. In summary, a 5-day treatment with LPS induces inflammation in the liver but is too short a duration to induce hepatotoxicity, resulting in lower values.

      (5) MitoTracker Green FM shouldn't produce red images in Figure 6.

      We replaced Figs 6A and B with the new (green) results as follows:

      (6) Figure 5. Comparison of mRNA expression in PDZ-treated BEAS-2B cells. Needs a clearer and more detailed description both in the main text and figure legend. The current version is very hard to read.

      We replaced Fig. 5A with a new version that is easier to understand, and added more detailed results and a more detailed figure legend, as follows:

      Results Section in Figure 5:

      “…we performed RNA sequencing analysis. The results of RNA-seq analysis showed the expression pattern of 24,424 genes according to each comparison combination, of which the results showed the similarity of 51 genes overlapping in 4 gene categories and the similarity between each comparison combination (Figure 5a). As a result, compared to the control group, it was confirmed that LPS alone, WT PDZ+LPS, and mut PDZ+LPS were all upregulated above the average value in each gene, and when LPS treatment alone was compared with WT PDZ+LPS, it was confirmed that they were averaged or downregulated. When comparing LPS treatment alone and mut PDZ+LPS, it was confirmed that about half of the genes were upregulated. Regarding the similarity between comparison combinations, the comparison combination with LPS…”

      Figure 5 Legend Section:

      “Figure 5. Comparison of mRNA expression in PDZ-treated BEAS-2B cells.

      BEAS-2B cells were treated with wild-type PDZ or mutant PDZ peptide for 24 h and then incubated with LPS for 2 h, after which RNA sequencing analysis was performed. (a) The heat map shows the general regulation pattern of about 51 inflammation-related genes that are differentially expressed when WT PDZ and mut PDZ are treated with LPS, an inflammatory substance. All samples are RED = upregulated and BLUE = downregulated relative to the gene average. Each row represents a gene, and the columns represent the values of the control group treated only with LPS and the WT PDZ and mut PDZ groups with LPS. This was used by converting each log value into a fold change value. All genes were adjusted to have the same mean and standard deviation, the unit of change is the standard deviation from the mean, and the color value range of each row is the same. (b) Significant genes were selected using Gene category chart (Fold change value of 2.00 and normalized data (log2) value of 4.00). The above pie chart shows the distribution of four gene categories when comparing LPS versus control, WT PDZ+LPS/LPS, and mut PDZ+LPS/LPS. The bar graph below shows RED=upregulated, GREEN=downregulated for each gene category, and shows the number of upregulated and downregulated genes in each gene category. (c) The protein-protein interaction network constructed by the STRING database differentially displays commonly occurring genes by comparing WT PDZ+LPS/LPS, mut PDZ+LPS/LPS, and LPS. These nodes represent proteins associated with inflammation, and these connecting lines denote interactions between two proteins. Different line thicknesses indicate types of evidence used in predicting the associations.”

      Reviewer 2:

      (1) In this paper, the authors demonstrated the anti-inflammatory effect of PDZ peptide by inhibition of NF-kB signaling. Are there any results on the PDZ peptide-binding proteins (directly or indirectly) that can regulate LPS-induced inflammatory signaling pathway? Elucidation of the PDZ peptide-its binding partner protein and regulatory mechanisms will strengthen the author's hypothesis about the anti-inflammatory effects of PDZ peptide

      As mentioned in the Discussion section, we believe it is crucial to identify proteins that directly interact with PDZ and regulate it. This direct interaction can modulate intracellular signaling pathways, so we plan to express GST-PDZ, allow it to bind proteins in cellular lysates, and then characterize the bound proteins by LC-MS/MS. We intend to pursue these findings further and submit them for publication.

      (2) The authors presented interesting insights into the therapeutic role of the PDZ motif peptide of ZO-1. PDZ domains are protein-protein interaction modules found in a variety of species. It has been thought that many cellular and biological functions, especially those involving signal transduction complexes, are affected by PDZ-mediated interactions. What is the rationale for selecting the core sequence that regulates inflammation among the PDZ motifs of ZO-1 shown in Figure 1A?

      The rationale for selecting the core sequence that regulates inflammation among the PDZ motifs of ZO-1, as shown in Figure 1A, is grounded in the specific roles these motifs play in signal transduction pathways that are crucial for inflammatory processes. PDZ domains are recognized for their ability to function as scaffolding proteins that organize signal transduction complexes, crucial for modulating cellular and biological functions. The chosen core sequence is particularly important because it is conserved across ZO-1, ZO-2, and ZO-3, indicating a fundamental role in maintaining cellular integrity and signaling pathways. This conservation suggests that the sequence’s involvement in inflammatory regulation is not only significant in ZO-1 but also reflects a broader biological function across the ZO family.

      (3) In Figure 3, the authors showed the representative images of IHC, please add the quantification analysis of Iba1 expression and PAS-positive cells using Image J or other software. To help understand the figure, an indication is needed to distinguish specifically stained cells (for example, a dotted line or an arrow).

      We added the semi-quantitative results to Figs. 4d, e, and f as follows:

      Result section: “The specific physiological mechanism by which WT PDZ peptide decreases LPS-induced systemic inflammation in mice and the signal molecules involved remain unclear. These were confirmed by a semi-quantitative analysis of Iba-1 immunoreactivity and PAS staining in liver, kidney, and lung, respectively (Figures 4d, e, and f). To examine whether WT PDZ peptide can alter LPS-induced tissue damage in the kidney, cell toxicity assay was performed (Figure 3g). LPS induced cell damage in the kidney, however, WT PDZ peptide could significantly alleviate the toxicity, but mut PDZ peptide could not. Because cytotoxicity caused by LPS is frequently due to ROS production in the kidney (Su et al., 2023; Qiongyue et al., 2022), ROS production in the mitochondria was investigated in renal mitochondria cells harvested from kidney tissue (Figure 3h)....”

      Figure legend section: “Indicated scale bars were 20 μm. (d,e,f) Semi-quantitative analysis of each are positive for Iba-1 in liver and kidney, and positive cells of PAS in lung, respectively. (g) After the kidneys were harvested, tissue lysates were used for MTT assay. (h) After...”

      (4) In Figure 6G, H, the authors confirmed the change in expression of the M2 markers by PDZ peptide using the mouse monocyte cell line Raw264.7. It would be good to add an experiment on changes in M1 and M2 markers caused by PDZ peptides in human monocyte cells (for example, THP-1).

      We thank you for your comments. To determine whether PDZ peptide regulates M1/M2 polarization in human monocytes, we examined changes in M1 and M2 gene expression in THP-1 cells. As a result, wild-type PDZ significantly suppressed the expression of M1 marker genes (hIL-1β, hIL-6, hIL-8, hTNF-α), while increasing the expression of M2 marker genes (hIL-4, hIL-10, hMRC-1). However, mutant PDZ did not affect M1/M2 polarization. These results suggest that PDZ peptide can suppress inflammation by regulating M1/M2 polarization of human monocyte cells. These results are for the reviewer's reference only and will not be included in the main content.

      Author response image 2.

      Author response image 3.

      Minor point:

      The use of language is appropriate, with good writing skills. Nevertheless, a thorough proofread would eliminate small mistakes such as:

      - line 254, " mut PDZ+LPS/LPS (45.75%) " → " mut PDZ+LPS/LPS (47.75%) "

      - line 296, " Figure 6f " → " Figure 6h "

      We have corrected these points in the manuscript.

    1. Author response:

      We have outlined a clear plan to revise and strengthen the manuscript by addressing key experimental concerns raised in the public reviews.

      Summary of Planned Revisions:

      We intend to address the following points through new experiments or additional analyses:

      Reviewer #1, Concern 2: “CRFR1 expression is largely confined to a subpopulation of striatal CINs in rats. Is this also true in mice?”

      To address this, we will obtain CRFR1-GFP mice and perform immunohistochemistry for ChAT to assess the overlap between CRFR1-GFP+ neurons and CINs in the dorsal striatum. This will allow us to directly determine whether CRFR1 expression is similarly restricted in mice as it is in rats.

      Reviewer #1, Concern 3: “In rats, ~30% of CINs express CRFR1. Did a similar proportion of CINs in mice respond to CRF application?”

      We will revisit and re-analyze our electrophysiological dataset to calculate the percentage of recorded CINs in mice that respond to bath-applied CRF. Our preliminary analysis suggests a higher response rate (>90%), and we will reconcile this with expression data, discuss possible mechanisms (e.g., indirect effects or species-specific differences), and provide a clear explanation in the revised manuscript.

      Reviewer #2, Recommendation 5: “Can the authors quantify the onset delay of optogenetic responses from CRF+ axons onto CINs?”

      We initially performed this experiment in a single animal. To strengthen our conclusion of monosynaptic connectivity, we will increase the sample size (additional injections in CRF-Cre mice) and quantify the onset latency of optogenetically evoked responses in CINs.

      Reviewer #2, Recommendation 7: “Are CRFR1+ CINs equally distributed in DMS vs. DLS?”

      We will re-analyze existing immunohistochemical images from Figure 4 to compare the density (cells/µm²) of CRFR1+ CINs in the dorsomedial vs. dorsolateral striatum. This analysis will help clarify whether there is a regional bias in CRFR1 expression across striatal subdomains.

      Reviewer #3, Recommendation 1: “Test whether CRFR1 mediates the effect of optogenetic stimulation on CIN firing.”

      We will directly test CRFR1-dependence of optogenetically evoked CIN excitation by applying a CRFR1 antagonist during optical stimulation of CRF+ terminals and evaluating the effect on CIN firing. This will clarify whether the CRF effect is receptor-mediated and strengthen the interpretation of our functional findings.

      We may conduct additional experiments to address other concerns. These targeted experiments will significantly enhance the rigor and mechanistic insight of our study.

    1. Author response:

      Reviewer #1 (Public review):

      Summary:

      The aim of this paper is to develop a simple method to quantify fluctuations in the partitioning of cellular elements. In particular, they propose a flow-cytometry-based method coupled with a simple mathematical theory as an alternative to conventional imaging-based approaches.

      Strengths:

      The approach they develop is simple to understand and its use with flow-cytometry measurements is clearly explained. Understanding how the fluctuations in the cytoplasm partition vary for different kinds of cells is particularly interesting.

      Weaknesses:

      The theory only considers fluctuations due to cellular division events. This seems a large weakness because it is well known that fluctuations in cellular components are largely affected by various intrinsic and extrinsic sources of noise and only under particular conditions does partitioning noise become the dominant source of noise.

      We thank the Reviewer for her/his evaluation of our manuscript. The point raised is indeed a crucial one. In a cell division cycle, there are at least three distinct sources of noise that affect component numbers [1] : 

      (1) Gene expression and degradation, which determine component numbers fluctuations during cell growth.

      (2) Variability in cell division time, which depending on the underlying model may or may not be a function of protein level and gene expression.

      (3) Noise in the partitioning/inheritance of components between mother and daughter cells.

      Our approach specifically addresses the latter, with the goal of providing a quantitative measure of this noise source. For this reason, in the present work, we consider homogeneous cancer cell populations that could be considered to be stationary from a population point-of-view. By tracking the time evolution of the distribution of tagged components via live fluorescent markers, we aim at isolating partitioning noise effects. However, as noted by the Reviewer, other sources of noise are present, and depending on the considered system the relative contributions of the different sources may change. Thus, we agree that a quantification of the effect of the various noise sources on the accuracy of our measurements will improve the reliability of our method. 

      In this respect, assuming independence between noise sources, we reasoned that variability in cell cycle length would affect the timing of population emergence but not the intrinsic properties of those populations (e.g., Gaussian variance). To test this hypothesis, we conducted a preliminary set of simulations in which cell division times were drawn from an Erlang distribution (mean = 18 h, k = 4). The results, showing the behavior of the mean and variance of the component distributions across generations, are presented in Author response image 1. Under the assumption of independence between different noise sources, no significant effects were observed. Next, we plan to quantify the accuracy of our measurements in the presence of cross-talk between the various noise sources. As suggested, we will update the manuscript to include a more complete discussion on this topic and an evaluation of our model’s stability.

      Author response image 1.

      Variance and mean of the distribution of fluorescence intensity as a function of the generation for a time-course dynamic with cell-cycle length variability. We repeated the same simulations as in Figure 1 of the manuscript, but introduced a variable division time for each cell. The division time of each cell is drawn from an Erlang distribution (mean = 18 h and k = 4). As the plots show, the results of our theoretical framework are not affected by the introduction of this variability. Hence, the Gaussian Mixture Model is still able to give the correct results even in a noisy environment.
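      The independence argument above can be illustrated with a minimal sketch. It is not the simulation code used for the response image: the function name `simulate`, the founder intensity (mean 1000, SD 50), the Gaussian partitioning fraction and the 60 h observation time are all illustrative assumptions. Only the Erlang parameters (mean = 18 h, k = 4) come from the text. For simplicity the sketch follows one random lineage per founder cell rather than a full population.

```python
import random
import statistics


def simulate(n_cells=2000, t_end=60.0, mean_cycle=18.0, k=4,
             sigma_f=0.05, variable_cycle=True, seed=0):
    """Follow one random lineage per founder cell up to time t_end.

    Returns {generation: (mean, variance, count)} of the inherited
    fluorescence. All numeric defaults except mean_cycle and k are
    illustrative assumptions.
    """
    rng = random.Random(seed)
    by_generation = {}
    for _ in range(n_cells):
        m = rng.gauss(1000.0, 50.0)      # narrow initial distribution (assumed)
        t, g = 0.0, 0
        while True:
            if variable_cycle:
                # Erlang(k, mean) is a gamma distribution with integer shape k
                tau = rng.gammavariate(k, mean_cycle / k)
            else:
                tau = mean_cycle         # deterministic cell cycle
            if t + tau > t_end:
                break
            t += tau
            # partitioning fraction drawn independently of the division time
            f = min(max(rng.gauss(0.5, sigma_f), 0.0), 1.0)
            m *= f
            g += 1
        by_generation.setdefault(g, []).append(m)
    return {g: (statistics.fmean(v), statistics.pvariance(v), len(v))
            for g, v in sorted(by_generation.items())}


if __name__ == "__main__":
    for label, flag in (("fixed cycle", False), ("Erlang cycle", True)):
        print(label)
        for g, (mean, var, n) in simulate(variable_cycle=flag).items():
            print(f"  generation {g}: n={n}, mean={mean:.1f}, CV^2={var / mean**2:.4f}")
```

      With a fixed cycle all cells sit in a single generation at the observation time, whereas Erlang-distributed cycles spread them over several generations. Under the stated independence assumption, the per-generation mean and squared CV should agree between the two runs up to sampling error.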

      (1) Soltani, Mohammad, et al. "Intercellular variability in protein levels from stochastic expression and noisy cell cycle processes." PLoS computational biology 12.8 (2016): e1004972.

      Reviewer #2 (Public review):

      Summary:

      The authors present a combined experimental and theoretical workflow to study partitioning noise arising during cell division. Such quantifications usually require time-lapse experiments, which are limited in throughput. To bypass these limitations, the authors propose to use flow-cytometry measurements instead and analyse them using a theoretical model of partitioning noise. The problem considered by the authors is relevant and the idea to use statistical models in combination with flow cytometry to boost statistical power is elegant. The authors demonstrate their approach using experimental flow cytometry measurements and validate their results using time-lapse microscopy. However, while I appreciate the overall goal and motivation of this work, I was not entirely convinced by the strength of this contribution. The approach focuses on a quite specific case, where the dynamics of the labelled component depend purely on partitioning. As such it seems incompatible with studying the partitioning noise of endogenous components that exhibit production/turnover. The description of the methods was partly hard to follow and should be improved. In addition, I have several technical comments, which I hope will be helpful to the authors.

      We are grateful to the Reviewer for her/his comments. Indeed, both partitioning noise and production/turnover noise are in general fundamental processes. At present, the only way to consider them together is through time-consuming and costly transfection/microscopy/tracking experiments. In this work, we aimed at developing a method to effectively pinpoint the first component, i.e. partitioning noise; thus, we opted to separate the two noise sources.

      Below, we provide a point-by-point response that we hope will clarify all raised concerns.

      Comments:

      (1) In the theoretical model, copy numbers are considered to be conserved across generations. As a consequence, concentrations will decrease over generations due to dilution. While this consideration seems plausible for the considered experimental system, it seems incompatible with components that exhibit production and turnover dynamics. I am therefore wondering about the applicability/scope of the presented approach and to what extent it can be used to study partitioning noise for endogenous components. As presented, the approach seems to be limited to a fairly small class of experiments/situations.

      We see the Reviewer's point. Indeed, we are proposing a high-throughput and robust procedure to measure the partitioning/inheritance noise of cell components through flow cytometry time courses. By using live-cell staining of cellular compounds, we can track the effect of partitioning noise on fluorescence intensity distribution across successive generations. This specific procedure is purposely optimized to isolate partitioning noise from other sources and, as it is, cannot track endogenous components or dyes that require fixation. While this certainly poses limits to the proposed approach, there are numerous contexts in which our methodology could be used to explore the role of asymmetric inheritance. Among others, (i) investigating how specific organelles are differentially partitioned and how this influences cellular behavior could provide deeper insights into fundamental biological processes: asymmetric segregation of organelles is a key factor in cell differentiation, aging, and stress response. During cell division, organelles such as mitochondria, the endoplasmic reticulum, lysosomes, peroxisomes, and centrosomes can be unequally distributed between daughter cells, leading to functional differences that influence their fate. For instance, Katajisto et al. [1] proposed that asymmetric division of mitochondria in stem cells is associated with the retention of stemness traits in one daughter cell and differentiation in the other. As organisms age, stem cells accumulate damage, and to prevent exhaustion and compromised tissue function, cells may use asymmetric inheritance to segregate older or damaged subcellular components into one daughter cell. (ii) Asymmetric division has also been linked to therapeutic resistance in Cancer Stem Cells [2]. Although the functional consequences are not yet fully determined, the asymmetric inheritance of mitochondria is recognized as playing a pivotal role [3].
Another potential application of our methodology may be (iii) the inheritance of lysosomes, which, together with mitochondria, appears to play a crucial role in determining the fate of human blood stem cells [4]. Furthermore, similar to studies conducted on liquid tumors [5][6], our approach could be extended to investigate cell growth dynamics and the origins of cell size homeostasis in adherent cells [7][8][9].  The aforementioned cases of study can be readily addressed using our approach that in general is applicable whenever live-cell dyes can be used. We will add a discussion of the strengths and limitations of the method in the Discussion section of the revised version of the manuscript. 

      (1) Katajisto, Pekka, et al. "Asymmetric apportioning of aged mitochondria between daughter cells is required for stemness." Science 348.6232 (2015): 340-343.

      (2) Hitomi, Masahiro, et al. "Asymmetric cell division promotes therapeutic resistance in glioblastoma stem cells." JCI insight 6.3 (2021): e130510.

      (3) García-Heredia, José Manuel, and Amancio Carnero. "Role of mitochondria in cancer stem cell resistance." Cells 9.7 (2020): 1693.

      (4) Loeffler, Dirk, et al. "Asymmetric organelle inheritance predicts human blood stem cell fate." Blood, The Journal of the American Society of Hematology 139.13 (2022): 2011-2023.

      (5) Miotto, Mattia, et al. "Determining cancer cells division strategy." arXiv preprint arXiv:2306.10905 (2023).

      (6) Miotto, Mattia, et al. "A size-dependent division strategy accounts for leukemia cell size heterogeneity." Communications Physics 7.1 (2024): 248.

      (7) Kussell, Edo, and Stanislas Leibler. "Phenotypic diversity, population growth, and information in fluctuating environments." Science 309.5743 (2005): 2075-2078.

      (8) McGranahan, Nicholas, and Charles Swanton. "Clonal heterogeneity and tumor evolution: past, present, and the future." Cell 168.4 (2017): 613-628.

      (9) De Martino, Andrea, Thomas Gueudré, and Mattia Miotto. "Exploration-exploitation tradeoffs dictate the optimal distributions of phenotypes for populations subject to fitness fluctuations." Physical Review E 99.1 (2019): 012417.

      (2) Similar to the previous comment, I am wondering what would happen in situations where the generations could not be as clearly identified as in the presented experimental system (e.g., due to variability in cell-cycle length/stage). In this case, it seems to be challenging to identify generations using a Gaussian Mixture Model. Can the authors comment on how to deal with such situations? In the abstract, the authors motivate their work by arguing that detecting cell divisions from microscopy is difficult, but doesn't their flow cytometry-based approach have a similar problem?

      The point raised is an important one, as it highlights the fundamental role of the gating strategy. The ability to identify the distribution of different generations using the Gaussian Mixture Model (GMM) strongly depends on the degree of overlap between distributions. The more the distributions overlap, the less capable we are of accurately separating them.

      The extent of overlap is influenced by the coefficients of variation (CV) of both the partitioning distribution function and the initial component distribution. Specifically, the component distribution at time t results from the convolution of the component distribution itself at time t−1 and the partitioning distribution function. Therefore, starting with a narrow initial component distribution allows for better separation of the generation peaks. The balance between partitioning asymmetry and the width of the initial component distribution is thus crucial.

      As shown in Author response image 2, increasing the CV of either distribution reduces the ability to distinguish between different generations.

      Author response image 2.

      Component distributions at varying CVs of the initial component distribution and of the partitioning distribution. Starting from a condition in which both the division asymmetry and the width of the initial component distribution are low and different generations are clearly separable, increasing either CV leads to mixing of the distributions and greater reconstruction difficulty.
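      The dependence of peak overlap on the two CVs can be sketched numerically. The example below is a rough proxy and not a Gaussian Mixture Model fit: it assumes a Gaussian initial distribution, a Gaussian partitioning fraction centred on 0.5, and a fixed threshold halfway between the expected means of generation 0 and generation 1. The function name `overlap_fraction` and all numeric values are hypothetical.

```python
import random


def overlap_fraction(cv_initial, cv_partition, n=20000, seed=1):
    """Fraction of cells on the wrong side of a midpoint threshold
    separating generation 0 from generation 1 (a crude overlap proxy)."""
    rng = random.Random(seed)
    mean0 = 1000.0                                   # assumed founder intensity
    gen0 = [rng.gauss(mean0, cv_initial * mean0) for _ in range(n)]
    # each daughter inherits a fraction f of its mother's content
    gen1 = [m * rng.gauss(0.5, cv_partition * 0.5) for m in gen0]
    threshold = 0.75 * mean0                         # midway between 1000 and 500
    wrong = sum(m < threshold for m in gen0) + sum(m > threshold for m in gen1)
    return wrong / (2 * n)


if __name__ == "__main__":
    for cv0, cvp in ((0.05, 0.05), (0.25, 0.05), (0.05, 0.25)):
        print(f"CV_initial={cv0}, CV_partition={cvp}: "
              f"overlap={overlap_fraction(cv0, cvp):.4f}")
```

      In this toy setting, widening either the initial distribution or the partitioning distribution increases the misassigned fraction, which is the qualitative behaviour described in the caption.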

      However, the variance of the initial distribution cannot be reduced arbitrarily. While selecting a narrow distribution facilitates a better reconstruction of the distributions, it simultaneously limits the number of cells available for the experiment. Therefore, for components exhibiting a high level of asymmetry, further narrowing of the initial distribution becomes experimentally impractical.

      In such cases, an approach previously tested on liquid tumors [1] involves applying the Gaussian Mixture Model (GMM) in two dimensions by co-staining another cellular component with lower division asymmetry.

      Regarding time-lapse fluorescence microscopy, the main challenge lies not in disentangling the interplay of different noise sources, but rather in obtaining sufficient statistical power from experimental data. While microscopy provides detailed insights into the division process and component partitioning, its low throughput limits large-scale statistical analyses. Current segmentation algorithms still perform poorly in crowded environments and with complex cell shapes, requiring a substantial portion of the image analysis pipeline to be performed manually, a process that is time-consuming and difficult to scale. In contrast, our cytometry-based approach bypasses this analysis bottleneck, as it enables a direct population-wide measurement of the system's evolution. We will provide a detailed discussion on these aspects in the revised version of the manuscript.

      (1) Peruzzi, Giovanna, et al. "Asymmetric binomial statistics explains organelle partitioning variance in cancer cell proliferation." Communications Physics 4.1 (2021): 188.

      (3) I could not find any formal definition of division asymmetry. Since this is the most important quantity of this paper, it should be defined clearly.

      We thank the Reviewer for the note. By division asymmetry we refer to a quantity that reflects how similar two daughter cells are likely to be in terms of inherited components after a division. We opted to measure it via the coefficient of variation (the square root of the variance divided by the mean) of the partitioning fraction distribution. We will add this definition to the revised version of the manuscript.
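      Written out, the definition given in words above reads as follows. The symbols are our own shorthand and are assumptions, not the manuscript's notation: f denotes the fraction of the mother's components inherited by a daughter, with m_daughter and m_mother the corresponding component amounts.

```latex
\[
  f = \frac{m_{\mathrm{daughter}}}{m_{\mathrm{mother}}},
  \qquad
  \mathrm{CV}_f = \frac{\sqrt{\operatorname{Var}(f)}}{\mathbb{E}[f]}
\]
```

      If the two daughters are treated as interchangeable, the fractions f and 1 - f are equally likely, so E[f] = 1/2 and the asymmetry reduces to twice the standard deviation of f.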

      (4) The description of the model is unclear/imprecise in several parts. For instance, it seems to me that the index "i" does not really refer to a cell in the population, but rather a subpopulation of cells that has undergone a certain number of divisions. Furthermore, why is the argument of Equation 11 suddenly the fraction f as opposed to the component number? I strongly recommend carefully rewriting and streamlining the model description and clearly defining all quantities and how they relate to each other.

      We are amending the text carefully to avoid double naming of variables and to clarify each step of the computation. In Equation 11 the variable f refers to the fluorescence intensity, but the notation will be changed to increase clarity.

      (5) Similarly, I was not able to follow the logic of Section D. I recommend carefully rewriting this section to make the rationale, logic, and conclusions clear to the reader.

      We will update the manuscript clarifying the scope of section D and its results. In brief, Section A presents a general model to derive the variance of the partitioning distribution from flow cytometry time-course data without making any assumptions about the shape of the distribution itself. In Section D, our goal is to interpret the origin of asymmetry and propose a possible form for the partitioning distribution. Since the dyes used bind non-specifically to cytoplasmic amines, the tagged proteins are expected to be uniformly distributed throughout the cytoplasm and present in large numbers. Given these assumptions the least complex model for division follows the binomial distribution, with a parameter that measures the bias in the process. Therefore, we performed a similar computation to that in Section A, which allows us to estimate not only the variance but also the degree of biased asymmetry. Finally, we fitted the data to this new model and proposed an experimental interpretation of the results.
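      As a rough illustration of the binomial picture, the sketch below draws the number of components inherited by one daughter from a binomial distribution with bias p and checks the textbook moments of the inherited fraction (mean p, variance p(1 - p)/N). This is a generic biased-binomial example under our own assumptions, not the fitting procedure of Section D; the function name `partition_fractions` and the values N = 200 and p = 0.6 are hypothetical.

```python
import random
import statistics


def partition_fractions(n_components, p, n_divisions=5000, seed=2):
    """Fraction of n_components inherited by one daughter when each
    component goes to that daughter independently with probability p."""
    rng = random.Random(seed)
    fractions = []
    for _ in range(n_divisions):
        inherited = sum(rng.random() < p for _ in range(n_components))
        fractions.append(inherited / n_components)
    return fractions


if __name__ == "__main__":
    n_components, p = 200, 0.6
    fr = partition_fractions(n_components, p)
    print("mean fraction:", statistics.fmean(fr), "expected:", p)
    print("variance:", statistics.pvariance(fr),
          "expected:", p * (1 - p) / n_components)
```

      Because the binomial variance of the fraction shrinks as 1/N, for abundant components the spread between daughters is governed mostly by the bias p rather than by counting noise, which is why a bias parameter is needed in addition to the variance.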

      (6) Much theoretical work has been done recently to couple cell-cycle variability to intracellular dynamics. While the authors neglect the latter for simplicity, it would be important to further discuss these approaches and why their simplified model is suitable for their particular experiments.

      We agree with the Reviewer, we will discuss this aspect in the revised version of the manuscript.

      (7) In the discussion the authors note that the microscopy-based estimates may lead to an overestimation of the fluctuations due to limited statistics. I could not follow that reasoning. Due to the gating in the flow cytometry measurements, I could imagine that the resulting populations are more stringently selected as compared to microscopy. Could that also be an explanation? More generally, it would be interesting to see how robust the results are in terms of different gating diameters.

      The Reviewer is right about the importance of the sorting procedure. As discussed in a previous point, the gating strategy we employed plays a fundamental role: it reduces the overlap of the fluorescence distributions as generations progress, enables the selection of an initial distribution distinct from the fluorescence background (allowing for longer tracking of proliferation), and synchronizes the initial population. The narrower the initial distribution, the more separated the peaks of different generations will be. However, this also results in a smaller number of cells available for the experiment, requiring a careful balance between precision and experimental feasibility. A similar procedure, although it would certainly limit the estimation error, would be impracticable in the case of microscopy. Indeed, the primary limitation and source of error there is the number of recorded events. Our pipeline allowed us to track on the order of hundreds of division events, but the analysis time scales non-linearly with the number of events, so significantly increasing the dataset would have been extremely time-consuming. Restricting the analysis to cells with similar fluorescence, although possible in principle, would have reduced the statistics to a level where the sampling error would dominate the measurement. Moreover, different experiments would have been hardly comparable, since different fluorescence intensities could correspond to equally sized cells. In light of these factors, we expect a higher CV for the microscopy measurements than for the flow cytometry ones. In the plots below, we show the behaviour of the mean and the standard deviation of N numbers sampled from a Gaussian distribution N(0,1) as a function of the sample size N. The higher N is, the closer the sampled distribution is to the true one. The region of hundreds of samples is still very noisy, and to do much better we would have to reach the order of thousands.
      We will add a discussion of these aspects in the revised version of the manuscript.

      Author response image 3.

      Standard deviation and mean value of a set of points sampled from a Gaussian distribution with mean 0 and standard deviation 1, versus the number of samples, N. Increasing N leads to a closer approximation of the expected values. The Microscopy Working Region (Microscopy WR), highlighted in orange, corresponds to the number of samples we are able to reach with microscopy experiments. The region highlighted in yellow is the one we would have to reach to lower the estimation error, which would however be very expensive in terms of analysis time.
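
      The behaviour summarized in this image can be reproduced with a short simulation. The sketch below is illustrative only: the sample sizes (100, 1000, 3000), the number of repeats and the helper names are assumptions, not the values used to draw the figure. It repeatedly draws N values from N(0,1) and reports how much the estimated mean and standard deviation scatter from one draw to the next.

```python
import math
import random


def mean_and_std(values):
    """Sample mean and (n - 1)-normalized sample standard deviation."""
    m = math.fsum(values) / len(values)
    var = math.fsum((v - m) ** 2 for v in values) / (len(values) - 1)
    return m, math.sqrt(var)


def estimate_spread(n_samples, n_repeats=100, seed=0):
    """Scatter of the estimated mean and std across repeated draws from N(0, 1)."""
    rng = random.Random(seed)
    means, stds = [], []
    for _ in range(n_repeats):
        draw = [rng.gauss(0.0, 1.0) for _ in range(n_samples)]
        m, s = mean_and_std(draw)
        means.append(m)
        stds.append(s)
    # The std of the estimates measures how noisy each estimator is.
    return mean_and_std(means)[1], mean_and_std(stds)[1]


# Hypothetical sample sizes spanning "hundreds" to "thousands" of events.
spread = {n: estimate_spread(n) for n in (100, 1000, 3000)}
for n, (mean_spread, std_spread) in spread.items():
    print(f"N={n}: scatter of mean = {mean_spread:.3f}, scatter of std = {std_spread:.3f}")
```

      For Gaussian data the scatter of the mean falls as 1/sqrt(N) and that of the standard deviation approximately as 1/sqrt(2N), so moving from hundreds to thousands of events reduces the sampling error by roughly a factor of three.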

      (8) It would be helpful to show flow cytometry plots including the identified subpopulations for all cell lines, currently, they are shown only for HCT116 cells. More generally, very little raw data is shown.

      We will provide the requested plots for the other cell lines together with additional raw data coming from simulations in the Supplementary Material. 

      (9) The title of the manuscript could be tailored more to the considered problem. At the moment it is very generic.

      We see the Reviewer's point. The proposed title aims to convey the wide applicability of the presented approach, which ultimately allows one to assess the level of fluctuations in the abundance of cellular components at division. This in turn reflects the asymmetry of the division.

    1. Author Response

      The following is the authors’ response to the original reviews.

      Reviewer #1 (Public Review):

      This work provides a new dataset of 71,688 images of different ape species across a variety of environmental and behavioral conditions, along with pose annotations per image. The authors demonstrate the value of their dataset by training pose estimation networks (HRNet-W48) on both their own dataset and other primate datasets (OpenMonkeyPose for monkeys, COCO for humans), ultimately showing that the model trained on their dataset had the best performance (performance measured by PCK and AUC). In addition to their ablation studies where they train pose estimation models with either specific species removed or a certain percentage of the images removed, they provide solid evidence that their large, specialized dataset is uniquely positioned to aid in the task of pose estimation for ape species.

      The diversity and size of the dataset make it particularly useful, as it covers a wide range of ape species and poses, making it particularly suitable for training off-the-shelf pose estimation networks or for contributing to the training of a large foundational pose estimation model. In conjunction with new tools focused on extracting behavioral dynamics from pose, this dataset can be especially useful in understanding the basis of ape behaviors using pose.

      We thank the reviewer for the kind comments.

      Since the dataset provided is the first large, public dataset of its kind exclusively for ape species, more details should be provided on how the data were annotated, as well as summaries of the dataset statistics. In addition, the authors should provide the full list of hyperparameters for each model that was used for evaluation (e.g., mmpose config files, textual descriptions of augmentation/optimization parameters).

      We have added more details on the annotation process and have included the list of instructions sent to the annotators. We have also included mmpose configs with the code provided. The following files include the relevant details:

      File including the list of instructions sent to the annotators: OpenMonkeyWild Photograph Rubric.pdf

      Mmpose configs:

      i) TopDownOAPDataset.py

      ii) animal_oap_dataset.py

      iii) init.py

      iv) hrnet_w48_oap_256x192_full.py

      Anaconda environment files:

      i) OpenApePose.yml

      ii) requirements.txt

      Overall this work is a terrific contribution to the field and is likely to have a significant impact on both computer vision and animal behavior.

      Strengths:

      • Open source dataset with excellent annotations on the format, as well as example code provided for working with it.

      • Properties of the dataset are mostly well described.

      • Comparison to pose estimation models trained on humans vs monkeys, finding that models trained on human data generalized better to apes than the ones trained on monkeys, in accordance with phylogenetic similarity. This provides evidence for an important consideration in the field: how well can we expect pose estimation models to generalize to new species when using data from closely or distantly related ones?

      • Sample efficiency experiments reflect an important property of pose estimation systems, which indicates how much data would be necessary to generate similar datasets in other species, as well as how much data may be required for fine-tuning these types of models (also characterized via ablation experiments where some species are left out).

      • The sample efficiency experiments also reveal important insights about scaling properties of different model architectures, finding that HRNet saturates in performance improvements as a function of dataset size sooner than other architectures like CPMs (even though HRNets still perform better overall).

      We thank the reviewer for the kind comments.

      Weaknesses:

      • More details on training hyperparameters used (preferably full config if trained via mmpose).

      We have now included mmpose configs and anaconda environment files that allow researchers to use the dataset with specific versions of mmpose and other packages we trained our models with. The list of files is provided above.

      • Should include dataset datasheet, as described in Gebru et al 2021 (arXiv:1803.09010).

      We have included a datasheet for our dataset in the appendix lines 621-764.

      • Should include crowdsourced annotation datasheet, as described in Diaz et al 2022 (arXiv:2206.08931). Alternatively, the specific instructions that were provided to Hive/annotators would be highly relevant to convey what annotation protocols were employed here.

      We have included the list of instructions sent to the Hive annotators in the supplementary materials. File: OpenMonkeyWild Photograph Rubric.pdf

      • Should include model cards, as described in Mitchell et al (arXiv:1810.03993).

      We have included a model card for the included model in the results section line 359. See Author response image 1.

      Author response image 1.

      • It would be useful to include more information on the source of the data as they are collected from many different sites and from many different individuals, some of which may introduce structural biases such as lighting conditions due to geography and time of year.

      We agree that the source could introduce structural biases. This is why we included images from so many different sources and captured images at different times from the same source—in hopes that a large variety of background and lighting conditions are represented. However, doing so limits our ability to document each source background and lighting condition separately.

      • Is there a reason not to use OKS? This incorporates several factors such as landmark visibility, scale, and landmark type-specific annotation variability as in Ronchi & Perona 2017 (arXiv:1707.05388). The latter (variability) could use the human pose values (for landmarks types that are shared), the least variable keypoint class in humans (eyes) as a conservative estimate of accuracy, or leverage a unique aspect of this work (crowdsourced annotations) which affords the ability to estimate these values empirically.

      The focus of this work is on overall keypoint localization accuracy, and hence we wanted a metric that is easy to interpret and implement. We therefore used PCK (Percentage of Correct Keypoints), a simple and widely used metric that measures the percentage of keypoints localized within a certain distance threshold of their corresponding ground-truth keypoints.
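
      As an illustration of the metric, a minimal PCK computation for a single image could look as follows. This is a sketch, not the evaluation code used for the paper: normalizing the threshold by a bounding-box length, the default threshold of 0.2 and the toy coordinates are all assumptions made for the example.

```python
import math


def pck(predicted, ground_truth, bbox_size, threshold=0.2):
    """Fraction of correctly localized keypoints for one image.

    predicted, ground_truth: lists of (x, y) tuples in pixels; a ground-truth
        entry of None marks an unannotated keypoint and is skipped.
    bbox_size: normalizing length in pixels (assumed here: longer bbox side).
    threshold: fraction of bbox_size within which a keypoint counts as correct.
    """
    correct = 0
    total = 0
    for pred, truth in zip(predicted, ground_truth):
        if truth is None:
            continue
        total += 1
        if math.dist(pred, truth) <= threshold * bbox_size:
            correct += 1
    return correct / total if total else float("nan")


# Toy example: three annotated keypoints, one unannotated.
truth = [(10.0, 10.0), (50.0, 40.0), (80.0, 90.0), None]
pred = [(12.0, 11.0), (50.0, 70.0), (81.0, 88.0), (0.0, 0.0)]
score = pck(pred, truth, bbox_size=100.0, threshold=0.2)
print(f"PCK@0.2 = {100 * score:.1f}%")
```

      Two of the three annotated keypoints fall within 20 pixels of their ground truth, so the function returns 2/3; multiplying by 100 gives the percentage form.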

      • A reporting of the scales present in the dataset would be useful (e.g., histogram of unnormalized bounding boxes) and would align well with existing pose dataset papers such as MS-COCO (arXiv:1405.0312) which reports the distribution of instance sizes and instance density per image.

      RESPONSE: We have now included a histogram of unnormalized bounding boxes in the manuscript, Author response image 2.

      Author response image 2.

      Reviewer #2 (Public Review):

      The authors present the OpenApePose database constituting a collection of over 70000 ape images which will be important for many applications within primatology and the behavioural sciences. The authors have also rigorously tested the utility of this database in comparison to available Pose image databases for monkeys and humans to clearly demonstrate its solid potential.

      We thank the reviewer for the kind comments.

      However, the variation in the database with regards to individuals, background, source/setting is not clearly articulated and would be beneficial information for those wishing to make use of this resource in the future. At present, there is also a lack of clarity as to how this image database can be extrapolated to aid video data analyses which would be highly beneficial as well.

      I have two major concerns with regard to the manuscript as it currently stands which I think if addressed would aid the clarity and utility of this database for readers.

      1) Human annotators are mentioned as doing the 16 landmarks manually for all images but there is no assessment of inter-observer reliability or the such. I think something to this end is currently missing, along with how many annotators there were. This will be essential for others to know who may want to use this database in the future.

      We thank the reviewer for pointing this out. Inter-observer reliability is important for ensuring the quality of the annotations. We first used Amazon MTurk to crowdsource annotations and found that the inter-observer reliability and the annotation quality were poor. This was the reason for choosing a commercial service such as Hive AI. As the crowdsourcing and quality control are managed by Hive through their internal procedures, we do not have access to data that would allow us to assess inter-observer reliability. However, the annotation quality was assessed by first author ND through manual inspection of the annotations visualized on all of the images in the database. Additionally, our ablation experiments, with their high out-of-sample performance, further validate the quality of the annotations.

      Relevant to this comment, in your description of the database, a table or such could be included, providing the number of images from each source/setting per species and/or number of individuals. Something to give a brief overview of the variation beyond species. (subspecies would also be of benefit for example).

      Our goal was to obtain as many images as possible from the most commonly studied ape species. In order to ensure a large enough database, we focused only on the species and combined images from as many sources as possible to reach our goal of ~10,000 images per species. With the wide range of people involved in obtaining the images, we could not ensure that all the photographers had the necessary expertise to differentiate individuals and subspecies of the subjects they were photographing. We could only ensure that the right species was being photographed. Hence, we cannot include more detailed information.

      2) You mention around line 195 that you used a specific function for splitting up the dataset into training, validation, and test but there is no information given as to whether this was simply random or if an attempt to balance across species, individuals, background/source was made. I would actually think that a balanced approach would be more appropriate/useful here so whether or not this was done, and the reasoning behind that must be justified.

      This is especially relevant given that in one test you report balancing across species (for the sample size subsampling procedure).

      We created the training set to reflect the species composition of the whole dataset, but used test sets balanced by species. This was done to give a sense of the performance of a model trained with the entire dataset, in which the species are not fully balanced. We believe that researchers interested in training models using this dataset for behavior tracking applications would use the entire dataset to fully leverage the variation in the dataset. However, for those interested in training models with balanced species, we provide an annotation file with all the images included, which would allow researchers to create their own training and test sets that meet their specific needs. We have added this justification in the manuscript to guide other users with different needs. Lines 530-534: “We did not balance our training set for the species as we wanted to utilize the full variation in the dataset and assess models trained with the proportion of species as reflected in the dataset. We provide annotations including the entire dataset to allow others to create their own training/validation/test sets that suit their needs.”
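
      For readers who want to build their own splits from the full annotation file, one possible approach is sketched below. It is not the function used in the paper: the annotation schema (a list of dicts with a `species` key), the function name and the toy counts are hypothetical. The sketch holds out the same number of images per species for testing and leaves all remaining images in the training set, which therefore stays close to the species proportions of the dataset.

```python
import random
from collections import Counter, defaultdict


def split_with_balanced_test(annotations, test_per_species, seed=0):
    """Hold out `test_per_species` images of each species; train on the rest.

    annotations: list of dicts with a 'species' key (hypothetical schema).
    """
    rng = random.Random(seed)
    by_species = defaultdict(list)
    for ann in annotations:
        by_species[ann["species"]].append(ann)

    train, test = [], []
    for _, items in sorted(by_species.items()):
        items = items[:]          # copy so the caller's list is not shuffled
        rng.shuffle(items)
        test.extend(items[:test_per_species])
        train.extend(items[test_per_species:])
    return train, test


# Toy annotation list with made-up counts per species.
toy_counts = {"chimpanzee": 60, "gorilla": 40, "gibbon": 20}
annotations = [
    {"species": species, "image_id": f"{species}_{i}"}
    for species, count in toy_counts.items()
    for i in range(count)
]
train, test = split_with_balanced_test(annotations, test_per_species=5)
test_counts = Counter(ann["species"] for ann in test)
train_counts = Counter(ann["species"] for ann in train)
print("test:", dict(test_counts))
print("train:", dict(train_counts))
```

      A validation set can be carved out of the training portion in the same way, and a fully balanced training set can be obtained by additionally capping the number of training images per species.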

      And another perhaps major concern that I think should also be addressed somewhere is the fact that this is an image database tested on images while the abstract and manuscript mention the importance of pose estimation for video datasets, yet the current manuscript does not provide any clear test of video datasets nor engage with the practicalities associated with using this image-based database for applications to video datasets. Somewhere this needs to be added to clarify its practical utility.

      We thank the reviewer for this important suggestion. Since we can separate a video into its constituent frames, one can indeed use the provided model or other models trained using this dataset for inference on the frames, thus allowing video tracking applications. We now include a short video clip of a chimpanzee with inferences from the provided model visualized in the supplementary materials.

      Reviewer #1 (Recommendations For The Authors):

      • Please provide a more thorough description of the annotation procedure (i.e., the instructions given to crowd workers)! See public review for reference on dataset annotation reporting cards.

      We have included the list of instructions for Hive annotators in the supplementary materials.

      • An estimate of the crowd worker accuracy and variability would be super valuable!

      While we agree that this is useful, we do not have access to Hive internal data on crowd worker IDs that could allow us to estimate these metrics. Furthermore, we assessed each image manually to ensure good annotation quality.

      • In the methods section it is reported that images were discarded because they were either too blurry, small, or highly occluded. Further quantification could be provided. How many images were discarded per species?

      It’s not really clear to us why this is interesting or important. We used a large number of photographers and annotators, some of whom gave a high ratio of great images; some of whom gave a poor ratio. But it’s not clear what those ratios tell us.

      • Placing the numerical values at the end of the bars would make the graphs more readable in Figures 4 and 5.

      We thank the reviewer for this suggestion. While we agree that this can help, we do not have space to include the number in a font size that would be readable. Smaller font sizes that are likely to fit may not be readable for all readers. We have included the numerical values in the main text in the results section for those interested and hope that the figures provide a qualitative sense of the results to the readers.

    1. Author response:

      eLife Assessment

      This valuable short paper is an ingenious use of clinical patient data to address an issue in imaging neuroscience. The authors clarify the role of face-selectivity in human fusiform gyrus by measuring both BOLD fMRI and depth electrode recordings in the same individuals; furthermore, by comparing responses in different brain regions in the two patients, they suggested that the suppression of blood oxygenation is associated with a decrease in local neural activity. While the methods are compelling and provide a rare dataset of potentially general importance, the presentation of the data in its current form is incomplete.

      We thank the Reviewing editor and Senior editor at eLife for their positive assessment of our paper. After reading the reviewers’ comments – to which we reply below - we agree that the presentation of the data could be completed. We provide additional presentation of data in the responses below and we will slightly modify Figure 2 of the paper. However, in keeping the short format of the paper, the revised version will have the same number of figures, which support the claims made in the paper.

      Reviewer #1 (Public review):

      Summary:

      Measurement of BOLD MR imaging has regularly found regions of the brain that show reliable suppression of BOLD responses during specific experimental testing conditions. These observations are to some degree unexplained, in comparison with more usual association between activation of the BOLD response and excitatory activation of the neurons (most tightly linked to synaptic activity) in the same brain location. This paper finds two patients whose brains were tested with both non-invasive functional MRI and with invasive insertion of electrodes, which allowed the direct recording of neuronal activity. The electrode insertions were made within the fusiform gyrus, which is known to process information about faces, in a clinical search for the sites of intractable epilepsy in each patient. The simple observation is that the electrode location in one patient showed activation of the BOLD response and activation of neuronal firing in response to face stimuli. This is the classical association. The other patient showed an informative and different pattern of responses. In this person, the electrode location showed a suppression of the BOLD response to face stimuli and, most interestingly, an associated suppression of neuronal activity at the electrode site.

      Strengths:

      Whilst these results are not by themselves definitive, they add an important piece of evidence to a long-standing discussion about the origins of the BOLD response. The observation of decreased neuronal activation associated with negative BOLD is interesting because, at various times, exactly the opposite association has been predicted. It has been previously argued that if synaptic mechanisms of neuronal inhibition are responsible for the suppression of neuronal firing, then it would be reasonable

      Weaknesses:

      The chief weakness of the paper is that the results may be unique in a slightly awkward way. The observation of positive BOLD and neuronal activation is made at one brain site in one patient, while the complementary observation of negative BOLD and neuronal suppression actually derives from the other patient. Showing both effects in both patients would make a much stronger paper.

      We thank reviewer #1 for their positive evaluation of our paper. Obviously, we agree with the reviewer that the paper would be much stronger if BOTH effects – spike increase and decrease – were found in BOTH patients in their corresponding fMRI regions (lateral and medial fusiform gyrus), and in the same hemisphere. Nevertheless, we clearly acknowledge this limitation in the (revised) version of the manuscript (p.8: Material and Methods section).

      In the current paper, one could think that P1 shows only increases to faces, and P2 would show only decreases (irrespective of the region). However, that is not the case since 11% of P1’s face-selective units are decreases (89% are increases) and 4% of P2’s face-selective units are increases. This has now been made clearer in the manuscript (p.5).

      As the reviewer is certainly aware, the number and position of the electrodes are based on strict clinical criteria, and we will probably never encounter a situation with two neighboring (macro-micro hybrid) electrodes, one with microelectrodes ending up in the lateral MidFG, the other in the medial MidFG, in the same patient. If there is no clinical value for the patient, this cannot be done.

      The only thing we can do is to strengthen these results in the future by collecting data on additional patients with an electrode either in the lateral or the medial FG, together with fMRI. But these are the only two patients we have been able to record so far with electrodes falling unambiguously in such contrasted regions and with large (and comparable) measures.

      While we acknowledge that the results may be unique because of the use of 2 contrasted patients only (and this is why the paper is a short report), the data is compelling in these 2 cases, and we are confident that it will be replicated in larger cohorts in the future.

      Reviewer #2 (Public review):

      Summary:

      This is a short and straightforward paper describing BOLD fMRI and depth electrode measurements from two regions of the fusiform gyrus that show either higher or lower BOLD responses to faces vs. objects (which I will call face-positive and face-negative regions). In these regions, which were studied separately in two patients undergoing epilepsy surgery, spiking activity increased for faces relative to objects in the face-positive region and decreased for faces relative to objects in the face-negative region. Interestingly, about 30% of neurons in the face-negative region did not respond to objects and decreased their responses below baseline in response to faces (absolute suppression).

      Strengths:

      These patient data are valuable, with many recording sessions and neurons from human face-selective regions, and the methods used for comparing face and object responses in both fMRI and electrode recordings were robust and well-established. The finding of absolute suppression could clarify the nature of face selectivity in human fusiform gyrus since previous fMRI studies of the face-negative region could not distinguish whether face < object responses came from absolute suppression, or just relatively lower but still positive responses to faces vs. objects.

      Weaknesses:

      The authors claim that the results tell us about both 1) face-selectivity in the fusiform gyrus, and 2) the physiological basis of the BOLD signal. However, I would like to see more of the data that supports the first claim, and I am not sure the second claim is supported.

      (1) The authors report that ~30% of neurons showed absolute suppression, but those data are not shown separately from the neurons that only show relative reductions. It is difficult to evaluate the absolute suppression claim from the short assertion in the text alone (lines 105-106), although this is a critical claim in the paper.

      We thank reviewer #2 for their positive evaluation of our paper. We understand the reviewer’s point, and we partly agree. Where we respectfully disagree is that the finding of absolute suppression is critical for the claim of the paper: finding an identical contrast between the two regions in terms of RELATIVE increase/decrease of face-selective activity in fMRI and spiking activity is already novel and informative. Where we agree with the reviewer is that the absolute suppression could be more documented: it wasn’t, due to space constraints (brief report). We provide below an example of a neuron showing absolute suppression to faces. In the frequency domain, there is only a face-selective response (1.2 Hz and harmonics) but no significant response at 6 Hz (common general visual response). In the time-domain, relative to face onset, the response drops below baseline level. It means that this neuron has baseline (non-periodic) spontaneous spiking activity that is actively suppressed when a face appears.

      Author response image 1.

      (2) I am not sure how much light the results shed on the physiological basis of the BOLD signal. The authors write that the results reveal "that BOLD decreases can be due to relative, but also absolute, spike suppression in the human brain" (line 120). But I think to make this claim, you would need a region that exclusively had neurons showing absolute suppression, not a region with a mix of neurons, some showing absolute suppression and some showing relative suppression, as here. The responses of both groups of neurons contribute to the measured BOLD signal, so it seems impossible to tell from these data how absolute suppression per se drives the BOLD response.

      It is a fact that we find both kinds of responses in the same region. With this technique, we cannot tell whether neurons showing relative vs. absolute suppression of responses are spatially segregated (e.g., forming two separate sub-regions) or intermingled. Nor can we tell from our data how absolute suppression per se drives the BOLD response. In our view, this does not diminish the interest and originality of the study, but the statement "that BOLD decreases can be due to relative, but also absolute, spike suppression in the human brain" will be rephrased in the revised manuscript as follows: "that BOLD decreases can be due to relative, or absolute (or a combination of both), spike suppression in the human brain".

      Reviewer #3 (Public review):

      In this paper the authors conduct two experiments an fMRI experiment and intracranial recordings of neurons in two patients P1 and P2. In both experiments, they employ a SSVEP paradigm in which they show images at a fast rate (e.g. 6Hz) and then they show face images at a slower rate (e.g. 1.2Hz), where the rest of the images are a variety of object images. In the first patient, they record from neurons over a region in the mid fusiform gyrus that is face-selective and in the second patient, they record neurons from a region more medially that is not face selective (it responds more strongly to objects than faces). Results find similar selectivity between the electrophysiology data and the fMRI data in that the location which shows higher fMRI to faces also finds face-selective neurons and the location which finds preference to non faces also shows non face preferring neurons.

      Strengths:

      The data is important in that it shows that there is a relationship between category selectivity measured from electrophysiology data and category-selective from fMRI. The data is unique as it contains a lot of single and multiunit recordings (245 units) from the human fusiform gyrus - which the authors point out - is a humanoid specific gyrus.

      Weaknesses:

      My major concerns are two-fold:

      (i) There is a paucity of data; Thus, more information (results and methods) is warranted; and in particular there is no comparison between the fMRI data and the SEEG data.

      We thank reviewer #3 for their positive evaluation of our paper. If the reviewer means a paucity of data presentation, we agree and provide more below, although the methods and results information appear complete to us. The comparison between fMRI and SEEG is there, but can only be indirect (i.e., the data were collected at different times and cannot be related on a trial-by-trial basis, for instance). In addition, our manuscript aims to provide a short empirical contribution that furthers our understanding of the relationship between neural responses and the BOLD signal, not a model of neurovascular coupling.

      (ii) One main claim of the paper is that there is evidence for suppressed responses to faces in the non-face selective region. That is, the reduction in activation to faces in the non-face selective region is interpreted as a suppression in the neural response and consequently the reduction in fMRI signal is interpreted as suppression. However, the SSVEP paradigm has no baseline (it alternates between faces and objects) and therefore it cannot distinguish between lower firing rate to faces vs suppression of response to faces.

      We understand the concern of the reviewer, but we respectfully disagree that our paradigm cannot distinguish between lower firing rate to faces vs. suppression of response to faces. Indeed, since the stimuli are presented periodically (6 Hz), we can objectively distinguish stimulus-related activity from spontaneous neuronal firing. The baseline corresponds to spikes that are non-periodic, i.e., unrelated to the (common face and object) stimulation. For a subset of neurons, even this non-periodic baseline activity is suppressed, above and beyond the suppression of the 6 Hz response illustrated in Figure 2. We mention it in the manuscript, but we agree that we do not present illustrations of such a decrease in the time domain for SU, which we did not initially consider necessary (please see below for such a presentation).
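
      The logic of separating periodic, stimulus-locked activity from non-periodic firing can be illustrated on a synthetic signal. The sketch below is not the analysis pipeline of the paper: the sampling rate, duration, amplitudes and function name are hypothetical. It builds a firing-rate trace made of a constant baseline, a general visual response at 6 Hz and a face-selective response at 1.2 Hz (one face every five images), then reads out the Fourier amplitude at each frequency.

```python
import cmath
import math


def amplitude_at(signal, sampling_rate, frequency):
    """Amplitude of the Fourier component of `signal` at `frequency` (Hz)."""
    n = len(signal)
    acc = sum(
        x * cmath.exp(-2j * math.pi * frequency * i / sampling_rate)
        for i, x in enumerate(signal)
    )
    return 2.0 * abs(acc) / n


sampling_rate = 100.0   # Hz, hypothetical
duration = 10.0         # s, hypothetical (an integer number of cycles of each frequency)
n = int(sampling_rate * duration)
t = [i / sampling_rate for i in range(n)]

# Synthetic firing rate: constant baseline + 6 Hz visual + 1.2 Hz face-selective part.
rate = [
    5.0
    + 2.0 * math.sin(2 * math.pi * 6.0 * ti)
    + 1.0 * math.sin(2 * math.pi * 1.2 * ti)
    for ti in t
]

amp_visual = amplitude_at(rate, sampling_rate, 6.0)   # general visual response
amp_face = amplitude_at(rate, sampling_rate, 1.2)     # face-selective response
amp_other = amplitude_at(rate, sampling_rate, 3.5)    # untagged frequency
print(amp_visual, amp_face, amp_other)
```

      The tagged frequencies recover their amplitudes while the untagged frequency stays at zero, and the constant baseline does not leak into either; this is why, in such a design, activity that is not locked to the stimulation can be treated as a baseline against which suppression is assessed.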

      (1) Additional data: the paper has 2 figures: figure 1 which shows the experimental design and figure 2 which presents data, the latter shows one example neuron raster plot from each patient and group average neural data from each patient. In this reader's opinion this is insufficient data to support the conclusions of the paper. The paper will be more impactful if the researchers would report the data more comprehensively.

      We respond to the more specific requests for additional evidence below, but the reviewer should be aware that this is a short report, which reaches the word limit. In our view, the group average neural data should be sufficient to support the conclusions, and the example neurons are there for illustration. And while we cannot provide the raster plots for a large number of neurons, the anonymized data will be made available upon publication of the final version of the paper.

      (a) There is no direct comparison between the fMRI data and the SEEG data, except for a comparison of the location of the electrodes relative to the statistical parametric map generated from a contrast (Fig 2a,d). It will be helpful to build a model linking between the neural responses to the voxel response in the same location - i.e., estimate from the electrophysiology data the fMRI data (e.g., Logothetis & Wandell, 2004).

      As mentioned above, the comparison between fMRI and SEEG is indirect (i.e., the data were collected at different times and are not related on a trial-by-trial basis, for instance) and would not allow us to build such a model.

      (b) More comprehensive analyses of the SSVEP neural data: It will be helpful to show the results of the frequency analyses of the SSVEP data for all neurons to show that there are significant visual responses and significant face responses. It will be also useful to compare and quantify the magnitude of the face responses compared to the visual responses.

      The data has been analyzed comprehensively, but we would not be able to show all neurons with such significant visual responses and face-selective responses.

      (c) The neuron shown in E shows cyclical responses tied to the onset of the stimuli, is this the visual response?

      Correct, it’s the visual response at 6 Hz.

      If so, why is there an increase in the firing rate of the neuron before the face stimulus is shown in time 0?

      Because the stimulation is continuous. What is displayed at 0 is the onset of the face stimulus, with each face stimulus being preceded by 4 images of nonface objects.

      The neuron's data seems different than the average response across neurons; This raises a concern about interpreting the average response across neurons in panel F which seems different than the single neuron responses

      The reviewer is correct, and we apologize for the confusion. This is because the average data in panel F have been notch-filtered at 6 Hz (and its harmonics), as indicated in the methods (p.11): ‘a FFT notch filter (filter width = 0.05 Hz) was then applied on the 70 s single or multi-units time-series to remove the general visual response at 6 Hz and two additional harmonics (i.e., 12 and 18 Hz)’.

      Here is the same data without the notch-filter (the 6Hz periodic response is clearly visible):

      Author response image 2.

      For the sake of clarity, we prefer presenting the notch-filtered data in the paper, but the revised version will make it clear in the figure caption that the average data have been notch-filtered.
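      The notch filtering quoted from the methods can be sketched as follows. This is a minimal illustration under stated assumptions, not the analysis code used in the paper: the 50 Hz binning, the toy signal and the function names `dft_bin` and `fft_notch` are hypothetical. Only the 70 s duration, the 0.05 Hz filter width and the 6, 12 and 18 Hz targets come from the quoted methods. The sketch removes the selected Fourier bins directly, which gives the same result as transforming, zeroing those bins and transforming back.

```python
import cmath
import math


def dft_bin(x, k):
    """Return DFT coefficient k of the real-valued sequence x."""
    n = len(x)
    step = -2j * math.pi * k / n
    return sum(v * cmath.exp(step * i) for i, v in enumerate(x))


def fft_notch(x, fs_hz, targets_hz, width_hz=0.05):
    """Zero every DFT bin within width_hz / 2 of each target frequency.

    Same result as FFT -> zero bins -> inverse FFT, but only the notched
    bins are evaluated, so the standard library is sufficient.
    """
    n = len(x)
    df = fs_hz / n  # frequency resolution of the DFT
    bins = set()
    for f0 in targets_hz:
        lo = math.ceil((f0 - width_hz / 2) / df)
        hi = math.floor((f0 + width_hz / 2) / df)
        bins.update(range(lo, hi + 1))
    out = list(x)
    for k in sorted(bins):
        if not 0 < k < n / 2:
            continue  # this sketch only handles bins strictly between DC and Nyquist
        c = dft_bin(x, k)
        step = 2j * math.pi * k / n
        # Bin k and its mirror bin n - k together contribute (2 / n) * Re(c * exp(step * i)).
        for i in range(n):
            out[i] -= 2.0 / n * (c * cmath.exp(step * i)).real
    return out


# Toy example (hypothetical values): a 70 s series binned at 50 Hz with a
# 6 Hz "general visual" component and a 1.2 Hz "face-selective" component.
fs_hz = 50.0
n_samples = int(70 * fs_hz)
signal = [
    1.0
    + math.sin(2 * math.pi * 6.0 * i / fs_hz)
    + 0.5 * math.sin(2 * math.pi * 1.2 * i / fs_hz)
    for i in range(n_samples)
]
filtered = fft_notch(signal, fs_hz, targets_hz=[6.0, 12.0, 18.0])
```

      In this toy case the 6 Hz component is removed while the 1.2 Hz component is left untouched, which is why a periodic 6 Hz response visible in a single-neuron raster would not appear in a notch-filtered average.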

      (d) Related to (c) it would be useful to show raster plots of all neurons and quantify if the neural responses within a region are homogeneous or heterogeneous. This would add data relating the single neuron response to the population responses measured from fMRI. See also Nir 2009.

      We agree with the reviewer that this is interesting, but again we do not think that it is necessary for the point made in the present paper. Responses in these regions appear rather heterogeneous, and we are currently working on a longer paper with additional SEEG data (other patients tested for shorter sessions) to define and quantify the face-selective neurons in the MidFusiform gyrus with this approach (without relating it to the fMRI contrast as reported here).

      (e) When reporting group average data (e.g., Fig 2C,F) it is necessary to show standard deviation of the response across neurons.

      We agree with the reviewer and have modified Figure 2 accordingly in the revised manuscript.

      (f) Is it possible to estimate the latency of the neural responses to face and object images from the phase data? If so, this will add important information on the timing of neural responses in the human fusiform gyrus to face and object images.

      The fast periodic paradigm to measure neural face-selectivity has been used in tens of studies since its original reports:

      - in EEG: Rossion et al., 2015: https://doi.org/10.1167/15.1.18

      - in SEEG: Jonas et al., 2016: https://doi.org/10.1073/pnas.1522033113

      In this paradigm, the face-selective response spreads to several harmonics (1.2 Hz, 2.4 Hz, 3.6 Hz, etc.) (which are summed for quantifying the total face-selective amplitude). This is illustrated below by the averaged single units’ SNR spectra across all recording sessions for both participants.

      Author response image 3.

      There is no unique phase-value, each harmonic being associated with a phase-value, so that the timing cannot be unambiguously extracted from phase values. Instead, the onset latency is computed directly from the time-domain responses, which is more straightforward and reliable than using the phase. Note that the present paper is not about the specific time-courses of the different types of neurons, which would require a more comprehensive report, but which is not necessary to support the point made in the present paper about the SEEG-fMRI sign relationship.

      (g) Related to (e): In total the authors recorded data from 245 units (some single units and some multiunits) and they found that in both the face-selective and non-face-selective regions most of the recorded neurons exhibited face-selectivity, which this reader found confusing. They write: “Among all visually responsive neurons, we found a very high proportion of face-selective neurons (p < 0.05) in both activated and deactivated MidFG regions (P1: 98.1%; N = 51/52; P2: 86.6%; N = 110/127)”. Is the face selectivity in P1 an increase in response to faces and in P2 a reduction in response to faces, or is it an increase in response to faces in both?

      Face-selectivity is defined as a DIFFERENTIAL response to faces compared to objects, not necessarily a larger response to faces. So yes, face-selectivity in P1 is an increase in response to faces and in P2 a reduction in response to faces.

      (1) Additional methods

      (a) it is unclear if the SSVEP analyses of neural responses were done on the spikes or the raw electrical signal. If the former, how is the SSVEP frequency analysis done on discrete data like action potentials?

      The FFT is applied directly on spike trains using Matlab’s discrete Fourier Transform function. This function is suitable to be applied to spike trains in the same way as to any sampled digital signal (here, the microwires signal was sampled at 30 kHz, see Methods).

      In complementary analyses, we also applied the FFT to spike trains that had been temporally smoothed by convolving them with a 20 ms square window (Le Cam et al., 2023, cited in the paper). This did not change the outcome of the frequency analyses in the frequency range we are interested in.
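      To illustrate why a Fourier analysis is well defined on discrete spike data, here is a minimal sketch (not the authors' Matlab code). For a binary spike train, the discrete Fourier sum at a given frequency reduces to one complex exponential per spike. The simulated neuron, its firing rate, the modulation depth and the function names are hypothetical; only the 6 Hz stimulation rate, the 70 s sequence length and the 30 kHz sampling rate are taken from the text.

```python
import cmath
import math
import random


def spike_train_amplitude(spike_times_s, freq_hz, duration_s):
    """Amplitude (spikes/s) of the spike train's Fourier component at freq_hz.

    For a binary train (1 at each spike sample, 0 elsewhere), the DFT sum
    reduces to one complex exponential per spike.
    """
    z = sum(cmath.exp(-2j * math.pi * freq_hz * t) for t in spike_times_s)
    return 2.0 * abs(z) / duration_s


def simulate_spikes(duration_s, fs_hz, mean_rate_hz, depth, mod_hz, rng):
    """Inhomogeneous Poisson spikes (thinning), snapped to the sampling grid."""
    max_rate = mean_rate_hz * (1.0 + depth)
    t, spikes = 0.0, []
    while True:
        t += rng.expovariate(max_rate)
        if t >= duration_s:
            return spikes
        rate = mean_rate_hz * (1.0 + depth * math.cos(2 * math.pi * mod_hz * t))
        if rng.random() * max_rate < rate:
            spikes.append(round(t * fs_hz) / fs_hz)


# Hypothetical neuron: 15 spikes/s on average, modulated at 6 Hz.
rng = random.Random(0)
spikes = simulate_spikes(70.0, 30000.0, mean_rate_hz=15.0, depth=0.8,
                         mod_hz=6.0, rng=rng)
amp_6hz = spike_train_amplitude(spikes, 6.0, 70.0)   # stimulation frequency
amp_off = spike_train_amplitude(spikes, 5.5, 70.0)   # neighbouring frequency
```

      The amplitude at the stimulation frequency stands well above that at a neighbouring frequency, which is the basis for separating periodic (stimulus-related) from non-periodic activity.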

      (b) it is unclear why the onset time was shifted by 33ms; one can measure the phase of the response relative to the cycle onset and use that to estimate the delay between the onset of a stimulus and the onset of the response. Adding phase information will be useful.

      The onset time was shifted by 33 ms because the stimuli are presented with a sinewave contrast modulation (i.e., at 0 ms, the stimulus has 0% contrast). 100% contrast is reached at half a stimulation cycle, which is 83.33 ms here, but a response is likely triggered before 100% contrast is reached. To estimate the delay between the start of the sinewave (0% contrast) and the triggering of a neural response, we tested 7 SEEG participants with the same images presented in FPVS sequences either with a sinewave contrast modulation (black line) or with a squarewave (i.e., abrupt) contrast modulation (red line). The 33 ms value is based on the LFP data obtained in response to sinewave and squarewave stimulation in the same paradigm. This delay corresponds to 4 screen refresh frames (120 Hz refresh rate = 8.33 ms per frame) and 35% of the full contrast, as illustrated below (please see also Retter, T. L., & Rossion, B. (2016). Uncovering the neural magnitude and spatio-temporal dynamics of natural image categorization in a fast visual stream. Neuropsychologia, 91, 9–28).

      Author response image 4.
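      The arithmetic behind the 33 ms shift can be checked with a short sketch. It assumes that the sinewave contrast modulation follows a raised cosine starting at 0% contrast; this functional form and the function name `sinewave_contrast` are assumptions, not taken from the methods. Under that assumption, 4 frames at 120 Hz correspond to roughly 35% contrast, consistent with the values stated above.

```python
import math


def sinewave_contrast(t_s, stim_hz=6.0):
    """Contrast (0..1) at time t_s for a sinusoidal modulation starting at 0%.

    Assumed form: a raised cosine with one full cycle per stimulus.
    """
    return 0.5 * (1.0 - math.cos(2.0 * math.pi * stim_hz * t_s))


frame_s = 1.0 / 120.0            # one screen refresh at 120 Hz (8.33 ms)
shift_s = 4 * frame_s            # 4 frames, i.e. about 33.3 ms
half_cycle_s = 0.5 / 6.0         # 83.33 ms, where full contrast is reached

contrast_at_shift = sinewave_contrast(shift_s)
contrast_at_half_cycle = sinewave_contrast(half_cycle_s)
```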

      (2) Interpretation of suppression:

      The SSVEP paradigm alternates between 2 conditions: faces and objects and has no baseline; In other words, responses to faces are measured relative to the baseline response to objects so that any region that contains neurons that have a lower firing rate to faces than objects is bound to show a lower response in the SSVEP signal. Therefore, because the experiment does not have a true baseline (e.g. blank screen, with no visual stimulation) this experimental design cannot distinguish between lower firing rate to faces vs suppression of response to faces.

      The strongest evidence put forward for suppression is the response of non-visual neurons that was also reduced when patients looked at faces, but since these are non-visual neurons, it is unclear how to interpret the responses to faces.

      We understand this point, but how does the reviewer know that these are non-visual neurons? Because these neurons are located in the visual cortex, they are likely to be visual neurons that are not responsive to non-face objects. In any case, as the reviewer writes, we think it’s strong evidence for suppression.

      We thank all three reviewers for their positive evaluation of our paper and their constructive comments.

    1. Author response:

      Public Reviews:

      Reviewer #1 (Public review):

      Summary:

      This paper concerns mechanisms of foraging behavior in C. elegans. Upon removal from food, C. elegans first executes a stereotypical local search behavior in which it explores a small area by executing many random, undirected reversals and turns called "reorientations." If the worm fails to find food, it transitions to a global search in which it explores larger areas by suppressing reorientations and executing long forward runs (Hills et al., 2004). At the population level, the reorientation rate declines gradually. Nevertheless, about 50% of individual worms appear to exhibit an abrupt transition between local and global search, which is evident as a discrete transition from high to low reorientation rate (Lopez-Cruz et al., 2019). This observation has given rise to the hypothesis that local and global search correspond to separate internal states with the possibility of sudden transitions between them (Calhoun et al., 2014). The main conclusion of the paper is that it is not necessary to posit distinct internal states to account for discrete transitions from high to low reorientation rates. On the contrary, discrete transitions can occur simply because of the stochastic nature of the reorientation behavior itself.

      Strengths:

      The strength of the paper is the demonstration that a more parsimonious model explains abrupt transitions in the reorientation rate.

      Weaknesses:

      (1) Use of the Gillespie algorithm is not well justified. A conventional model with a fixed dt and an exponentially decaying reorientation rate would be adequate and far easier to explain. It would also be sufficiently accurate - given the appropriate choice of dt - to support the main claims of the paper, which are merely qualitative. In some respects, the whole point of the paper - that discrete transitions are an epiphenomenon of stochastic behavior - can be made with the authors' version of the model having a constant reorientation rate (Figure 2f).

      We apologize, but we are not sure what the reviewer means by “fixed dt”. If the reviewer means taking discrete steps in time (dt), and modeling whether a reorientation occurs, we would argue that the Gillespie algorithm is a better way to do this because it provides floating-point precision time resolution, rather than a time resolution limited by dt, which we hopefully explain in the comments below.

      The reviewer is correct that discrete transitions are an epiphenomenon of stochastic behavior, as we show in Figure 2f. However, abrupt stochastic jumps that occur with a constant rate do not produce persistent changes in the observed rate because the rate is, by definition, constant. The theory that there are local and global searches is based on the observation that individual worms often abruptly change their rates. But this observation is only true for a fraction of worms. We are trying to argue that the reason why this is not observed for all, or even most, worms is that these changes are the result of stochastic sampling, not a sudden change in search strategy.

      (2) In the manuscript, the Gillespie algorithm is very poorly explained, even for readers who already understand the algorithm; for those who do not it will be essentially impossible to comprehend. To take just a few examples: in Equation (1), omega is defined as reorientations instead of cumulative reorientations; it is unclear how (4) follows from (2) and (3); notation in (5), line 133, and (7) is idiosyncratic. Figure 1a does not help, partly because the notation is unexplained. For example, what do the arrows mean, what does "*" mean?

      We apologize for this, you are correct, Ω is cumulative reorientations, and we will edit the text as follows:

      Experimentally, reorientation rate is measured as the number of reorientation events that occurred in an observational window. However, these are discrete stochastic events, so we should describe them in terms of propensity, i.e. the probability of observing a transitional event (in this case, a reorientation) is:

      Here, P(Ω+1,t) is the probability of observing a reorientation event at time t, and a<sub>1</sub> is the propensity for this event to occur. Observationally, the frequency of reorientations observed decays over time, so we can define the propensity as:

      Where α is the initial propensity at t=0.

      We can model this decay as the reorientation propensity coupled to a decaying factor (M):

      Where the propensity of this event (a<sub>2</sub>) is:

      Since M is a first-order decay process, when integrated, the cumulative M observed is:

      We can couple the probability of observing a reorientation to this decay by redefining a<sub>1</sub> as:

      So that now:

      A critical detail should be noted. While reorientations are modeled as discrete events, the amount of M at time t = 0 is chosen to be large (M<sub>0</sub>←1,000), so that over the timescale of 40 minutes, the decay in M is practically continuous. This ensures that sudden changes in reorientations are not due to sudden changes in M, but due to the inherent stochasticity of reorientations.

      To model both processes, we can create the master equation:

      Since these are both Poisson processes, the probability density function for a state change i occurring after a time interval τ is:

      The probability that an event will not occur in time interval τ is:

      The probability that no events will occur for ALL transitions in this time interval is:

      We can draw a random number (r<sub>1</sub> ∈ [0,1]) that represents the probability of no events in time interval τ, so that this time interval can be assigned by rearranging equation 11:

      where:

      This is the time interval (τ) until any event (Ω+1 or M-1) happens, i.e., the next event occurs at t + τ. The probability of which event occurs is proportional to its propensity:

      We can draw a second number (r<sub>2</sub> ∈ [0,1]) that represents this probability so that which event occurs at time t + τ is determined by the smallest n that satisfies:

      so that:

      The elegant efficiency of the Gillespie algorithm is two-fold. First, it models all transitions simultaneously, not separately. Second, it provides floating-point time resolution. Rather than drawing a random number, and using a cumulative probability distribution of interval-times to decide whether an event occurs at discrete steps in time, the Gillespie algorithm uses this distribution to draw the interval-time itself. The time resolution of the prior approach is limited by step size, whereas the Gillespie algorithm’s time resolution is limited by the floating-point precision of the random number that is drawn.
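      The steps described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation. The equations are not reproduced in this response, so the propensity forms used here (a1 = alpha * M / M0 for a reorientation and a2 = gamma * M for the decay M → M-1) are assumptions. The parameter values and the function name `gillespie_foraging` are hypothetical. Only M0 = 1,000 and the 40-minute timescale come from the text.

```python
import math
import random


def gillespie_foraging(alpha, gamma, m0, t_end, rng):
    """Simulate coupled reorientation (Omega + 1) and decay (M - 1) events.

    alpha: assumed initial reorientation propensity (events per minute)
    gamma: assumed first-order decay constant of M (per minute)
    Returns the reorientation times and the amount of M left at t_end.
    """
    t, m = 0.0, m0
    reorientation_times = []
    while True:
        a1 = alpha * m / m0        # assumed propensity of a reorientation
        a2 = gamma * m             # assumed propensity of M -> M - 1
        a0 = a1 + a2               # total propensity of any event
        if a0 <= 0.0:
            break                  # M is exhausted, nothing more can happen
        r1 = 1.0 - rng.random()    # uniform in (0, 1], avoids log of zero
        tau = math.log(1.0 / r1) / a0   # waiting time to the next event
        if t + tau > t_end:
            break
        t += tau
        r2 = rng.random()
        if r2 * a0 <= a1:          # smallest n whose cumulative propensity covers r2 * a0
            reorientation_times.append(t)
        else:
            m -= 1
    return reorientation_times, m


# Hypothetical parameters: 1.5 reorientations/min initially, decaying at 0.05/min.
rng = random.Random(1)
times, m_left = gillespie_foraging(alpha=1.5, gamma=0.05, m0=1000,
                                   t_end=40.0, rng=rng)
```

      The waiting time `tau` is drawn directly from the exponential distribution rather than testing for an event at each fixed step dt, so event times are not quantized by a step size. Repeating the simulation with different seeds gives one simulated reorientation trajectory per run.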

      We are happy to add this text to improve clarity.

      We apologize for the arrow notation confusion. Arrow notation is commonly used in pseudocode to indicate variable assignment, and so we used it to indicate variable assignment updates in the algorithm.

      We added Figure 2a to help explain the Gillespie algorithm for people who are unfamiliar with it, but you are correct that some notation, such as the probabilities, was left unexplained. We will address this to improve clarity.

      (3) In the model, the reorientation rate dΩ⁄dt declines to zero but the empirical rate clearly does not. This is a major flaw. It would have been easy to fix by adding a constant to the exponentially declining rate in (1). Perhaps fixing this obvious problem would mitigate the discrepancies between the data and the model in Figure 2d.

      You are correct that the model deviates slightly at longer times, but this result is consistent with Klein et al., who show a continuous decline of reorientations. However, we could add a constant to the model, since an infinite run length is likely not physiological.

      (4) Evidence that the model fits the data (Figure 2d) is unconvincing. I would like to have seen the proportion of runs in which the model generated one as opposed to multiple or no transitions in reorientation rate; in the real data, the proportion is 50% (Lopez). It is claimed that the "model demonstrated a continuum of switching to non-switching behavior" as seen in the experimental data but no evidence is provided.

      We should clarify that the 50% proportion cited by López-Cruz was based on an arbitrary difference in slopes, and on a visual assessment of the data. We sought to avoid this subjective assessment by plotting the distribution of slopes and transition times produced by the method used in López-Cruz. We should also clarify what we meant by “a continuum of switching and non-switching” behavior. Neither the transition time distributions nor the slope-difference distributions appear to be the result of two distributions. This is unlike roaming and dwelling on food, where two distinct distributions of behavioral metrics can be identified based on speed and angular speed (Flavell et al, 2009, Fig S2a). We will add a permutation test to verify that the mean differences in slopes and transition times between the experiment and the model are not significant.
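      A permutation test of the kind mentioned above could look like the following sketch. It is only an illustration of the general technique: the test statistic (absolute difference in means), the number of permutations and the function name are assumptions, and no data from the study are used.

```python
import random


def permutation_test_mean_diff(x, y, n_perm=2000, rng=None):
    """Two-sided permutation test on the difference in means of x and y.

    Returns the fraction of label shufflings whose absolute mean difference
    is at least as large as the observed one (with the +1 correction).
    """
    rng = rng or random.Random(0)
    n_x = len(x)
    observed = abs(sum(x) / n_x - sum(y) / len(y))
    pooled = list(x) + list(y)
    count = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)
        perm_x, perm_y = pooled[:n_x], pooled[n_x:]
        diff = abs(sum(perm_x) / len(perm_x) - sum(perm_y) / len(perm_y))
        if diff >= observed:
            count += 1
    return (count + 1) / (n_perm + 1)
```

      Here `x` and `y` would hold, for example, the transition times (or slope differences) from the experiment and from the model. A large p-value means the permutation test does not detect a difference in means between the two.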

      (5) The explanation for the poor fit between the model and data (lines 166-174) is unclear. Why would externally triggered collisions cause a shift in the transition distribution?

      Thank you, we should rewrite the text to make this clearer. There were no externally triggered collisions; 10 animals were used per experiment. They would occasionally collide during the experiment, but these collisions were excluded from the data that were provided. However, worms are also known to increase reorientations when they encounter a pheromone trail, and it is unknown (from this dataset) which reorientations may have been a result of this phenomenon.

      (6) The discussion of Levy walks and the accompanying figure are off-topic and should be deleted.

      Thank you, we agree that this topic is tangential, and we will remove it.

      Reviewer #2 (Public review):

      Summary:

      In this study, the authors build a statistical model that stochastically samples from a time-interval distribution of reorientation rates. The form of the distribution is extracted from a large array of behavioral data, and is then used to describe not only the dynamics of individual worms (including the inter-individual variability in behavior), but also the aggregate population behavior. The authors note that the model does not require assumptions about behavioral state transitions, or evidence accumulation, as has been done previously, but rather that the stochastic nature of behavior is "simply the product of stochastic sampling from an exponential function".

      Strengths:

      This model provides a strong juxtaposition to other foraging models in the worm. Rather than evoking a behavioral transition function (that might arise from a change in internal state or the activity of a cell type in the network), or evidence accumulation (which again maps onto a cell type, or the activity of a network) - this model explains behavior via the stochastic sampling of a function of an exponential decay. The underlying model and the dynamics being simulated, as well as the process of stochastic sampling, are well described and the model fits the exponential function (Equation 1) to data on a large array of worms exhibiting diverse behaviors (1600+ worms from Lopez-Cruz et al). The work of this study is able to explain or describe the inter-individual diversity of worm behavior across a large population. The model is also able to capture two aspects of the reorientations, including the dynamics (to switch or not to switch) and the kinetics (slow vs fast reorientations). The authors also work to compare their model to a few others including the Levy walk (whose construction arises from a Markov process) to a simple exponential distribution, all of which have been used to study foraging and search behaviors.

      Weaknesses:

      This manuscript has two weaknesses that dampen the enthusiasm for the results. First, in all of the examples the authors cite where a Gillespie algorithm is used to sample from a distribution, be it the kinetics associated with chemical dynamics, or a Lotka-Volterra Competition Model, there are underlying processes that govern the evolution of the dynamics, and thus the sampling from distributions. In one of their references, for instance, the stochasticity arises from the birth and death rates, thereby influencing the genetic drift in the model. In these examples, the process governing the dynamics (and thus generating the distributions from which one samples) is distinct from the behavior being studied. In this manuscript, the distribution being sampled is the exponential decay function of the reorientation rate (lines 100-102). This appears to be tautological - a decay function fitted to the reorientation data is then sampled to generate the distributions of the reorientation data. That the model performs well and matches the data is commendable, but it is unclear how that could not be the case if the underlying function generating the distribution was fit to the data.

      Thank you, we apologize that this was not clearer. In the Lotka-Volterra model, the density of predators and prey are being modeled, with the underlying assumption that rates of birth and death are inherently stochastic. In our model, the number of reorientations are being modeled, with the assumption (based on the experiments), that the occurrence of reorientations is stochastic, just like the occurrence (birth) of a prey animal is stochastic. However, the decay in M is phenomenological, and we speculate about the nature of M later in the manuscript.

      You are absolutely right that the decay function for M was fitted to the population average of reorientations and then sampled to generate the distributions of the reorientation data. This was intentional to show that the parameters chosen to match the population average would produce individual trajectories with comparable stochastic “switching” as the experimental data. All we’re trying to show really is that observed sudden changes in reorientation that appear persistent can be produced by a stochastic process without resorting to binary state assignments. In Calhoun, et al 2014 it is reported all animals produced switch-like behavior, but in Klein et al, 2017 it is reported that no animals showed abrupt transitions. López-Cruz et al seem to show a mix of these results, which can be easily explained by an underlying stochastic process.

      The second weakness is somewhat related to the first, in that absent an underlying mechanism or framework, one is left wondering what insight the model provides. Stochastic sampling a function generated by fitting the data to produce stochastic behavior is where one ends up in this framework, and the authors indeed point this out: "simple stochastic models should be sufficient to explain observably stochastic behaviors." (Line 233-234). But if that is the case, what do we learn about how the foraging is happening? The authors suggest that the decay parameter M can be considered a memory timescale; which offers some suggestion, but then go on to say that the "physical basis of M can come from multiple sources". Here is where one is left for want: The mechanisms suggested, including loss of sensory stimuli, alternations in motor integration, ionotropic glutamate signaling, dopamine, and neuropeptides are all suggested: these are basically all of the possible biological sources that can govern behavior, and one is left not knowing what insight the model provides. The array of biological processes listed is so variable in dynamics and meaning, that their explanation of what governs M is at best unsatisfying. Molecular dynamics models that generate distributions can point to certain properties of the model, such as the binding kinetics (on and off rates, etc.) as explanations for the mechanisms generating the distributions, and therefore point to how a change in the biology affects the stochasticity of the process. It is unclear how this model provides such a connection, especially taken in aggregate with the previous weakness.

      Providing a roadmap of how to think about the processes generating M, the meaning of those processes in search, and potential frameworks that are more constrained and with more precise biological underpinning (beyond the array of possibilities described) would go a long way to assuaging the weaknesses.

      Thank you, these are all excellent points. We should clarify that in López-Cruz et al, they claim that only 50% of the animals fit a local/global search paradigm. We are simply proposing there is no need for designating local and global searches if the data don’t really support it. The underlying behavior is stochastic, so the sudden switches sometimes observed can be explained by a stochastic process where the underlying rate is slowing down, thus producing the persistently slow reorientation rate when an apparent “switch” occurs. What we hope to convey is that foraging doesn’t appear to follow a decision paradigm, but instead a gradual change in reorientations which for individual worms, can occasionally produce reorientation trajectories that appear switch-like.

      As for M, you are correct, we should be more explicit. A decay in reorientation rate, rather than a sudden change, is consistent with observations made by López-Cruz et al. They found that the neurons AIA and ADE redundantly suppress reorientations, and that silencing either one was sufficient to restore the large number of reorientations during early foraging. The synaptic output of AIA and ADE was inhibited over long timescales (tens of minutes) by presynaptic glutamate binding to MGL-1, a slow G-protein-coupled receptor expressed in AIA and ADE. Their results support a model where sensory neurons suppress the synaptic output of AIA and ADE, which in turn leads to a large number of reorientations early in foraging. As time passes, glutamatergic input from the sensory neurons decreases, which leads to disinhibition of AIA and ADE, and a subsequent suppression of reorientations.

      The sensory inputs into AIA and ADE are sequestered into two separate circuits, with AIA receiving chemosensory input and ADE receiving mechanosensory input. Since the suppression of either AIA or ADE is sufficient to increase reorientations, the decay in reorientations is likely due to the synaptic output of both of these neurons decaying in time. This correlates with an observed decrease in sensory neuron activity as well, so the timescale of reorientation decay could be tied to the timescale of sensory neuron activity, which in turn is influencing the timescale of AIA/ADE reorientation suppression. This implies that our factor “M” is likely the sum of several different sensory inputs decaying in time.

      The molecular basis of which sensory neuron signaling factors contribute to decreased AIA and ADE activity is made more complicated by the observation that the glutamatergic input provided by the sensory neurons was not essential, and that additional factors besides glutamate contribute to the signaling to AIA and ADE. In addition to this, it is not simply the sensory neuron activity that decays in time, but also the sensitivity of AIA and ADE to sensory neuron input. Simply depolarizing sensory neurons after the animals had starved for 30 minutes was insufficient to rescue the reorientation rates observed earlier in the foraging assay. This observation could be due to decreased presynaptic vesicle release, and/or decreased receptor localization on the postsynaptic side.

      In summary, there are two neuronal properties that appear to be decaying in time. One is sensory neuron activity, and the other is decreased potentiation of presynaptic input onto AIA and ADE. Our factor “M” is a phenomenological manifestation of these numerous decaying factors.

      Reviewer #3 (Public review):

      Summary:

      This intriguing paper addresses a special case of a fundamental statistical question: how to distinguish between stochastic point processes that derive from a single "state" (or single process) and more than one state/process. In the language of the paper, a "state" (perhaps more intuitively called a strategy/process) refers to a set of rules that determine the temporal statistics of the system. The rules give rise to probability distributions (here, the probability for turning events). The difficulty arises when the sampling time is finite, and hence, the empirical data is finite, and affected by the sampling of the underlying distribution(s). The specific problem being tackled is the foraging behavior of C. elegans nematodes, removed from food. Such foraging has been studied for decades, and described by a transition over time from 'local'/'area-restricted' search (roughly in the initial 10-30 minutes of the experiments, in which animals execute frequent turns) to 'dispersion', or 'global search' (characterized by a low frequency of turns). The authors propose an alternative to this two-state description - a potentially more parsimonious single 'state' with time-changing parameters, which they claim can account for the full-time course of these observations.

      Figure 1a shows the mean rate of turning events as a function of time (averaged across the population). Here, we see a rapid transient, followed by a gradual 4-5 fold decay in the rate, which then levels off. This picture seems consistent with the two-state description. However, the authors demonstrate that individual animals exhibit different "transition" statistics (Figure 1e) and wish to explain this. They do so by fitting this mean with a single function (Equations 1-3).

      Strengths:

      As a qualitative exercise, the paper might have some merit. It demonstrates that apparently discrete states can sometimes be artifacts of sampling from smoothly time-changing dynamics. However, as a generic point, this is not novel, and so without the grounding in C. elegans data, is less interesting.

      Weaknesses:

      (1) The authors claim that only about half the animals tested exhibit discontinuity in turning rates. Can they automatically separate the empirical and model population into these two subpopulations (with the same method), and compare the results?

      Thank you, we should clarify that the observation that about half the animals exhibit discontinuity was not made by us, but by López-Cruz et al. The observed fraction of 50% was based on a visual assessment of the dual regression method we described. To make the process more objective, we decided to simply plot the distributions of the metrics they used for this assessment to see if two distinct populations could be observed. However, the distributions of slope differences and transition times do not produce two distinct populations. Our stochastic approach, which does not assume abrupt state transitions, also produces comparable distributions. To quantify this, we will perform permutation tests on the mean and variance differences between experimental and model data.
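
      A permutation test of this kind could look like the minimal sketch below. It is an illustration rather than the authors' analysis code: the function name, the add-one correction, and the default number of permutations are assumptions. Passing `stat=pvariance` tests the variance difference instead of the mean difference.

```python
import random
from statistics import mean, pvariance


def permutation_test(x, y, stat=mean, n_perm=10000, seed=0):
    """Two-sided permutation test for a difference in `stat` between x and y.

    x, y : sequences of per-animal metrics (e.g. slope differences).
    stat : summary statistic to compare (mean, pvariance, ...).
    Returns an estimated p-value.
    """
    rng = random.Random(seed)
    observed = abs(stat(x) - stat(y))
    pooled = list(x) + list(y)
    n_x = len(x)
    hits = 0
    for _ in range(n_perm):
        # Shuffle group labels by shuffling the pooled values.
        rng.shuffle(pooled)
        diff = abs(stat(pooled[:n_x]) - stat(pooled[n_x:]))
        if diff >= observed:
            hits += 1
    # Add-one correction keeps the estimate away from an impossible p = 0.
    return (hits + 1) / (n_perm + 1)
```

      A large p-value here only means that the test did not detect a difference between model and experiment; it does not by itself show that the two distributions are the same.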

      (2) The equations consider an exponentially decaying rate of turning events. If so, Figure 2b should be shown on a semi-logarithmic scale.

      We are happy to add this panel as well.

      (3) The variables in Equations 1-3 and the methods for simulating them are not well defined, making the method difficult to follow. Assuming my reading is correct, Omega should be defined as the cumulative number of turning events over time (Omega(t)), not as a "turn" or "reorientation", which has no derivative. The relevant entity in Figure 1a is apparently <Omega (t)>, i.e. the mean number of events across a population which can be modelled by an expectation value. The time derivative would then give the expected rate of turning events as a function of time.

      Thank you, you are correct. Please see response to Reviewer #1.

      (4) Equations 1-3 are cryptic. The authors need to spell out up front that they are using a pair of coupled stochastic processes, sampling a hidden state M (to model the dynamic turning rate) and the actual turn events, Omega(t), separately, as described in Figure 2a. In this case, the model no longer appears more parsimonious than the original 2-state model. What then is its benefit or explanatory power (especially since the process involving M is not observable experimentally)?

      Thank you, yes we see how as written this was confusing. In our response to Reviewer #1, we added an important detail:

      While reorientations are modeled as discrete events, which is observationally true, the amount of M at time t = 0 is chosen to be large (M<sub>0</sub>←1,000), so that over the timescale of 40 minutes, the decay in M is practically continuous. This ensures that sudden changes in reorientations are not due to sudden changes in M, but due to the inherent stochasticity of reorientations.

      However, you are correct that if M were chosen to have a binary value of 0 or 1, then this would indeed be the two-state model. Adding this as an additional model would be a good way to compare how it matches the experimental data, and we are happy to add it.
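
      The role of M<sub>0</sub> can be illustrated with a small event-driven simulation. Equations 1-3 are not reproduced in this response, so the rules below are assumptions made for illustration only: turns occur at rate alpha·M, and each unit of M is lost at rate gamma. The function name and all parameter values are hypothetical.

```python
import random


def simulate_worm(alpha, gamma, m0=1000, t_end=40.0, seed=None):
    """Gillespie-style simulation of one animal (assumed model, see lead-in).

    Assumed rates: turn events at alpha * M, loss of one unit of M at gamma * M.
    Requires alpha + gamma > 0.
    Returns (turn_times, m_trace), where m_trace is a list of (time, M) pairs.
    """
    rng = random.Random(seed)
    t, m = 0.0, m0
    turn_times = []
    m_trace = [(0.0, m0)]
    while m > 0:
        total_rate = (alpha + gamma) * m
        # Waiting time to the next event of either kind.
        t += rng.expovariate(total_rate)
        if t >= t_end:
            break
        if rng.random() < alpha / (alpha + gamma):
            turn_times.append(t)      # a reorientation; M is unchanged
        else:
            m -= 1                    # one unit of M decays
            m_trace.append((t, m))
    return turn_times, m_trace
```

      With `m0=1000`, M falls in many small steps and its decay is close to continuous; with `m0=1`, the turn rate is alpha until a single decay event and zero afterwards, which is the binary case discussed above.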

      (5) Further, as currently stated in the paper, Equations 1-3 are only for the mean rate of events. However, the expectation value is not a complete description of a stochastic system. Instead, the authors need to formulate the equations for the probability of events, from which they can extract any moment (they write something in Figure 2a, but the notation there is unclear, and this needs to be incorporated here).

      Thank you, yes please see our response to Reviewer #1.

      (6) Equations 1-3 have three constants (alpha and gamma which were fit to the data, and M0 which was presumably set to 1000). How does the choice of M0 affect the results?

      Thank you, this is a good question. We will test this down to a binary state of M as mentioned in comment #4.

      (7) M decays to near 0 over 40 minutes, abolishing omega turns by the end of the simulations. Are omega turns entirely abolished in worms after 30-40 minutes off food? How do the authors reconcile this decay with the leveling of the turning rate in Figure 1a?

      Yes, reviewer #1 recommended adding a baseline reorientation rate which is likely more biologically plausible. However, we should also note that in Klein et al they observed a continuous decay over 50 minutes.

      (8) The fit given in Figure 2b does not look convincing. No statistical test was used to compare the two functions (empirical and fit). No error bars were given (to either). These should be added. In the discussion, the authors explain the discrepancy away as experimental limitations. This is not unreasonable, but on the flip side, makes the argument inconclusive. If the authors could model and simulate these limitations, and show that they account for the discrepancies with the data, the model would be much more compelling. To do this, I would imagine that the authors would need to take the output of their model (lists of turning times) and convert them into simulated trajectories over time. These trajectories could be used to detect boundary events (for a given size of arena), collisions between individuals, etc. in their simulations and to see their effects on the turn statistics.

      Thank you, we will add error bars and perform a permutation test on the mean and variance differences between experiment and model over the 40 minute window.

      (9) The other figures similarly lack any statistical tests and by eye, they do not look convincing. The exception is the 6 anecdotal examples in Figure 2e. Those anecdotal examples match remarkably closely, almost suspiciously so. I'm not sure I understood this though - the caption refers to "different" models of M decay (and at least one of the 6 examples clearly shows a much shallower exponential). If different M models are allowed for each animal, this is no longer parsimonious. Are the results in Figure 2d for a single M model? Can Figure 2e explain the data with a single (stochastic) M model?

      Thank you, yes, we will perform permutation tests on the mean and variance differences in the observed distributions in Figure 2d. We certainly don’t want the panels in Figure 2e to be suspicious! These comparisons were drawn by calculating the correlations between all model traces and all experimental traces, and then choosing the top hits. Every time we run the simulation, we arrive at a different set of examples. Since it was recommended that we add a baseline rate, these examples will be a completely different set when we run the simulation again.

      We apologize for the confusion regarding M. Since the worms do not all start out with identical reorientation rates, we drew the initial M value from a distribution centered on M0 and a variance to match the initial distribution of observed experimental rates.

      (10) The left axes of Figure 2e should be reverted to cumulative counts (without the normalization).

      Thank you, we will add this. We want to clarify that we normalized it because we chose these examples based on correlation to show that the same types of sudden changes in search strategy can occur with a model that doesn’t rely on sudden rate changes.

      (11) The authors give an alternative model of a Levy flight, but do not give the obvious alternative models:

      a) the 1-state model in which P(t) = alpha exp (-gamma t) dt (i.e. a single stochastic process, without a hidden M, collapsing equations 1-3 into a single equation).

      b) the originally proposed 2-state model (with 3 parameters, a high turn rate, a low turn rate, and the local-to-global search transition time, which can be taken from the data, or sampled from the empirical probability distributions). Why not? The former seems necessary to justify the more complicated 2-process model, and the latter seems necessary since it's the model they are trying to replace. Including these two controls would allow them to compare the number of free parameters as well as the model results. I am also surprised by the Levy model since Levy is a family of models. How were the parameters of the Levy walk chosen?

      Thank you, we will remove this section completely, as it is tangential to the main point of the paper.

      (12) One point that is entirely missing in the discussion is the individuality of worms. It is by now well known that individual animals have individual behaviors. Some are slow/fast, and similarly, their turn rates vary. This makes this problem even harder. Combined with the tiny number of events concerned (typically 20-40 per experiment), it seems daunting to determine the underlying model from behavioral statistics alone.

      Thank you, yes we should have been more explicit in the reasoning behind drawing the initial M from a distribution (response to comment #9). We assume that not every worm starts out with the same reorientation rate, but that some start out fast (high M) and some start out slow (low M). However, we do assume M decays with the same kinetics, which seems sufficient to produce the observed phenomena.

      (13) That said, it's well-known which neurons underpin the suppression of turning events (starting already with Gray et al 2005, which, strangely, was not cited here). Some discussion of the neuronal predictions for each of the two (or more) models would be appropriate.

      Thank you, yes we will add Gray et al, but also the more detailed response to Reviewer #2.

      (14) An additional point is the reliance entirely on simulations. A rigorous formulation (of the probability distribution rather than just the mean) should be analytically tractable (at least for the first moment, and possibly higher moments). If higher moments are not obtainable analytically, then the equations should be numerically integrable. It seems strange not to do this.

      Thank you for suggesting this, we will add these analyses.
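
      For the simplest alternative raised in comment 11a, a single inhomogeneous Poisson process with rate alpha·exp(-gamma·t), the first moment has a closed form: the expected cumulative count is the integral of the rate, (alpha/gamma)·(1 - exp(-gamma·t)), and because the count is Poisson distributed its variance equals its mean. The sketch below checks the closed form against a simulation by thinning. It covers only this one-state case, not the coupled model of Equations 1-3, and the function names and parameter values are assumptions.

```python
import math
import random


def expected_turns(alpha, gamma, t):
    """E[Omega(t)] for rate alpha * exp(-gamma * t): the integral of the rate."""
    return alpha / gamma * (1.0 - math.exp(-gamma * t))


def simulate_turn_count(alpha, gamma, t_end, rng):
    """Count events of the inhomogeneous Poisson process by thinning."""
    t, count = 0.0, 0
    while True:
        # Candidate event from a homogeneous process at the bounding rate alpha.
        t += rng.expovariate(alpha)
        if t >= t_end:
            return count
        # Accept with probability rate(t) / alpha = exp(-gamma * t).
        if rng.random() < math.exp(-gamma * t):
            count += 1


def simulated_mean(alpha, gamma, t_end, n_runs=2000, seed=0):
    """Average simulated count over n_runs independent animals."""
    rng = random.Random(seed)
    total = sum(simulate_turn_count(alpha, gamma, t_end, rng) for _ in range(n_runs))
    return total / n_runs
```

      Comparing `simulated_mean(1.0, 0.1, 40.0)` with `expected_turns(1.0, 0.1, 40.0)` gives a quick consistency check between the simulation and the analytic expectation.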

      In summary, while sample simulations do nicely match the examples in the data (of discontinuous vs continuous turning rates), this is not sufficient to demonstrate that the transition from ARS to dispersion in C. elegans is, in fact, likely to be a single 'state', or this (eq 1-3) single state. Of course, the model can be made more complicated to better match the data, but the approach of the authors, seeking an elegant and parsimonious model, is in principle valid, i.e. avoiding a many-parameter model-fitting exercise.

      As a qualitative exercise, the paper might have some merit. It demonstrates that apparently discrete states can sometimes be artifacts of sampling from smoothly time-changing dynamics. However, as a generic point, this is not novel, and so without the grounding in C. elegans data, is less interesting.

      Thank you, we agree that this is a generic phenomenon, which is partly why we did this. The data from López-Cruz et al. seem to agree in part with Calhoun et al., who claim that abrupt transitions occur, and in part with Klein et al., who claim that they do not. Since the underlying phenomenon is stochastic, we propose that the mixed observations of sudden and gradual changes in search strategy are simply the result of a stochastic process, which can produce both phenomena for individual observations.

    1. Author Response

      Reviewer 1:

      Comment 1.1: The distinction of PIGS from nearby OPA, which has also been implied in navigation and ego-motion, is not as clear as it could be.

      Response 1.1: The main functional distinction between TOS/OPA and PIGS is that TOS/OPA responds preferentially to moving vs. stationary stimuli (even concentric rings), likely due to its overlap with the retinotopic motion-selective visual area V3A, for which this is a defining functional property (e.g. Tootell et al., 1997, J Neurosci). In comparison, PIGS does not show such motion selectivity. Instead, PIGS responds preferentially to more complex forms of motion within scenes. In this revision, we tried to better highlight this point in the Discussion (see also the response to the first comment from Reviewer #2).

      Reviewer 2:

      Comment 2.1: First, the scene-selective region identified appears to overlap with regions that have previously been identified in terms of their retinotopic properties. In particular, it is unclear whether this region overlaps with V7/IPS0 and/or IPS1. This is particularly important since prior work has shown that OPA often overlaps with v7/IPS0 (Silson et al, 2016, Journal of Vision). The findings would be much stronger if the authors could show how the location of PIGS relates to retinotopic areas (other than V6, which they do currently consider). I wonder if the authors have retinotopic mapping data for any of the participants included in this study. If not, the authors could always show atlas-based definitions of these areas (e.g. Wang et al, 2015, Cerebral Cortex).

      Response 2.1: We thank the reviewer for reminding us to more clearly delineate this issue of possible overlap, including the information provided by Silson et al, 2016. The issue of possible overlap between area TOS/OPA and the retinotopic visual areas, both in humans and non-human primates, was also clarified by our team in 2011 (Nasr et al., 2011). As you can see in the enclosed figure, and consistent with those previous studies, TOS/OPA overlaps with visual areas V3A/B and V7, whereas PIGS is located more dorsally, close to IPS2-4. As shown here, there is no overlap between PIGS and TOS/OPA and there is no overlap between PIGS and areas V3A/B and V7. To more directly address the reviewer’s concern, in the next revision, we will show the relative position of PIGS and the retinotopic areas in (at least) one individual subject.

      Author response image 1.

      The relative location of PIGS, TOS/OPA and the retinotopic visual areas. The left panel shows the result of high-resolution (7T; voxel size = 1 mm; no spatial smoothing) polar angle mapping in one individual. The right panel shows the location of scene-selective areas PIGS and TOS/OPA in the same subject (7T; voxel size = 1 mm; no spatial smoothing). While area TOS/OPA shows some overlap with the retinotopic visual areas V3A/B and V7, PIGS shows partial overlap with area IPS2-4. In both panels, the activity maps are overlaid on the subject’s own reconstructed brain surface.

      Comment 2.2: Second, recent studies have reported a region anterior to OPA that seems to be involved in scene memory (Steel et al, 2021, Nature Communications; Steel et al, 2023, The Journal of Neuroscience; Steel et al, 2023, biorXiv). Is this region distinct from PIGS? Based on the figures in those papers, the scene memory-related region is inferior to V7/IPS0, so characterizing the location of PIGS to V7/IPS0 as suggested above would be very helpful here as well. If PIGS overlaps with either of V7/IPS0 or the scene memory-related area described by Steel and colleagues, then arguably it is not a newly defined region (although the characterization provided here still provides new information).

      Response 2.2: The lateral-place memory area (LPMA) is located on the lateral brain surface, anterior relative to the IPS (see Figure 1 from Steel et al., 2021 and Figure 3 from Steel et al., 2023). In contrast, PIGS is located on the posterior brain surface, also posterior relative to the IPS. In other words, they are located on two different sides of a major brain sulcus. In this revision we have clarified this point, including the citations by Steel and colleagues.

      Comment 2.3: Another reason that it would be helpful to relate PIGS to this scene memory area is that this scene memory area has been shown to have activity related to the amount of visuospatial context (Steel et al, 2023, The Journal of Neuroscience). The conditions used to show the sensitivity of PIGS to ego-motion also differ in the visuospatial context that can be accessed from the stimuli. Even if PIGS appears distinct from the scene memory area, the degree of visuospatial context is an alternative account of what might be represented in PIGS.

      Response 2.3: The reviewer raises an interesting point. One minor confusion is that we may be inadvertently referring to two slightly different types of “visuospatial context”. Specifically, the stimuli used in the ego-motion experiment here (i.e. coherently vs. incoherently changing scenes) represent the same scenes, and the only difference between the two conditions is the sequence of images across the experimental blocks. In that sense, the two experimental conditions may be considered to have the same visuospatial context. However, it could also be argued that the coherently changing scenes provide more information about the environmental layout. In that case, considering the previous reports that PPA/TPA and RSC/MPA may also be involved in layout encoding (Epstein and Kanwisher 1998; Wolbers et al. 2011), we expected to see more activity within those regions in response to coherently compared to incoherently changing scenes. These issues are now more explicitly discussed in the revised article.

      Reviewer 3:

      Comment 3.1: There are few weaknesses in this work. If pressed, I might say that the stimuli depicting ego-motion do not, strictly speaking, depict motion, but only apparent motion between 2s apart photographs. However, this choice was made to equate frame rates and motion contrast between the 'ego-motion' and a control condition, which is a useful and valid approach to the problem. Some choices for visualization of the results might be made differently; for example, outlines of the regions might be shown in more plots for easier comparison of activation locations, but this is a minor issue.

      Response 3.1: We thank the reviewer for these constructive suggestions, and we agree with their comment that the ego-motion stimuli are not smooth, even though they were refreshed every 100 ms. However, the stimuli were nevertheless coherent enough to activate areas V6 and MT, two major areas known to respond preferentially to coherent compared to incoherent motion.

      Epstein, R., and N. Kanwisher. 1998. 'A cortical representation of the local visual environment', Nature, 392: 598-601.

      Wolbers, T., R. L. Klatzky, J. M. Loomis, M. G. Wutte, and N. A. Giudice. 2011. 'Modality-independent coding of spatial layout in the human brain', Curr Biol, 21: 984-9.

    1. Author response:

      We are very pleased to hear the overall positive views and constructive criticisms of eLife Editors and Reviewers on our work. In particular, we appreciate their global assessment that the work offers a valuable tool for neuroscientists to visualize and assess dendritic computations.

      We will clarify in a revised version of the manuscript that we do not infer the synaptic inputs of the neuron. Also, we will add a new simulation with simpler morphology to illustrate the method under more intuitive conditions. We will also clarify the meaning of the "target" and "reference" compartments. These labels do not depend on the direction of the current flow; rather, we can freely choose any compartment to be the target, and the axial currents are then evaluated relative to that compartment in each time step.

    1. Author response:

      We thank the reviewers for their time and work assessing our manuscript, and for their constructive suggestions for improvements. Based on the reviews, our plan is to adapt the work as follows:

      (1)  Perform a sensitivity analysis considering only confirmed dengue, Zika, and chikungunya cases,

      (2)  Explore and discuss the potential correlation between diseases,

      (3)  Compare the baseline and final models,

      (4)  Assess model fit using a wider variety of metrics.

      We would like to emphasise that our research question was to explore drivers of arbovirus incidence outside of seasonal trends. We therefore designed our models with flexible spatiotemporal random effects to capture baseline patterns, and as the reviewers have highlighted, much of the variance is explained by these random effects. To expand on point 3 above, we will perform a comparison of the baseline random effect models and the final multivariable models to show the differences between the models and quantify the additional impact of the meteorological variables in the final models.

    1. Author response:

      Thank you for the thorough assessment and insightful reviews of our manuscript, "Multi-timescale neural adaptation underlying long-term musculoskeletal reorganization." We are very encouraged by the positive evaluation – particularly the recognition of the study as "important" with "solid" evidence – and we appreciate the constructive feedback provided in the public reviews.

      As requested, we would like to provide this provisional author response to accompany the first version of the Reviewed Preprint. While we plan to provide a detailed point-by-point response upon submission of the revised manuscript, this email outlines our overall revision plan based on the public reviews.

      We found the reviewers' comments to be extremely helpful and largely aligned with our own assessment of areas for clarification and strengthening. We plan a full revision that will address all points raised.

      Regarding Interpretations and Clarity:

      Several comments focused on clarifying key interpretations. We agree with these suggestions and have already incorporated significant textual revisions into the manuscript to:

      More explicitly articulate the proposed multi-timescale model that reconciles the smooth behavioral recovery with the abrupt neural shifts (addressing a core point from R1).

      Refine the interpretation of the compensatory tenodesis strategy, clarifying the distinct neural implementations observed in each monkey and the crucial role of temporal re-timing versus amplitude scaling (addressing points from R1 and R2).

      Correct our interpretation regarding the apparent differences in the "arms race" phenomenon, framing it more parsimoniously in terms of observational windows and individual adaptation rates (addressing R1).

      Ensure consistent and unambiguous terminology (e.g., using "activation profiles") throughout the text and figure captions (addressing R1).

      Carefully adjust language to distinguish between direct empirical findings and interpretations regarding concepts like energetic cost and the drivers of adaptation (addressing R2).

      Explicitly address the potential confound of physical tendon healing, clarifying in the Methods and Discussion why our surgical technique allows us to interpret the findings primarily in terms of neural learning (addressing R3).

      Regarding New Analyses and Data Presentation:

      The reviewers also provided excellent suggestions for new analyses to enhance the rigor and depth of our findings. We plan to undertake these analyses for the full revision, including:

      Adding measures of trial-to-trial variability (e.g., SEM envelopes) and time-lag analysis to our cross-correlation results (addressing R1).

      Performing a point-by-point statistical comparison to better characterize the subtle differences between pre-surgery and final recovered synergy profiles (addressing R1).

      Formally quantifying the baseline behavioral variability between the monkeys (addressing R1).

      Creating a new kinematic plot visualizing the refinement of the tenodesis skill over time (addressing R1).

      Establishing a baseline for normal day-to-day synergy variability by analyzing pre-surgery data (addressing R3).

      Incorporating additional behavioral/kinematic data (pull times and grasp aperture) into Figure 5 to provide a clearer link between neural changes and functional recovery (addressing R2).

      We have also noted the reviewers' suggestions regarding figure clarity and plan improvements where possible. We have already addressed some specific recommendations (e.g., elaborating captions for Figs 6 & 7, adding a supplementary table for muscle acronyms).

      We plan to address the 'Recommendations for the authors' thoroughly during the preparation of the revised manuscript. We are very grateful for all these recommendations, as we are confident they will significantly improve the quality, clarity, and impact of our work. We hope that these comprehensive revisions might also strengthen the final eLife assessment.

    1. Author response:

      The following is the authors’ response to the original reviews.

      Reviewer #1 (Public review):

      Summary:

      The authors propose a new technique which they name "Multi-gradient Permutation Survival Analysis (MEMORY)" that they use to identify "Genes Steadily Associated with Prognosis (GEARs)" using RNA-seq data from the TCGA database. The contribution of this method is one of the key stated aims of the paper. The vast majority of the paper focuses on various downstream analyses that make use of the specific GEARs identified by MEMORY to derive biological insights, with a particular focus on lung adenocarcinoma (LUAD) and breast invasive carcinoma (BRCA) which are stated to be representative of other cancers and are observed to have enriched mitosis and immune signatures, respectively. Through the lens of these cancers, these signatures are the focus of significant investigation in the paper.

      Strengths:

      The approach for MEMORY is well-defined and clearly presented, albeit briefly. This affords statisticians and bioinformaticians the ability to effectively scrutinize the proposed methodology and may lead to further advancements in this field.

      The scientific aspects of the paper (e.g., the results based on the use of MEMORY and the downstream bioinformatics workflows) are conveyed effectively and in a way that is digestible to an individual who is not deeply steeped in the cancer biology field.

      Weaknesses:

      I was surprised that comparatively little of the paper is devoted to the justification of MEMORY (i.e., the authors' method) for the identification of genes that are important broadly for the understanding of cancer. The authors' approach is explained in the methods section of the paper, but no rationale is given for why certain aspects of the method are defined as they are. Moreover, no comparison or reference is made to any other methods that have been developed for similar purposes and no results are shown to illustrate the robustness of the proposed method (e.g., is it sensitive to subtle changes in how it is implemented).

      For example, in the first part of the MEMORY algorithm, gene expression values are dichotomized at the sample median and a log-rank test is performed. This would seemingly result in an unnecessary loss of information for detecting an association between gene expression and survival. Moreover, while dichotomizing at the median is optimal from an information theory perspective (i.e., it creates equally sized groups), there is no reason to believe that median-dichotomization is correct vis-à-vis the relationship between gene expression and survival. If a gene really matters and expression only differentiates survival more towards the tail of the empirical gene expression distribution, median-dichotomization could dramatically lower the power to detect group-wise differences.

      Thanks for these valuable comments! We understand the reviewer’s concern regarding the potential loss of information caused by median-based dichotomization. In this study, we adopted the median as the cut-off value to stratify gene expression levels primarily for the purpose of data balancing and computational simplicity. This approach ensures approximately equal group sizes, which is particularly beneficial in the context of limited sample sizes and repeated sampling. While we acknowledge that this method may discard certain expression nuances, it remains a widely used strategy in survival analysis. To further evaluate and potentially enhance sensitivity, alternative strategies such as percentile-based cutoffs or survival models using continuous expression values (e.g., Cox regression) may be explored in future optimization of the MEMORY pipeline. Nevertheless, we believe that this dichotomization approach offers a straightforward and effective solution for the initial screening of survival-associated genes. We have now included this explanation in the revised manuscript (Lines 391–393).
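
      The screening step under discussion, a median split on expression followed by a log-rank test, can be sketched as follows. This is a textbook two-group log-rank test written with the Python standard library for illustration; it is not the MEMORY implementation, and the function names and the choice to assign values equal to the median to the low group are assumptions.

```python
import math
from statistics import median


def logrank_p(times_a, events_a, times_b, events_b):
    """Two-group log-rank test; events are 1 (death observed) or 0 (censored)."""
    data = [(t, e, 0) for t, e in zip(times_a, events_a)]
    data += [(t, e, 1) for t, e in zip(times_b, events_b)]
    event_times = sorted({t for t, e, _ in data if e})
    obs_a = exp_a = var = 0.0
    for t in event_times:
        # Numbers at risk and numbers of events in each group at time t.
        n_a = sum(1 for ti, _, g in data if ti >= t and g == 0)
        n_b = sum(1 for ti, _, g in data if ti >= t and g == 1)
        d_a = sum(1 for ti, e, g in data if ti == t and e and g == 0)
        d_b = sum(1 for ti, e, g in data if ti == t and e and g == 1)
        n, d = n_a + n_b, d_a + d_b
        obs_a += d_a
        exp_a += d * n_a / n
        if n > 1:
            var += d * (n_a / n) * (n_b / n) * (n - d) / (n - 1)
    if var == 0.0:
        return 1.0
    chi2 = (obs_a - exp_a) ** 2 / var
    # Upper tail of a chi-square distribution with 1 degree of freedom.
    return math.erfc(math.sqrt(chi2 / 2.0))


def median_split_logrank(expression, times, events):
    """Dichotomize patients at the median expression, then compare survival."""
    cut = median(expression)
    high = [i for i, x in enumerate(expression) if x > cut]
    low = [i for i, x in enumerate(expression) if x <= cut]
    return logrank_p([times[i] for i in high], [events[i] for i in high],
                     [times[i] for i in low], [events[i] for i in low])
```

      Replacing `median(expression)` with another cut-off in `median_split_logrank` is one way to probe the reviewer's point that an effect confined to the tail of the expression distribution may be missed by a median split.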

      Specifically, the authors' rationale for translating the Significant Probability Matrix into a set of GEARs warrants some discussion in the paper. If I understand correctly, for each cancer the authors propose to search for the smallest sample size (i.e., the smallest value of k_{j}) where there is at least one gene with a survival analysis p-value <0.05 for each of the 1000 sampled datasets. I base my understanding on the statement "We defined the sampling size k_{j} reached saturation when the max value of column j was equal to 1 in a significant-probability matrix. The least value of k_{j} was selected". Then, any gene with a p-value <0.05 in 80% of the 1000 sampled datasets would be called a GEAR for that cancer. The 80% value here seems arbitrary but that is a minor point. I acknowledge that something must be chosen. More importantly, do the authors believe this logic will work effectively in general? Presumably, the gene with the largest effect for a cancer will define the value of k_{j}, and, if the effect is large, this may result in other genes with smaller effects not being selected for that cancer by virtue of the 80% threshold. One could imagine that a gene that has a small-to-moderate effect consistently across many cancers may not show up as a GEAR broadly if there are genes with more substantive effects for most of the cancers investigated. I am taking the term "Steadily Associated" very literally here as I've constructed a hypothetical where the association is consistent across cancers but not extremely strong. If by "Steadily Associated" the authors really mean "Relatively Large Association", my argument would fall apart but then the definition of a GEAR would perhaps be suboptimal. In this latter case, the proposed approach seems like an indirect way to ensure there is a reasonable effect size for a gene's expression on survival.

      Thank you for the comment and we apologize for the confusion! 𝐴<sub>𝑖𝑗</sub> refers to the value of gene i under gradient j in the significant-probability matrix, primarily used to quantify the statistical probability of association with patient survival for ranking purposes. We believe that GEARs are among the top-ranked genes, but there is no established metric to define the optimal threshold. An 80% threshold has previously been employed as an empirical standard in studies related to survival estimates [1]. In addition, we acknowledge that the determination of the saturation point 𝑘<sub>𝑗</sub> is influenced by the earliest point at which any gene achieves consistent significance across 1000 permutations. We recognize that this may lead to the underrepresentation of genes with moderate but consistent effects, especially in the presence of highly significant genes that dominate the statistical landscape. We therefore empirically used 𝐴<sub>𝑖𝑗</sub> > 0.8 as the threshold to distinguish between GEARs and non-GEARs. Of course, varying this parameter may indeed result in the loss of some GEARs or the inclusion of non-GEARs. We also agree that future studies could investigate alternative metrics and more refined thresholds to improve the application of GEARs.

      Regarding the term ‘Steadily Associated’, we define GEARs based on statistical robustness across subsampled survival analyses within individual cancer types, rather than cross-cancer consistency or pan-cancer moderate effects. Therefore, our operational definition of “steadiness” emphasizes within-cancer reproducibility across sampling gradients, which does not necessarily exclude high-effect-size genes. Nonetheless, we agree that future extensions of MEMORY could incorporate cross-cancer consistency metrics to capture genes with smaller but reproducible pan-cancer effects.
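
      The selection rule discussed in this exchange can be written down compactly. The sketch below follows the reviewer's reading of the quoted definition (saturation at the smallest sample size whose column maximum equals 1) together with the 𝐴<sub>𝑖𝑗</sub> > 0.8 threshold stated above. The function name, gene labels, and matrix values are hypothetical.

```python
def select_gears(sig_prob, genes, sample_sizes, threshold=0.8):
    """Pick the saturation gradient and the genes passing the threshold there.

    sig_prob[i][j] : fraction of resamples at gradient j in which gene i
                     had a survival p-value below 0.05.
    Returns (saturation_sample_size, list_of_selected_genes), or (None, [])
    if no gradient reaches saturation.
    """
    order = sorted(range(len(sample_sizes)), key=lambda j: sample_sizes[j])
    for j in order:
        column = [row[j] for row in sig_prob]
        if max(column) >= 1.0:  # saturation: some gene significant in every resample
            selected = [g for g, a in zip(genes, column) if a > threshold]
            return sample_sizes[j], selected
    return None, []


# Hypothetical example: three genes, three sampling gradients.
genes = ["geneA", "geneB", "geneC"]
sample_sizes = [10, 20, 30]
sig_prob = [
    [0.60, 1.00, 1.00],  # strong effect: saturates at k = 20
    [0.50, 0.85, 0.95],  # passes the 0.8 threshold at k = 20
    [0.10, 0.30, 0.70],  # moderate effect: not selected at k = 20
]
print(select_gears(sig_prob, genes, sample_sizes))
```

      In this toy matrix the strongest gene fixes the saturation point at k = 20, and the moderate-effect gene is excluded there, which is the situation the reviewer describes.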

      The paper contains numerous post-hoc hypothesis tests, statements regarding detected associations and correlations, and statements regarding statistically significant findings based on analyses that would naturally only be conducted in light of positive results from analyses upstream in the overall workflow. Due to the number of statistical tests performed and the fact that the tests are sometimes performed using data-driven subgroups (e.g., the mitosis subgroups), it is highly likely that some of the findings in the work will not be replicable. Of course, this is exploratory science, and is to be expected that some findings won't replicate (the authors even call for further research into key findings). Nonetheless, I would encourage the authors to focus on the quantification of evidence regarding associations or claims (i.e., presenting effect estimates and uncertainty intervals), but to avoid the use of the term statistical significance owing to there being no clear plan to control type I error rates in any systematic way across the diverse analyses there were performed.

      Thank you for the comment! We agree that rigorous control of type-I error is essential once a definitive list of prognostic genes is declared. The current implementation of MEMORY, however, is deliberately positioned as an exploratory screening tool: each gene is evaluated across 10 sampling gradients and 1,000 resamples per gradient, and the only quantity carried forward is its reproducibility probability (A_{ij}).

      Because these probabilities are derived from aggregate “votes” rather than single-pass P-values, the influence of any one unadjusted test is inherently diluted. In other words, whether or not a per-iteration BH adjustment is applied does not materially affect the ranking of genes by reproducibility, which is the key output at this stage. However, we also recognize that a clinically actionable GEARs catalogue will require extensive, large-scale multiple-testing adjustments. Accordingly, future versions of MEMORY will embed a dedicated false-positive control framework tailored to the final GEARs list before any translational application. We have added this point to the ‘Discussion’ in the revised manuscript (Lines 350-359).

      A prespecified analysis plan with hypotheses to be tested (to the extent this was already produced) and a document that defines the complete scope of the scientific endeavor (beyond that which is included in the paper) would strengthen the contribution by providing further context on the totality of the substantial work that has been done. For example, the focus on LUAD and BRCA due to their representativeness could be supplemented by additional information on other cancers that may have been investigated similarly but where results were not presented due to lack of space.

      We thank the reviewer for requesting greater clarity on the analytic workflow. The MEMORY pipeline was fully specified before any results were examined and is described in ‘Methods’ (Lines 386–407). By contrast, the pathway-enrichment and downstream network/mutation analyses were deliberately exploratory: their exact content necessarily depended on which functional categories emerged from the unbiased GEAR screen.

      Our screen revealed a pronounced enrichment of mitotic signatures in LUAD and immune signatures in BRCA.

      We then chose these two cancer types for deeper “case-study” analysis because they had the largest sample sizes among all cancers showing mitotic- or immune-dominated GEAR profiles, and so provided the greatest statistical power for follow-up investigations. We have added this explanation to the revised manuscript (Lines 163, 219-220).

      Reviewer #2 (Public review):

      Summary:

      The authors are trying to come up with a list of genes (GEAR genes) that are consistently associated with cancer patient survival based on TCGA database. A method named "Multi-gradient Permutation Survival Analysis" was created based on bootstrapping and gradually increasing the sample size of the analysis. Only the genes with consistent performance in this analysis process are chosen as potential candidates for further analyses.

      Strengths:

      The authors describe in detail their proposed method and the list of the chosen genes from the analysis. The scientific meaning and potential values of their findings are discussed in the context of published results in this field.

      Weaknesses:

      Some steps of the proposed method (especially the definition of survival analysis similarity (SAS)) need further clarification or details, since it would be difficult for anyone trying to reproduce the results. In addition, the multiplicity (a large number of p-values are generated) needs to be discussed and/or the potential inflation of false findings needs to be part of the manuscript.

      Thank you for the reviewer’s insightful comments. Accordingly, in the revised manuscript, we have provided a more detailed explanation of the definition and calculation of Survival-Analysis Similarity (SAS) to ensure methodological clarity and reproducibility (Lines 411-428), and the full code is now publicly available on GitHub (https://github.com/XinleiCai/MEMORY). We have also expanded the ‘Discussion’ to clarify our position on false-positive control: future releases of MEMORY will incorporate a dedicated framework to control false discoveries in the final GEARs catalogue, which itself will be subjected to rigorous, large-scale multiple-testing adjustment.

      If the authors can improve the clarity of the proposed method and there is no major mistake there, the proposed approach can be applied to other diseases (assuming TCGA type of data is available for them) to identify potential gene lists, based on which drug screening can be performed to identify potential target for development.

      Thank you for the suggestion. All source code has now been made publicly available on GitHub for reference and reuse. We agree that the GEAR lists produced by MEMORY hold considerable promise for drug-screening and target-validation efforts, and the framework could be applied to any disease with TCGA-type data. Of course, we also note that the current GEAR catalogue should first undergo rigorous, large-scale multiple-testing correction to further improve its precision before broader deployment.

      Reviewer #3 (Public review):

      Summary:

      The authors describe a valuable method to find gene sets that may correlate with a patient's survival. This method employs iterative tests of significance across randomised samples with a range of proportions of the original dataset. Those genes that show significance across a range of samples are chosen. Based on these gene sets, hub genes are determined from similarity scores.

      Strengths:

      MEMORY allows them to assess the correlation between a gene and patient prognosis using any available transcriptomic dataset. They present several follow-on analyses and compare the gene sets found to previous studies.

      Weaknesses:

      Unfortunately, the authors have not included sufficient details for others to reproduce this work or use the MEMORY algorithm to find future gene sets, nor to take the gene findings presented forward to be validated or used for future hypotheses.

      Thank you for the reviewer’s comments! We apologize for the inconvenience and the lack of details.

      Following the reviewer’s valuable suggestion, we have now made all source code and relevant scripts publicly available on GitHub to ensure full reproducibility and facilitate future use of the MEMORY algorithm for gene discovery and hypothesis generation.

      Reviewer #4 (Public review):

      The authors apply what I gather is a novel methodology titled "Multi-gradient Permutation Survival Analysis" to identify genes that are robustly associated with prognosis ("GEARs") using tumour expression data from 15 cancer types available in the TCGA. The resulting lists of GEARs are then interrogated for biological insights using a range of techniques including connectivity and gene enrichment analysis.

      I reviewed this paper primarily from a statistical perspective. Evidently, an impressive amount of work has been conducted, and concisely summarised, and great effort has been undertaken to add layers of insight to the findings. I am no stranger to what an undertaking this would have been. My primary concern, however, is that the novel statistical procedure proposed, and applied to identify the gene lists, as far as I can tell offers no statistical error control or quantification. Consequently, we have no sense of what proportion of the highlighted GEAR genes and networks are likely to just be noise.

      Major comments:

      (1) The main methodology used to identify the GEAR genes, "Multi-gradient Permutation Survival Analysis" does not formally account for multiple testing and offers no formal error control. Meaning we are left with no understanding of what the family-wise (aka type 1) error rate is among the GEAR lists, nor the false discovery rate. I would generally recommend against the use of any feature selection methodology that does not provide some form of error quantification and/or control because otherwise we do not know if we are encouraging our colleagues and/or readers to put resources into lists of genes that contain more noise than not. There are numerous statistical techniques available these days that offer error control, including for lists of p-values from arbitrary sets of tests (see expansion on this and some review references below).

      Thank you for your thoughtful and important comment! We fully agree that controlling type I error is critical when identifying gene sets for downstream interpretation or validation. As an exploratory study, our primary aim was to define and screen for GEARs by using the MEMORY framework; however, we acknowledge that the current implementation of MEMORY does not include a formal procedure for error control. Given that MEMORY relies on repeated sampling and counts the frequency of statistically significant p-values, applying standard p-value–based multiple-testing corrections at the individual test level would not meaningfully reduce the false-positive rate in this framework.

      We believe that error control should instead be applied at the level of the final GEAR catalogue. However, we also recognize that conventional correction methods are not directly applicable. In future versions of MEMORY, we plan to incorporate a dedicated and statistically appropriate false-positive control module tailored specifically to the aggregated outputs of the pipeline. We have clarified this point explicitly in the revised manuscript. (Lines 350-359)

      (2) Similarly, no formal significance measure was used to determine which of the strongest "SAS" connections to include as edges in the "Core Survival Network".

      We agree that the edges in the Core Survival Network (CSN) were selected based on the top-ranked SAS values rather than formal statistical thresholds. This was a deliberate design choice, as the CSN was intended as a heuristic similarity network to prioritize genes for downstream molecular classification and biological exploration, not for formal inference. To address potential concerns, we have clarified this intent in the revised manuscript, and we now explicitly state that the network construction was based on empirical ranking rather than statistical significance (Lines 422-425).

      (3) There is, as far as I could tell, no validation of any identified gene lists using an independent dataset external to the presently analysed TCGA data.

      Thank you for the comment. We acknowledge that no independent external dataset was used in the present study to validate the GEARs lists. However, the primary aim of this work was to systematically identify and characterize genes with robust prognostic associations across cancer types using the MEMORY framework. To assess the biological relevance of the resulting GEARs, we conducted extensive downstream analyses including functional enrichment, mutation profiling, immune infiltration comparison, and drug-response correlation. These analyses were performed across multiple cancer types and further supported by a wide range of published literature.

      We believe that this combination of functional characterization and literature validation provides strong initial support for the robustness and relevance of the GEARs lists. Nonetheless, we agree that validation in independent datasets is an important next step, and we plan to carry this out in future work to further strengthen the clinical application of MEMORY.

      (4) There are quite a few places in the methods section where descriptions were not clear (e.g. elements of matrices referred to without defining what the columns and rows are), and I think it would be quite challenging to re-produce some aspects of the procedures as currently described (more detailed notes below).

      We apologize for the confusion. In the revised manuscript, we have provided a clearer and more detailed description of the computational workflow of MEMORY to improve clarity and reproducibility.

      (5) There is a general lack of statistical inference offered. For example, throughout the gene enrichment section of the results, I never saw it stated whether the pathways highlighted are enriched to a significant degree or not.

      We apologize for not clearly stating this information in the original manuscript. In the revised manuscript, we have updated the figure legends to explicitly report the statistical significance of the enriched pathways (Lines 870, 877, 879-880).

      Reviewer #1 (Recommendations for the authors):

      Overall, the paper reads well but there are numerous small grammatical errors that at times cost me non-trivial amounts of time to understand the authors' key messages.

      We apologize for the grammatical errors that hindered clarity. In response, we have thoroughly revised the manuscript for grammar, spelling, and overall language quality.

      Reviewer #2 (Recommendations for the authors):

      Major comments:

      (1) Line 427: survival analysis similarity (SAS) definition. Any reference on this definition and why it is defined this way? Can the SAS value be negative? Based on line 429 definition, if A and B are exactly the same, SAS ~ 1; completely opposite, SAS =0; otherwise, SAS could be any value, positive or negative. So it is hard to tell what SAS is measuring. It is important to make sure SAS can measure the similarity in a systematic and consistent way since it is used as input in the following network analysis.

      We apologize for the confusion caused by the ambiguity in the original SAS formula. The SAS metric was inspired by the Jaccard index, but we modified the denominator to increase contrast between gene pairs. Specifically, the numerator counts the number of permutations in which both genes are simultaneously significant (i.e., both equal to 1), while the denominator is the sum of the total number of significant events for each gene minus twice the shared significant count. An additional +1 term was included in the denominator to avoid division by zero. This formulation ensures that SAS is always non-negative and bounded between 0 and 1, with higher values indicating greater similarity. We have clarified this definition and updated the formula in the revised manuscript (Lines 405-425). 
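
      A literal reading of the verbal description above can be sketched in Python as follows. This is an illustration only: the function name `sas` and the toy vectors are hypothetical, and the formula is transcribed from the sentence above, not from the revised manuscript (Lines 405-425), which remains authoritative.

```python
from typing import Sequence


def sas(a: Sequence[int], b: Sequence[int]) -> float:
    """Survival-analysis similarity of two 0/1 significance vectors (literal reading).

    numerator   = number of permutations where both genes are significant
    denominator = sum(a) + sum(b) - 2 * shared + 1   (the +1 avoids division by zero)
    """
    if len(a) != len(b):
        raise ValueError("significance vectors must have the same length")
    shared = sum(1 for x, y in zip(a, b) if x == 1 and y == 1)
    return shared / (sum(a) + sum(b) - 2 * shared + 1)


# Toy vectors over 5 permutations (hypothetical data).
partly_overlapping = sas([1, 1, 0, 1, 0], [1, 0, 0, 1, 1])  # 2 / (3 + 3 - 4 + 1)
disjoint = sas([1, 0, 0, 0, 0], [0, 1, 0, 0, 0])            # 0 / (1 + 1 - 0 + 1)
```

      Under this literal reading, two identical vectors with n significant events give n / 1. The sketch therefore does not by itself reproduce the stated 0-to-1 bound, so the manuscript's formula may include a normalisation that is not spelled out here.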

      (2) For the method with high dimensional data, multiplicity adjustment needs to be discussed, but it is missing in the manuscript. A 5% p-value cutoff was used across the paper, which seems to be too liberal in this type of analysis. The suggestion is to either use a lower cutoff value or use False Discovery Rate (FDR) control methods for such adjustment. This will reduce the length of the gene list and may help with a more focused discussion.

      We appreciate the reviewer’s suggestion regarding multiplicity. MEMORY is intentionally positioned as an exploratory screen: each gene is tested across 10 sampling gradients and 1,000 resamples, and only its reproducibility probability (A_{ij}) is retained. Because this metric is an aggregate of 1,000 “votes”, the influence of any single unadjusted P-value is already strongly diluted; adding a per-iteration BH/FDR step therefore has negligible impact on the reproducibility ranking that drives all downstream analyses.

      That said, we recognize that a clinically actionable GEARs catalogue must undergo formal, large-scale multiple-testing correction. Future releases of MEMORY will incorporate an error control module applied to the consolidated GEAR list before any translational use. We have now added a statement to this effect in the revised manuscript (Lines 350-359).

      (3) To allow reproducibility from others, please include as many details as possible (software, parameters, modules etc.) for the analyses performed in different steps.

      All source code is now publicly available on GitHub. We have also added the GitHub address to the ‘Online Content’ section.

      Minor comments or queries:

      (4) The manuscript needs to be polished to fix grammar, incomplete sentences, and missing figures.

      Thank you for the suggestion. We have thoroughly proofread the manuscript to correct grammar, complete any unfinished sentences, and restore or renumber all missing figure panels. All figures are now properly referenced in the text.

      (5) Line 131: "survival probability of certain genes" seems to be misleading. Are you talking about its probability of associating with survival (or prognosis)?

      Sorry for the oversight. What we mean is the probability that a gene is found to be significantly associated with survival across the 1,000 resamples. We have revised the statement to “significant probability of certain genes” (Line 102).

      (6) Lines 132, 133: "remained consistent": does the score just need to stay > 0.8 as the sample increases, or does the score need to be monotonically non-decreasing?

      We mean that the score stays above 0.8. We understand that “remained consistent” is confusing and have now revised it to “remained above 0.8”.

      (7) Lines 168-170 how can supplementary figure 5A-K show "a certain degree of correlation with cancer stages"?

      Sorry for the confusion! We have now revised Supplementary Figure 5A–K to support the visual impression with formal statistics. For each cancer type, we built a contingency table of AJCC stage (I–IV) versus hub-gene subgroup (Low, Mid, High) and applied Pearson’s χ² test (Monte Carlo approximation, 10⁵ replicates when any expected cell count < 5). The χ² statistic and p-value are printed beneath every panel; eight of the eleven cancers show a significant association (p-value < 0.05), while LUSC, THCA and PAAD do not. We have replaced the vague phrase “a certain degree of correlation” with this explicit statistical statement in the revised manuscript (Lines 141-143).
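
      The stage-versus-subgroup test described above can be sketched in Python as follows. The response does not say which function the authors used. The counts below are invented toy data, the function names are hypothetical, and the Monte Carlo p-value is approximated by shuffling stage labels, which keeps both table margins fixed. The sketch uses far fewer replicates than the 10⁵ reported.

```python
import random
from typing import List, Sequence

Table = Sequence[Sequence[int]]


def chi_square_statistic(table: Table) -> float:
    """Pearson chi-square statistic for an r x c contingency table (no empty margins)."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            stat += (observed - expected) ** 2 / expected
    return stat


def monte_carlo_p_value(table: Table, replicates: int = 2000, seed: int = 0) -> float:
    """Approximate p-value by permuting column labels (both margins stay fixed)."""
    rng = random.Random(seed)
    n_rows, n_cols = len(table), len(table[0])
    row_labels: List[int] = []
    col_labels: List[int] = []
    for i, row in enumerate(table):
        for j, count in enumerate(row):
            row_labels.extend([i] * count)
            col_labels.extend([j] * count)
    observed = chi_square_statistic(table)
    hits = 0
    for _ in range(replicates):
        rng.shuffle(col_labels)
        simulated = [[0] * n_cols for _ in range(n_rows)]
        for i, j in zip(row_labels, col_labels):
            simulated[i][j] += 1
        # Small tolerance so that tables tied with the observed one count as hits.
        if chi_square_statistic(simulated) >= observed - 1e-9:
            hits += 1
    return (hits + 1) / (replicates + 1)


# Hypothetical counts: rows = hub-gene subgroup (Low, Mid, High), columns = stage I-IV.
toy_table = [
    [30, 20, 10, 5],
    [20, 20, 15, 10],
    [10, 15, 20, 20],
]
toy_statistic = chi_square_statistic(toy_table)
toy_p_value = monte_carlo_p_value(toy_table)
```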

      (8) Lines 172-174: since the hub genes are a subset of GEAR genes through CSN construction, it is not a surprise of the consistency. any explanation about PAAD that is shown only in GOEA with GEARs but not with hub genes?

      Thanks for raising this interesting point! In PAAD the Core Survival Network is unusually diffuse: the top-ranked SAS edges are distributed broadly rather than converging on a single dense module. Because of this flat topology, the ten highest-degree nodes (our hub set) do not form a tightly interconnected cluster, nor are they collectively enriched in the mitosis-related pathway that dominates the full GEAR list. This might explain why the mitotic enrichment is evident when all PAAD GEARs are analyzed but not when the analysis is confined to the far smaller, and more functionally dispersed, hub-gene subset.

      (9) Lines 191: how the classification was performed? Tool? Cutoff values etc?

      The hub-gene-based molecular classification was performed in R using hierarchical clustering. Briefly, we extracted the log2(TPM + 1) expression matrix of hub genes, computed Euclidean distances between samples, and applied Ward’s minimum variance method (hclust, method = "ward.D2"). The resulting dendrogram was then divided into three groups (cutree, k = 3), corresponding to low, mid, and high expression classes. These parameters were selected based on visual inspection of clustering structure across cancer types. We have added this information to the revised ‘Methods’ section (Lines 439-443).

      (10) Lines 210-212: any statistics to support the conclusion? The bar chart of Figure 3B seems to support that all mutations favor ML & MM.

      We agree that formal statistical support is important for interpreting groupwise comparisons. In this case, however, several of the driver events, such as ROS1 and ERBB2, had very small subgroup counts, which violate the assumptions of Pearson’s χ² test. While we explored χ² and Fisher’s exact tests, the results were unstable due to sparse counts. Therefore, we chose to present these distributions descriptively to illustrate the observed subtype preferences across different driver mutations (Figure 3B). We have revised the manuscript text to clarify this point (Lines 182-188).

      (11) Line 216: should supplementary Figure 6H-J be "6H-I"?

      We apologize for the mistake. We have corrected it in the revised manuscript.

      (12) Line 224: incomplete sentence starting with "To further the functional... ".

      Thanks! We have made the revision, and the sentence now reads: “To further explore the functional implications of these mutations, we enriched them using a pathway system called Nested Systems in Tumors (NeST)”.

      (13) Lines 261-263: it is better to report the median instead of the mean. Use log scale data for analysis or use non-parametric methods due to the long tail of the data.

      Thank you for the very helpful suggestion. In the revised manuscript, we now report the median instead of the mean to better reflect the distribution of the data. In addition, we have applied log-scale transformation where appropriate and replaced the original statistical tests with non-parametric Wilcoxon rank-sum tests to account for the long-tailed distribution. These changes have been implemented in both the main text and figure legends (Lines 234–237, Figure 5F).

      (14) Line 430: why based on the first sampling gradient, i.e. k_1 instead of the k_j selected? Or do you mean k_j here?

      Thanks for this question! We deliberately based SAS on the vectors from the first sampling gradient (k_{1}, ≈10% of the cohort). At this smallest sample size, the binary significance patterns still contain substantial variation, and many genes are not significant in every permutation. Based on this, we think the measure can meaningfully identify gene pairs that behave concordantly throughout the gradient permutation.

      We have now added a sentence to clarify this in the Methods section (Lines 398–403).

      (15) Need clarification on how the significant survival network was built.

      Thank you for pointing this out. We have now provided a more detailed clarification of how the Survival-Analysis Similarity (SAS) metric was defined and applied in constructing the core survival network (CSN), including the rationale for key parameter choices (Lines 409–430). Additionally, we have made full source code publicly available on GitHub to facilitate transparency and reproducibility (https://github.com/XinleiCai/MEMORY).

      (16) Line 433: what defines the "significant genes" here? Are they the same as GEAR genes? And what are total genes, all the genes?

      We apologize for the inconsistency in terminology, which may have caused confusion. In this context, “significant genes” refers specifically to the GEARs (Genes Steadily Associated with Prognosis). The SAS values were calculated between each GEAR and all genes. We have revised the manuscript to clarify this by consistently using the term “GEARs” throughout.

      (17) Line 433: more detail on how SAS values were used will be helpful. For example, were pairwise SAS values fed into Cytoscape as an additional data attribute (on top of what is available in TCGA) or as the only data attribute for network building?

      The SAS values were used as the sole metric for defining connections (edges) between genes in the construction of the core survival network (CSN). Specifically, we calculated pairwise SAS values between each GEAR and all other genes, then selected the top 1,000 gene pairs with the highest SAS scores to construct the network. No additional data attributes from TCGA (such as expression levels or clinical features) were used in this step. These selected pairs were imported into Cytoscape solely based on their SAS values to visualize the CSN.

      (18) Line 434: what is "ranking" here, by degree? Is it the same as "nodes with top 10 degrees" at line 436?

      The “ranking” refers specifically to the SAS values between gene pairs. The top 1,000 ranked SAS values were selected to define the edges used in constructing the Core Survival Network (CSN).

      Once the CSN was built, we calculated the degree (number of connections) for each node (i.e., each gene). The “top 10 degrees” mentioned on Line 421 refers to the 10 genes with the highest node degrees in the CSN. These were designated as hub genes for downstream analyses.

      We have clarified this distinction in the revised manuscript (Lines 398-403).

      (19) Line 435: was the network built in Cytoscape? Or built with other tool first and then visualized in Cytoscape?

      The network was constructed in R by selecting the top 1,000 gene pairs with the highest SAS values to define the edges. This edge list was then imported into Cytoscape solely for visualization purposes. No network construction or filtering was performed within Cytoscape itself. We have clarified this in the revised ‘Methods’ section (Lines 424-425).

      (20) Line 436: the degree of each node was calculated; what does "degree" mean here, and is it the same as the number of edges? How does it link to the "higher ranked edges" in Line 165?

      The “degree” of a node refers to the number of edges connected to that node—a standard metric in graph theory used to quantify a node’s centrality or connectivity in the network. It is equivalent to the number of edges a gene shares with others in the CSN.

      The “higher-ranked edges” refer to the top 1,000 gene pairs with the highest SAS values, which we used to construct the Core Survival Network (CSN). The degree for each node was computed within this fixed network, and the top 10 nodes with the highest degree were selected as hub genes. Therefore, the node degree is largely determined by this pre-defined edge set.
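
      The two-step procedure described in these responses (keep the top-ranked SAS pairs as edges, then rank nodes by degree) can be sketched in Python as follows. The gene names, scores, and function names are hypothetical, and the tie-breaking by gene name is an assumption added to make the toy output deterministic. The authors report building the edge list in R with 1,000 edges and 10 hubs.

```python
from collections import Counter
from typing import Dict, List, Tuple

Edge = Tuple[str, str]


def top_edges(sas_scores: Dict[Edge, float], n_edges: int) -> List[Edge]:
    """Keep the n_edges gene pairs with the highest SAS values."""
    ranked = sorted(sas_scores.items(), key=lambda item: item[1], reverse=True)
    return [pair for pair, _ in ranked[:n_edges]]


def hub_genes(edges: List[Edge], n_hubs: int) -> List[str]:
    """Return the n_hubs nodes with the highest degree (number of incident edges)."""
    degree: Counter = Counter()
    for gene_a, gene_b in edges:
        degree[gene_a] += 1
        degree[gene_b] += 1
    # Sort by descending degree, then by name so that ties are resolved reproducibly.
    ranked = sorted(degree.items(), key=lambda item: (-item[1], item[0]))
    return [gene for gene, _ in ranked[:n_hubs]]


# Toy SAS scores for hypothetical gene pairs; the paper keeps the top 1,000 pairs.
toy_scores: Dict[Edge, float] = {
    ("G1", "G2"): 0.9,
    ("G1", "G3"): 0.8,
    ("G1", "G4"): 0.7,
    ("G2", "G3"): 0.6,
    ("G4", "G5"): 0.1,
}
edges = top_edges(toy_scores, n_edges=4)
hubs = hub_genes(edges, n_hubs=2)
```

      Because degree is computed only within the retained edge set, the choice of how many edges to keep directly shapes which genes emerge as hubs.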

      (21) Line 439: does it mean only 1000 SAS values were used or SAS values from 1000 genes, which should come up with 1000 choose 2 pairs (~ half million SAS values).

      We computed the SAS values between each GEAR gene and all other genes, resulting in a large number of pairwise similarity scores. Among these, we selected the top 1,000 gene pairs with the highest SAS values, regardless of how many unique genes were involved, to define the edges in the Core Survival Network (CSN). In other words, the network is constructed from the top 1,000 SAS-ranked gene pairs, not from all possible combinations among 1,000 genes (which would result in nearly half a million pairs). This approach yields a sparse network focused on the strongest co-prognostic relationships.

      We have clarified this in the revised ‘Methods’ section (Lines 409–430).

      (22) Line 496: what tool is used and what are the parameters set for hierarchical clustering if someone would like to reproduce the result?

      The hierarchical clustering was performed in R using the hclust function with Ward's minimum variance method (method = "ward.D2"), based on Euclidean distance computed from the log-transformed expression matrix (log2(TPM + 1)). Cluster assignment was done using the cutree function with k = 3 to define low, mid, and high expression subgroups. These settings have now been explicitly stated in the revised ‘Methods’ section (Lines 439–443) to facilitate reproducibility.

      (23) Lines 901-909: Figure 4 missing panel C. Current panel C seems to be the panel D in the description.

      We apologize for the oversight and have now made the correction (Line 893).

      (24) Lines 920-928: Figure 6C: considering a higher bar to define "significant".

      We agree that applying a more stringent cutoff (e.g., p < 0.01) may reduce potential false positives. However, given the exploratory nature of this study, we believe the current threshold remains appropriate for the purpose of hypothesis generation.

      Reviewer #3 (Recommendations for the authors):

      (1) The title says the genes that are "steadily" associated are identified, but what you mean by the word "steadily" is not defined in the manuscript. Perhaps this could mean that they are consistently associated in different analyses, but multiple analyses are not compared.

      In our manuscript, “steadily associated” refers to genes that consistently show significant associations with patient prognosis across multiple sample sizes and repeated resampling within the MEMORY framework (Lines 65–66). Specifically, each gene is evaluated across 10 sampling gradients (from ~10% to 100% of the cohort) with 1,000 permutations at each level. A gene is defined as a GEAR if its probability of being significantly associated with survival remains ≥ 0.8 throughout the whole permutation process. This stability in signal under extensive resampling is what we refer to as “steadily associated.”

      (2) I think the word "gradient" is not appropriately used as it usually indicates a slope or a rate of change. It seems to indicate a step in the algorithm associated with a sampling proportion.

      Thank you for pointing out the potential ambiguity in our use of the term “gradient.” In our study, we used “gradient” to refer to stepwise increases in the sample proportion used for resampling and analysis. We have now revised it to “progressive”.

      (3) Make it clear that the name "GEARs" is introduced in this publication.

      Done.

      (4) Sometimes the document is hard to understand, for example, the sentence, "As the number of samples increases, the survival probability of certain genes gradually approaches 1." It does not appear to be calculating "gene survival probability" but rather a gene's association with patient survival. Or is it that as the algorithm progresses genes are discarded and therefore do have a survival probability? It is not clear.

      What we intended to describe is the probability that a gene is judged significant in the 1,000 resamples at a given sample-size step, that is, its reproducibility probability in the MEMORY framework. We have now revised the description (Lines 101-104).

      (5) The article lacks significant details, like the type of test used to generate p-values. I assume it is the log-rank test from the R survival package. This should be explicitly stated. It is not clear why the survminer R package is required or what function it has. Are the p-values corrected for multiple hypothesis testing at each sampling?

      We apologize for the lack of details. In each sampling iteration, we used the log-rank test (implemented via the survdiff function in the R survival package) to evaluate the prognostic association of individual genes. This information has now been explicitly added to the revised manuscript.
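
      For readers unfamiliar with the test named above, a minimal two-group log-rank statistic can be written in plain Python as follows. This is a sketch of the standard textbook form, not the authors' code (they report using survdiff in R). How patients are split into two groups per gene is not stated in this response, so the grouping and the toy follow-up times here are assumptions.

```python
import math
from typing import Sequence, Tuple


def log_rank_test(
    times_a: Sequence[float], events_a: Sequence[int],
    times_b: Sequence[float], events_b: Sequence[int],
) -> Tuple[float, float]:
    """Two-group log-rank test; events are 1 (death observed) or 0 (censored).

    Returns (chi-square statistic with 1 degree of freedom, p-value).
    """
    records = [(t, e, 0) for t, e in zip(times_a, events_a)]
    records += [(t, e, 1) for t, e in zip(times_b, events_b)]
    event_times = sorted({t for t, e, _ in records if e == 1})

    observed_a = expected_a = variance = 0.0
    for t in event_times:
        at_risk_a = sum(1 for time, _, g in records if g == 0 and time >= t)
        at_risk_b = sum(1 for time, _, g in records if g == 1 and time >= t)
        at_risk = at_risk_a + at_risk_b
        deaths_a = sum(1 for time, e, g in records if g == 0 and e == 1 and time == t)
        deaths = sum(1 for time, e, _ in records if e == 1 and time == t)
        observed_a += deaths_a
        expected_a += deaths * at_risk_a / at_risk
        if at_risk > 1:  # hypergeometric variance of deaths in group A at time t
            variance += (
                deaths * (at_risk_a / at_risk) * (at_risk_b / at_risk)
                * (at_risk - deaths) / (at_risk - 1)
            )
    if variance == 0:
        return 0.0, 1.0
    statistic = (observed_a - expected_a) ** 2 / variance
    # Survival function of a chi-square distribution with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(statistic / 2))
    return statistic, p_value


# Toy data: group A (e.g. hypothetical high-expression patients) dies earlier than group B.
statistic, p_value = log_rank_test([1, 2, 3], [1, 1, 1], [4, 5, 6], [1, 1, 1])
```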

      The survminer package was originally included for visualization purposes, such as plotting illustrative Kaplan–Meier curves. However, since it did not contribute to the core statistical analysis, we have now removed this package from the Methods section to avoid confusion (Lines 386-407).

      As for multiple-testing correction, we did not adjust p-values in each iteration, because the final selection of GEARs is based on the frequency with which a gene is found significant across 1,000 resamples (i.e., its reproducibility probability). Classical FDR corrections at the per-sample level do not meaningfully affect this aggregate metric. That said, we fully acknowledge the importance of multiple-testing control for the final GEARs catalogue. Future versions of the MEMORY framework will incorporate appropriate adjustment procedures at that stage.

      (6) It is not clear what the survival metric is. Is it overall survival (OS) or progression-free survival (PFS), which would be common choices?

      It’s overall survival (OS).

      (7) The treatment of the patients is never considered, nor whether the sequencing was performed pre- or post-treatment. The patient's survival will be impacted by the treatment that they receive, and many other factors like comorbidities, not just the genomics.

      We initially doubted that any genes could be significantly associated with patient survival (GEARs) without accounting for so many influential factors. This is exactly what motivated us to develop MEMORY. However, this work shows that we were wrong, and it demonstrates the power of GEARs in determining patient survival. We fully agree with the reviewer that incorporating therapy variables and other clinical covariates will further improve the power of MEMORY analyses.

      (8) As a paper that introduces a new analysis method, it should contain some comparison with existing state of the art, or perhaps randomised data.

      Our understanding is that MEMORY is an exploratory, proof-of-concept framework, so a direct comparison with conventional survival analyses does not seem appropriate. We have added some discussion to the revised manuscript (Lines 350-359).

      (9) In the discussion it reads, "it remains uncertain whether there exists a set of genes steadily associated with cancer prognosis, regardless of sample size and other factors." Of course, there are many other factors that may alter the consistency of important cancer genes, but sample size is not one of them. Sample size merely determines whether your study has sufficient power to detect certain gene effects; it does not affect whether genes are steadily associated with cancer prognosis in different analyses. (Of course, this does depend on what you mean by "steadily".)

      We fully agree with the reviewer that sample size itself does not alter a gene’s biological association with prognosis; it only affects the statistical power to detect that association. Because this study is exploratory and we were initially uncertain whether GEARs existed, we first examined the impact of sample-size variation, a dominant yet experimentally tractable source of heterogeneity, before considering other, less controllable factors.

      Reviewer #4 (Recommendations for the authors):

      Other more detailed comments:

      (1) Introduction

      L93: When listing reasons why genes do not replicate across different cohorts / datasets, there is also the simple fact that some could be false positives

      We totally agree that some genes may simply represent false-positive findings apart from biological heterogeneity and technical differences between cohorts. Although the MEMORY framework reduces this risk by requiring high reproducibility across 1,000 resamples and multiple sample-size tiers, it cannot eliminate false positives completely. We have added some discussion and explicitly note that external validation in independent datasets is essential for confirming any GEAR before clinical application.

      (2) Results Section

      L143: Language like "We also identified the most significant GEARs in individual cancer types" I think is potentially misleading since the "GEAR" lists do not have formal statistical significance attached.

      We removed “significant” and revised it to “top 1” (Line 115).

      L153 onward: The pathway analysis results reported do not include any measures of how statistically significant the enrichment was.

      We have now updated the figure legends to clearly indicate that the displayed pathways represent the top significantly enriched results based on adjusted p-values from GO enrichment analyses (Lines 876-878).

      L168: "A certain degree of correlation with cancer stages (TNM stages) is observed in most cancer types except for COAD, LUSC and PRAD". For statements like this statistical significance should be mentioned in the same sentence or, if these correlations failed to reach significance, that should be explicitly stated.

      In the revised Supplementary Figure 5A–K, we now accompany the visual trends with formal statistical testing. Specifically, for each cancer type, we constructed a contingency table of AJCC stage (I–IV) versus hub-gene subgroup (Low, Mid, High) and applied Pearson’s χ<sup>2</sup> test (using Monte Carlo approximation with 10⁵ replicates if any expected cell count was < 5). The resulting χ<sup>2</sup> statistic and p-value are printed beneath each panel. Of the eleven cancer types analyzed, eight showed statistically significant associations (p < 0.05), while COAD, LUSC, and PRAD did not. Accordingly, we have made the revision in the manuscript (Lines 137-139).
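
      The stage-versus-subgroup test described above can be sketched as follows. This is a hedged illustration rather than the authors' implementation: the function names, the label-shuffling scheme used for the Monte Carlo approximation, and the default number of replicates are assumptions chosen to keep the example short and self-contained.

```python
import random
from collections import Counter

def chi_square_statistic(rows, cols):
    """Pearson chi-square statistic for two parallel lists of category labels."""
    n = len(rows)
    row_totals = Counter(rows)
    col_totals = Counter(cols)
    observed = Counter(zip(rows, cols))
    stat = 0.0
    for r, r_total in row_totals.items():
        for c, c_total in col_totals.items():
            expected = r_total * c_total / n
            stat += (observed[(r, c)] - expected) ** 2 / expected
    return stat

def monte_carlo_p_value(rows, cols, replicates=2000, seed=0):
    """Approximate the p-value by shuffling one label vector.

    Shuffling keeps both sets of marginal totals fixed, so each shuffled
    table is a draw under the hypothesis of no association.
    """
    rng = random.Random(seed)
    observed_stat = chi_square_statistic(rows, cols)
    shuffled = list(cols)
    hits = 0
    for _ in range(replicates):
        rng.shuffle(shuffled)
        if chi_square_statistic(rows, shuffled) >= observed_stat - 1e-12:
            hits += 1
    # Add-one correction so the estimate is never exactly zero
    return observed_stat, (hits + 1) / (replicates + 1)
```

      Here `rows` would hold each patient's stage label and `cols` the hub-gene subgroup label; the simulated p-value is most useful when some expected cell counts are small.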

      L171-176: When mentioning which pathways are enriched among the gene lists, please clarify whether these levels of enrichment are statistically significant or not. If the enrichment is significant, please indicate to what degree, and if not I would not mention.

      We agree that the statistical significance of pathway enrichment should be clearly stated and have made the revision throughout the manuscript (Lines 869, 875, 877).

      (3) Methods Section

      L406 - 418: I did not really understand, nor see it explained, what is the motivation and value of cycling through 10%, 20% bootstrapped proportions of patients in the "gradient" approach? I did not see this justified, or motivated by any pre-existing statistical methodology/results. I do not follow the benefit compared to just doing one analysis of all available samples, and using the statistical inference we get "for free" from the survival analysis p-values to quantify sampling uncertainty.

      The ten step-wise sample fractions (10 % to 100 %) allow us to transform each gene’s single log-rank P-value into a reproducibility probability: at every fraction we repeat the test 1,000 times and record the proportion of permutations in which the gene is significant. This learning-curve-style resampling not only quantifies how consistently a gene associates with survival under different power conditions but also produces the 0/1 vectors required to compute Survival-Analysis Similarity (SAS) and build the Core Survival Network. A single one-off analysis on the full cohort would yield only one P-value per gene, providing no binary vectors at all—hence no basis for calculating SAS or constructing the network. 
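
      The step-wise resampling described above can be sketched in a few lines. This is a sketch under stated assumptions, not the MEMORY code: `resample_votes` and `p_value_fn` are hypothetical names, subsets are drawn without replacement (the response does not say which scheme is used), and `p_value_fn` stands in for whatever survival test is applied to each subset.

```python
import random

def resample_votes(samples, p_value_fn, fractions, n_resamples=1000,
                   alpha=0.05, seed=0):
    """Return {fraction: list of 0/1 votes} for one gene.

    samples    : list of patient records (any objects)
    p_value_fn : callable taking a list of records and returning a p-value
    fractions  : sample-size tiers, e.g. [0.1, 0.2, ..., 1.0]
    """
    rng = random.Random(seed)
    votes = {}
    for fraction in fractions:
        size = max(1, round(fraction * len(samples)))
        votes[fraction] = [
            # Assumption: subsets are drawn without replacement
            1 if p_value_fn(rng.sample(samples, size)) < alpha else 0
            for _ in range(n_resamples)
        ]
    return votes
```

      Each tier therefore produces a vector of 0/1 outcomes, which is the raw material for both the reproducibility probability and the pairwise similarity between genes.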

      L417: I assume p < 0.05 in the survival analysis means the nominal p-value, unadjusted for multiple testing. Since we are in the context of many tests please explicitly state if so.

      Yes, p < 0.05 refers to the nominal, unadjusted p-value from each log-rank test within a single permutation. In MEMORY these raw p-values are converted immediately into 0/1 “votes” and aggregated over 1,000 permutations and ten sample-size tiers; only the resulting reproducibility probability (𝐴<sub>𝑖𝑗</sub>) is carried forward. No multiple-testing adjustment is applied at the individual-test level, because a per-iteration FDR or BH step would not materially affect the final 𝐴<sub>𝑖𝑗</sub> ranking. We have revised the manuscript (Line 396).
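
      As a small worked example of the vote aggregation, the sketch below turns nominal p-values into 0/1 votes and then into a reproducibility probability for one gene at one tier. The function names are hypothetical, and `tiers_passing` only reports which tiers reach a threshold; how those tiers are combined into a final GEAR call follows the manuscript's definition, which this sketch does not attempt to reproduce.

```python
def reproducibility_probability(p_values, alpha=0.05):
    """Fraction of resamples in which the nominal p-value is below alpha."""
    votes = [1 if p < alpha else 0 for p in p_values]
    return sum(votes) / len(votes)

def tiers_passing(prob_by_tier, threshold=0.8):
    """Indices of the sample-size tiers whose probability reaches the threshold."""
    return [j for j, prob in enumerate(prob_by_tier) if prob >= threshold]
```

      For example, five resamples with p-values 0.01, 0.2, 0.03, 0.04 and 0.5 give three votes out of five, that is, a reproducibility probability of 0.6 for that tier.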

      L419-426: I did not see defined what the rows are and what the columns are in the "significant-probability matrix". Are rows genes, columns cancer types? Consequently I was not really sure what actually makes a "GEAR". Is it achieving a significance probability of 0.8 across all 15 cancer subtypes? Or in just one of the tumour datasets?

      In the significant-probability matrix, each row represents a gene, and each column corresponds to a sampling gradient (i.e., increasing sample-size tiers from ~10% to 100%) within a single cancer type. The matrix is constructed independently for each cancer.

      A GEAR is defined as a gene that achieves a significance probability of at least 0.8 within a single tumor type; it does not need to reach this threshold across all 15 cancer subtypes.

      L426: The significance probability threshold of 0.8 across 1,000 bootstrapped nominal tests --- used to define the GEAR lists --- has, as far as I can tell, no formal justification. Conceptually, the "significance probability" reflects uncertainty in the patients being used (if I follow their procedure correctly), but as mentioned above, a classical p-value is also designed to reflect sampling uncertainty. So why use the bootstrapping at all?

      Moreover, the 0.8 threshold is applied on a per-gene basis, so there is no apparent procedure "built in" to adapt to (and account for) different total numbers of genes being tested. Can the authors quantify the false discovery rate associated with this GEAR selection procedure e.g. by running for data with permuted outcome labels? And why do the gradient / bootstrapping at all? Why not just run the nominal survival p-values through a simple Benjamini-Hochberg procedure, and then apply an FDR threshold to define the GEAR lists? Then you would have both multiplicity and error control for the final lists. As it stands, with no form of error control or quantification of noise rates in the GEAR lists I would not recommend promoting their use. There is a long history of variable selection techniques, and various options the authors could have used that would have provided formal error rates for the final GEAR lists (see seminal reviews by eg Heinze et al 2018 Biometrical Journal, or O'Hara and Sillanpaa, 2009, Bayesian Analysis), including, as I say, simple application of a Benjamini-Hochberg to achieve multiplicity adjusted FDR control.

      Thank you. We chose the 10 × 1,000 resampling scheme to ask a different question from a single Benjamini–Hochberg scan: does a gene keep re-appearing as significant when cohort composition and statistical power vary from 10 % to 100 % of the data? Converting the 1,000 nominal p-values at each sample fraction into a reproducibility probability 𝐴<sub>𝑖𝑗</sub> allows us to screen for signals that are stable across wide sampling uncertainty rather than relying on one pass through the full cohort. The 0.8 cut-off is an intentionally strict, empirically accepted robustness threshold (analogous to stability-selection); under the global null the chance of exceeding it in 1,000 draws is effectively zero, so the procedure is already highly conservative even before any gene-wise multiplicity correction [1]. Once MEMORY moves beyond this exploratory stage and a final, clinically actionable GEAR catalogue is required, we will add a formal FDR layer after the robustness screen, but for the present proof-of-concept study, we retain the resampling step specifically to capture stability rather than to serve as definitive error control.
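
      The statement that exceeding the 0.8 cut-off is very unlikely under the global null can be illustrated with a binomial tail, under a strong simplifying assumption that is not made in the response: that the 1,000 draws are independent and each is nominally significant with probability 0.05. Resamples drawn from one cohort overlap and are therefore not independent, so the sketch below is an idealized calculation, not an error rate for the procedure. The function name is hypothetical.

```python
import math

def log10_binomial_tail(n, k_min, p):
    """log10 of P(X >= k_min) for X ~ Binomial(n, p), computed in log space.

    Working with logarithms avoids floating-point underflow when the
    tail probability is extremely small.
    """
    log_terms = [
        math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
        + k * math.log(p) + (n - k) * math.log(1 - p)
        for k in range(k_min, n + 1)
    ]
    peak = max(log_terms)
    # Log-sum-exp: factor out the largest term before summing
    total = peak + math.log(sum(math.exp(t - peak) for t in log_terms))
    return total / math.log(10)
```

      Calling `log10_binomial_tail(1000, 800, 0.05)` returns the base-10 logarithm of the probability that at least 800 of 1,000 independent null tests are nominally significant.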

      L427-433: I gathered that SAS reflects, for a particular pair of genes, how likely they are to be jointly significant across bootstraps. If so, perhaps this description or similar could be added since I found a "conceptual" description lacking which would have helped when reading through the maths. Does it make sense to also reflect joint significance across multiple cancer types in the SAS? Or did I miss it and this is already reflected?

      SAS is indeed meant to quantify, within a single cancer type, how consistently two genes are jointly significant across the 1,000 bootstrap resamples performed at a given sample-size tier. In other words, SAS is the empirical probability that the two genes “co-light-up” in the same permutation, providing a measure of shared prognostic behavior beyond what either gene shows alone. We have added this plain language description to the ‘Methods’ (Lines 405-418).

      In the current implementation SAS is calculated separately for each cancer type; it does not aggregate co-significance across different cancers. Extending SAS to capture joint reproducibility across multiple tumor types is an interesting idea, especially for identifying pan-cancer gene pairs, and we note this as a potential future enhancement of the MEMORY pipeline.
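
      Taking the plain-language description above literally, co-significance of two genes within one cancer type and one sample-size tier can be sketched as the fraction of resamples in which both 0/1 votes equal 1. This is only an interpretation for illustration; the exact SAS formula is given in the manuscript's Methods and may differ, and the function name is hypothetical.

```python
def co_significance(votes_a, votes_b):
    """Fraction of resamples in which both genes are nominally significant.

    votes_a, votes_b : equal-length lists of 0/1 votes from the same resamples
    """
    if len(votes_a) != len(votes_b):
        raise ValueError("vote vectors must come from the same resamples")
    both = sum(1 for a, b in zip(votes_a, votes_b) if a == 1 and b == 1)
    return both / len(votes_a)
```

      Because the two vote vectors must be aligned resample by resample, the calculation is only meaningful when both genes were tested on exactly the same subsets.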

      L432: "The SAS of significant genes with total genes was calculated, and the significant survival network was constructed" Are the "significant genes" the "GEAR" list extracted above according to the 0.8 threshold? If so, and this is a bit pedantic, I do not think they should be referred to as "significant genes" and that this phrase should be reserved for formal statistical significance.

      We have replaced “significant genes” with “GEAR genes” to avoid any confusion (Lines 421-422).

      L434: "some SAS values at the top of the rankings were extracted, and the SAS was visualized to a network by Cytoscape. The network was named core survival network (CSN)". I did not see it explicitly stated which nodes actually go into the CSN. The entire GEAR list? What threshold is applied to SAS values in order to determine which edges to include? How was that threshold chosen? Was it data driven? For readers not familiar with what Cytoscape is and how it works could you offer more of an explanation in-text please? I gather it is simply a piece of network visualisation/wrangling software and does not annotate additional information (e.g. external experimental data), which I think is an important point to clarify in the article without needing to look up the reference.

      We have now clarified these points in the revised ‘Methods’ section, including how the SAS threshold was selected and which nodes were included in the Core Survival Network (CSN). Specifically, the CSN was constructed using the top 1,000 gene pairs with the highest SAS values. This threshold was not determined by a fixed numerical cutoff, but rather chosen empirically after comparing networks built with varying numbers of edges (250, 500, 1,000, 2,000, 6,000, and 8,000; see Reviewer-only Figure 1). We observed that, while increasing the number of edges led to denser networks, the set of hub genes remained largely stable. Therefore, we selected 1,000 edges as a balanced compromise between capturing sufficient biological information and maintaining computational efficiency and interpretability.

      The resulting node list (i.e., the genes present in those top-ranked pairs) is provided in Supplementary Table 4. Cytoscape was used solely as a network visualization platform, and no external annotations or experimental data were added at this stage. We have added a brief clarification in the main text to help readers understand.
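
      The edge selection described above can be sketched as follows. The sketch assumes that pairwise scores are stored in a dictionary keyed by gene pair and that node degree is used as a simple proxy for hub status; both are assumptions made here for illustration, and the function names are hypothetical.

```python
from collections import Counter

def top_edges(pair_scores, k):
    """Return the k gene pairs with the highest scores.

    pair_scores : dict mapping (gene_a, gene_b) -> similarity score
    """
    ranked = sorted(pair_scores.items(), key=lambda item: item[1], reverse=True)
    return [pair for pair, _ in ranked[:k]]

def node_degrees(edges):
    """Count how many selected edges touch each gene."""
    degree = Counter()
    for gene_a, gene_b in edges:
        degree[gene_a] += 1
        degree[gene_b] += 1
    return degree
```

      The genes appearing in `top_edges(pair_scores, 1000)` would form the node list of the network, and `node_degrees(...).most_common()` orders them by connectivity.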

      L437: "The effect of molecular classification by hub genes is indicated that 1000 to 2000 was a range that the result of molecular classification was best." Can you clarify how "best" is assessed here, i.e. by what metric and with which data?

      We apologize for the confusion. Upon constructing the network, we observed that the number of edges affected both the selection of hub genes and the computational complexity. We analyzed the networks with 250, 500, 1,000, 2,000, 6,000 and 8,000 edges, and found that the differences in selected hub genes were small (Author response image 1). Although the networks with fewer edges had lower computational complexity, we chose the network with 1,000 edges as a practical compromise between retaining sufficient biological information, in terms of the relevance of the hub genes, and keeping the computational complexity manageable.

      Author response image 1.

      Intersection of the networks constructed with various numbers of edges.
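
      One simple way to quantify how stable the hub genes are across edge counts is to intersect the hub sets and compute pairwise overlap. The sketch below is an assumption about how such a comparison could be done, not a description of how the image was produced; the function names and the example gene labels are hypothetical.

```python
def shared_hubs(hubs_by_edge_count):
    """Genes that are hubs in every network.

    hubs_by_edge_count : dict mapping number of edges -> iterable of hub genes
    """
    sets = [set(genes) for genes in hubs_by_edge_count.values()]
    return set.intersection(*sets) if sets else set()

def jaccard(set_a, set_b):
    """Jaccard overlap between two hub sets (1.0 means identical)."""
    set_a, set_b = set(set_a), set(set_b)
    union = set_a | set_b
    return len(set_a & set_b) / len(union) if union else 1.0
```

      For instance, `shared_hubs({500: ["G1", "G2"], 1000: ["G1", "G2", "G3"]})` returns the two genes common to both networks.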

      References

      (1) Gebski, V., Garès, V., Gibbs, E. & Byth, K. Data maturity and follow-up in time-to-event analyses. International Journal of Epidemiology 47, 850–859 (2018).

    1. Author response:

      We thank the reviewers for their comments. We are paraphrasing their three main criticisms below and provide responses and outlines of how we are going to address them.

      Criticism 1: Actin binding by Shot may not be required for Shot's function in dendritic microtubule organization (Point 1 by Reviewer 1, points 6-8 by reviewer 2).

      This criticism is mainly based on our finding that, while a version of Shot lacking just the high affinity actin binding site cannot rescue the pruning and orientation defects of shot<sup>3</sup> mutants, expression of a construct harboring just the microtubule and EB1 binding sites can. The reviewers also point out that a Shot construct lacking one of its actin binding domains (deltaCH1) causes pruning defects when overexpressed in wild type cells.

      We thank the reviewers for this comment. We concede that we did not properly explain our reasoning and conclusions regarding the role of actin binding in Shot dendritic function. From the literature, there is evidence that Shot fragments containing the C-terminal microtubule binding domain alone have positive effects on neuronal microtubule stability and organization by a gain-of-function mechanism. This is likely due to two reasons: firstly, the activity of these constructs is unrestrained by localization. For example, in axons, full length Shot localizes adjacent to the membrane and to growth cones, while a Shot C-terminal construct (lacking the actin-binding and spectrin-repeat domains) decorates axonal microtubules [1]. Secondly, the actin binding site appears to inhibit microtubule binding by an intramolecular mechanism that is relieved by actin binding [2]. Overexpression of such a construct also dramatically improves axonal microtubule defects in aged neurons [3]. Thus, actin recruitment may locally activate Shot's microtubule binding activity.

      To address this criticism, we will test if other UAS-Shot transgenes lacking the actin binding or microtubule binding domains can rescue the defects of Shot mutants. We will also try to provide more evidence that the C-terminal Shot construct exerts a gain-of-function effect on microtubules. We will adjust our interpretation accordingly.

      Criticism 2: The relationship between reversal of dendritic microtubule orientation and dendrite pruning defects could be correlative rather than causal (paragraph 1 by Reviewer 1, point 5 by reviewer 2).

      This criticism is based on our finding that Mical overexpression causes a partial reversal of dendritic microtubule orientation but no apparent dendrite pruning defects.

      We thank the reviewers for this comment. In fact, knockdown of EB1, which affects dendritic microtubule organisation via kinesin-2 [4], does not cause dendrite pruning defects by itself either, but strongly enhances the pruning defects caused by other microtubule manipulations [5]. This is likely because loss of EB1 destabilizes the dendritic cytoskeleton and thus also promotes dendrite degeneration. All other conditions that cause dendritic microtubule reversal also cause dendrite pruning defects [5-9]. As Mical is a known pruning factor [10], its overexpression may actually also destabilize dendrites, e.g., by severing actin filaments. However, we showed in the current manuscript that Mical overexpression causes a partial reversal of dendritic microtubule polarity and strongly enhances the dendrite pruning defects caused by Shot knockdown.

      To address this criticism, we will rephrase the corresponding section of our manuscript and specify that conditions that cause reversal of dendritic microtubule orientation either cause dendrite pruning defects, or act as genetic enhancers of pruning defects caused by other microtubule regulators. This wording better explains the relationship between dendritic microtubule orientation and dendrite pruning and also includes the Mical overexpression condition.

      Criticism 3: The presented data do not prove that Shot, Rab11 and Patronin act in a common pathway to establish dendritic plus end-in microtubule orientation (paragraphs 2-3 by Reviewer 1, point 1-4 by reviewer 2).

      While these factors genetically interact with each other during dendrite pruning, it is not clear whether (1) they colocalize at the tips of growing dendrites during early growth stages; (2) their respective localizations depend on each other; (3) they act at the same developmental stage in microtubule orientation.  

      We thank the reviewers for this comment. For technical reasons (e.g., incompatible transgenes, GAL4 drivers too weak), we could only partially address these questions at the time. We have now expanded our toolkit with additional drivers and fluorescently tagged transgenes. We will therefore test whether Shot and Rab11 or Patronin and Rab11 colocalize in growing dendrites during the early L1 stage, and if loss of Shot affects the localization or the activity of Patronin and Rab11 in dendrites. We will adapt our interpretation accordingly, and also add a comprehensive model.

      References

      (1) Alves Silva et al. (2012) J. Neurosci. 32:9143

      (2) Applewhite et al. (2013) Mol. Biol. Cell 24:2885

      (3) Okenve-Ramos et al. (2024) PLoS Biol. 22:e3002504

      (4) Mattie et al. (2010) Curr. Biol. 20:2169

      (5) Herzmann et al. (2018) Development 145:dev156950

      (6) Wang et al. (2019) eLife 8:e39964

      (7) Rui et al. (2020) EMBO Rep. 21:e48843

      (8) Tang et al. (2020) EMBO J. 39:e103549

      (9) Bu et al. (2022) Cell Rep. 39:110887

      (10) Kirilly et al. (2009) Nat. Neurosci. 12:1497

    1. Author response:

      Reviewer #1 (Public review):

      Weaknesses:

      The technical approach is strong and the conceptual framing is compelling, but several aspects of the evidence remain incomplete. In particular, it is unclear whether the reported changes in connectivity truly capture causal influences, as the rank metrics remain correlational and show discrepancies with the manipulation results.

      We agree that our functional connectivity ranking analyses cannot establish causal influences. As discussed in the manuscript, besides learning-related activity changes, the functional connectivity may also be influenced by neuromodulatory systems and internal state fluctuations. In addition, the spatial scope of our recordings is still limited compared to the full network implicated in visual discrimination learning, which may bias the ranking estimates. In future, we aim to achieve broader region coverage and integrate multiple complementary analyses to address the causal contribution of each region.

      The absolute response onset latencies also appear slow for sensory-guided behavior in mice, and it is not clear whether this reflects the method used to define onset timing or factors such as task structure or internal state.

      We believe this may be primarily due to our conservative definition of onset timing. Specifically, we required the firing rate to exceed baseline (t-test, p < 0.05) for at least 3 consecutive 25-ms time windows. This might lead to later estimates than the definitions used in other studies, such as the latency to the first spike after visual stimulus onset (~50-60 ms, Siegle et al., Nature, 2023) or the time to half-max response (~65 ms, Goldbach et al., eLife, 2021).
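
      The consecutive-window criterion can be sketched as a search for the first run of significant windows. The sketch assumes that the per-window significance flags (for example from a t-test against baseline at p < 0.05) have already been computed, and that latency is reported as the start of the first window in the run; the function names and that reporting convention are assumptions, not details given in the response.

```python
def onset_window(significant, min_run=3):
    """Index of the first window that starts a run of `min_run` significant windows.

    significant : sequence of booleans, one per consecutive time window
    Returns None if no such run exists.
    """
    run = 0
    for i, flag in enumerate(significant):
        run = run + 1 if flag else 0
        if run == min_run:
            return i - min_run + 1
    return None

def onset_latency_ms(significant, window_ms=25, min_run=3):
    """Latency in ms from stimulus onset to the start of the first qualifying run."""
    index = onset_window(significant, min_run)
    return None if index is None else index * window_ms
```

      With flags `[False, True, False, True, True, True, True]`, the isolated significant window at index 1 is ignored and the run starting at index 3 defines the onset, which shows why this criterion tends to give later estimates than a first-spike measure.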

      Furthermore, the small number of animals, combined with extensive repeated measures, raises questions about statistical independence and how multiple comparisons were controlled.

      We agree that a larger sample size would strengthen the robustness of the findings. However, as noted above, the current dataset has inherent limitations in both the number of recorded regions and the behavioral paradigm. Given the considerable effort required to achieve sufficient unit yields across all targeted regions, we wish to adjust the set of recorded regions, improve behavioral task design, and implement better analyses in future studies. This will allow us to both increase the number of animals and extract more precise insights into mesoscale dynamics during learning.

      The optogenetic experiments, while intended to test the functional relevance of rank increasing regions, leave it unclear how effectively the targeted circuits were silenced. Without direct evidence of reliable local inhibition, the behavioral effects or lack thereof are difficult to interpret.

      We appreciate this important point. Due to the design of the flexible electrodes and the implantation procedure, bilateral co-implantation of both electrodes and optical fibers was challenging, which prevented us from directly validating the inhibition effect in the same animals used for behavior. In hindsight, we could have conducted parallel validations using conventional electrodes, and we will incorporate such controls in future work to provide direct evidence of manipulation efficacy.

      Details on spike sorting are limited.

      We will provide more details on spike sorting, including the exact parameters used in the automated sorting algorithm and the subsequent manual curation criteria.

      Reviewer #2 (Public review):

      Weaknesses:

      I had several major concerns:

      (1) The number of mice was small for the ephys recordings. Although the authors start with 7 mice in Figure 1, they then reduce to 5 in panel F. And in their main analysis, they minimize their analysis to 6/7 sessions from 3 mice only. I couldn't find a rationale for this reduction, but in the methods they do mention that 2 mice were used for fruitless training, which I found no mention in the results. Moreover, in the early case, all of the analysis is from 118 CR trials taken from 3 mice. In general, this is a rather low number of mice and trial numbers. I think it is quite essential to add more mice.

      We apologize for the confusion. As described in the Methods section, 7 mice (Figure 1B) were used for behavioral training without electrode array or optical fiber implants to establish learning curves, and an additional 5 mice underwent electrophysiological recordings (3 for visual-based decision-making learning and 2 for fruitless learning).

      As we noted in our response to Reviewer #1, the current dataset has inherent limitations in both the number of recorded regions and the behavioral paradigm. Given the considerable effort required to achieve high-quality unit yields across all targeted regions, we wish to adjust the set of recorded regions, improve behavioral task design, and implement better analyses in future studies. These improvements will enable us to collect data from a larger sample size and extract more precise insights into mesoscale dynamics during learning.

      (2) Movement analysis was not sufficient. Mice learning a go/no-go task establish a movement strategy that is developed throughout learning and is also biased towards Hit trials. There is an analysis of movement in Figure S4, but this is rather superficial. I was not even sure that the 3 mice in Figure S4 are the same 3 mice in the main figure. There should be also an analysis of movement as a function of time to see differences. Also for Hits and FAs. I give some more details below. In general, most of the results can be explained by the fact that as mice gain expertise, they move more (also in CR during specific times) which leads to more activation in frontal cortex and more coordination with visual areas. More needs to be done in terms of analysis, or at least a mention of this in the text.

      Due to limitations in the experimental design and implementation, movement tracking was not performed during the electrophysiological recordings, and the 3 mice shown in Figure S4 were from a separate group. We have carefully examined the temporal profiles of mouse movements and found that they did not fully match the rank dynamics, and we will add these results and related discussion in the revised manuscript. However, we acknowledge that without synchronized movement recordings in the main dataset, we cannot fully disentangle movement-related neural activity from task-related signals. We will make this limitation explicit in the revised manuscript and discuss it as a potential confound, along with possible approaches to address it in future work.

      (3) Most of the figures are over-detailed, and it is hard to understand the take-home message. Although the text is written succinctly and rather short, the figures are mostly overwhelming, especially Figures 4-7. For example, Figure 4 presents 24 brain plots! For rank input and output rank during early and late stim and response periods, for early and expert and their difference. All in the same colormap. No significance shown at all. The Δrank maps for all cases look essentially identical across conditions. The division into early and late time periods is not properly justified. But the main take home message is positive Δrank in OFC, V2M, V1 and negative Δrank in ThalMD and Str. In my opinion, one trio map is enough, and the rest could be bumped to the Supplementary section, if at all. In general, the figures in several cases do not convey the main take home messages. See more details below.

      We thank the reviewer for this valuable critique. The statistical significance corresponding to the brain plots (Figure 4 and Figure 5) was presented in Figure S3 and S5, but we agree that the figure can be simplified to focus on the key results. In the revised manuscript, we will condense these figures to focus on the most important comparisons and relocate secondary plots to the Supplementary section. This will make the visual presentation more concise and the take-home message clearer.

      (4) The analysis is sometimes not intuitive enough. For example, the rank analysis of input and output rank seemed a bit over complex. Figure 3 was hard to follow (although a lot of effort was made by the authors to make it clearer). Was there any difference between the output and input analysis? Also, the time period seems redundant sometimes. Also, there are other network analysis that can be done which are a bit more intuitive. The use of rank within the 10 areas was not the most intuitive. Even a dimensionality reduction along with clustering can be used as an alternative. In my opinion, I don't think the authors should completely redo their analysis, but maybe mention the fact that other analyses exist

      We appreciate the reviewer’s comment. In brief, the input- and output-rank analyses yielded largely similar patterns across regions in CR trials, although some differences were observed in certain areas (e.g., striatum in Hit trials) where the magnitude of rank change was not identical between input and output measures. We agree that the division into multiple time periods sometimes led to redundant results; we will combine overlapping results in the revision to improve clarity.

      We did explore dimensionality reduction applied to the ranking data. However, the results were not intuitive and required additional interpretation, which did not bring more insights. Still, we acknowledge that other analysis approaches might provide complementary insights. While we do not plan to completely reanalyze the dataset at this stage, we will include a discussion of these alternative methods and their potential advantages in the revised manuscript.

      Reviewer #3 (Public review):

      Weaknesses:

      The weakness is also related to the strength provided by the method. It is demonstrated in the original method that this approach in principle can track individual units for four months (Luan et al, 2017). The authors have not shown chronically tracked neurons across learning. Without demonstrating that and taking advantage of analyzing chronically tracked neurons, this approach is not different from acute recording across multiple days during learning. Many studies have achieved acute recording across learning using similar tasks. These studies have recorded units from a few brain areas or even across brain-wide areas.

      We appreciate the reviewer’s important point. We did attempt to track the same neurons across learning in this project. However, due to the limited number of electrodes implanted in each brain region, the number of chronically tracked neurons in each region was insufficient to support statistically robust analyses. Concentrating probes in fewer regions would allow us to obtain enough units tracked across learning in future studies to fully exploit the advantages of this method.

      Another weakness is that major results are based on analyses of functional connectivity that is calculated using the cross-correlation score of spiking activity (TSPE algorithm). Functional connection strength across areas is then ranked 1-10 based on relative strength. Without ground truth data, it is hard to judge the underlying caveats. I'd strongly advise the authors to use complementary methods to verify the functional connectivity and to evaluate the mesoscale change in subnetworks. Perhaps the authors can use one key piece of anatomical information, i.e. the cortex projects to the striatum, while the striatum does not directly affect other brain structures recorded in this manuscript.

      We agree that the functional connectivity measured in this study relies on statistical correlations rather than direct anatomical connections. We plan to test the functional connection data with shorter cross-correlation delay criteria to see whether the results are consistent with anatomical connections and whether the original findings still hold.
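
      As a rough illustration of what a shorter delay criterion and the 1-10 ranking involve, the sketch below computes a directional, lag-restricted correlation between two binned spike trains and then ranks connection strengths. It is not the TSPE algorithm used in the study; the function names, the use of Pearson correlation, and the region labels are assumptions made only for this example.

```python
from math import sqrt
from typing import Dict, Sequence


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation; returns 0.0 if either input has no variance."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / sqrt(sxx * syy)


def peak_lagged_correlation(pre: Sequence[int], post: Sequence[int],
                            max_lag: int) -> float:
    """Largest positive correlation between pre[t] and post[t + lag].

    Only lags 1..max_lag (in bins) are scanned, so a smaller max_lag is a
    stricter delay criterion. Assumes max_lag < len(pre) == len(post).
    """
    best = 0.0
    for lag in range(1, max_lag + 1):
        r = pearson(pre[:len(pre) - lag], post[lag:])
        best = max(best, r)
    return best


def rank_strengths(strengths: Dict[str, float]) -> Dict[str, int]:
    """Rank connections so that 1 is the strongest."""
    ordered = sorted(strengths, key=strengths.get, reverse=True)
    return {name: i + 1 for i, name in enumerate(ordered)}


# Hypothetical binned spike counts: `post` repeats `pre` one bin later.
pre = [1, 0, 0, 1, 0, 1, 0, 0, 1, 0]
post = [0, 1, 0, 0, 1, 0, 1, 0, 0, 1]
forward = peak_lagged_correlation(pre, post, max_lag=2)
backward = peak_lagged_correlation(post, pre, max_lag=1)
```

      Because the measure is directional, comparing `forward` with `backward` for a cortex-striatum pair is one simple way to check whether the inferred direction agrees with the known anatomy.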

    1. Author response:

      We thank the reviewers for their thoughtful public feedback. Our revision will clarify scope and methods/statistics, as well as streamline the narrative so the central message is clear: wild-type flies exhibit anticipatory alignment of fuel selection with circadian time, whereas short-sleep and clock mutants show reactive or misaligned metabolism under our conditions.

      Major conceptual and experimental revisions:

      (1) We will define “anticipatory” (clock-aligned, pre-emptive substrate choice) and “reactive” (post-hoc substrate shifts) up front and use these terms consistently. We will clearly distinguish diurnal (LD) from circadian (DD) regulation and avoid implying that DD abolishes rhythmicity. Claims will be limited to the tested genotypes (fmn, sss, and per<sup>01</sup>) without generalizing to all forms of sleep loss or to mammals (although we will speculate in the discussion about translation and generalizability). We will temper language around external entrainment in DD to “contributes strongly under our conditions in flies.”

      (2) We will expand the respirometry and rhythmicity sections (RAIN/JTK parameters, period/phase outputs, multiple-testing control). We will clarify that each measurement is an average of 300 flies per genotype (25 flies/chamber, 4 chambers/experiment, 3 experimental days) and specify the chamber as the experimental unit with n and error structure in each figure legend. For metabolomics–respirometry correlations, we will briefly describe dataset parameters, time-matching across ZT, normalization, Spearman correlations, and lag interpretation.

      (3) We are performing additional experimental measurements through tissue respirometry of gut tissues and ROS staining to support our claims of “mitochondrial stress” in the short sleeping mutants. We note that this has already been shown for fmn in Vaccaro et al (Cell, 2020) and we will extend this to the other mutants studied in our work.
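
      The lagged Spearman analysis mentioned in point (2) can be sketched as follows. This is a minimal illustration under stated assumptions, not the pipeline used in the study: the function names are hypothetical, both series are assumed to be already normalized and sampled at the same ZT time points, and a positive lag is defined here as respiration following the metabolite.

```python
from math import sqrt
from typing import Dict, List, Sequence


def average_ranks(values: Sequence[float]) -> List[float]:
    """1-based ranks, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation; returns 0.0 if either input has no variance."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / sqrt(sxx * syy)


def spearman(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman's rho computed as the Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


def lagged_spearman(metabolite: Sequence[float],
                    respiration: Sequence[float],
                    max_lag: int) -> Dict[int, float]:
    """Return {lag: rho} for lags -max_lag..max_lag (in time points)."""
    n = len(metabolite)
    result = {}
    for lag in range(-max_lag, max_lag + 1):
        if lag >= 0:
            x, y = metabolite[:n - lag], respiration[lag:]
        else:
            x, y = metabolite[-lag:], respiration[:n + lag]
        result[lag] = spearman(x, y)
    return result


# Made-up series: respiration repeats the metabolite one time point later.
metabolite = [1, 3, 2, 5, 4, 6, 8, 7]
respiration = [0, 1, 3, 2, 5, 4, 6, 8]
rho_by_lag = lagged_spearman(metabolite, respiration, max_lag=2)
best_lag = max(rho_by_lag, key=rho_by_lag.get)
```

      In this toy case the correlation peaks at a lag of one time point. With real data, each lag shortens the overlapping window, so the number of time points per lag should be reported alongside the correlation.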

      Reviewer-specific points

      Reviewer #1.

      We will clarify the circadian/diurnal framing, fully report rhythmicity analyses (parameters, n, q-values, phases), and better explain the metabolomics-respiration coupling with a concise workflow figure and supplementary table. The conclusion that sleep and clock systems align substrate selection with energy demand will be presented as supported under our tested conditions and positioned as groundwork for future mechanistic studies.

      Reviewer #2.

      We will state explicitly that findings may be gene-specific and avoid inferring generality to all sleep loss. We will soften cross-species language about external entrainment and add a brief note on species differences. For behavioral context (activity/feeding/sleep in fmn and sss), we will cite our related manuscript in revision (Malik et al, https://www.biorxiv.org/content/10.1101/2023.10.30.564837v2) in which we have measured both activity and feeding for fmn, sss, and wt flies. We will add a concise description of LC-MS processing and pathway analysis and define “anticipatory”/“reactive” early, using them consistently.

      Reviewer #3.

      We acknowledge that metabolomics were repurposed and emphasize the novelty of integrating continuous VCO2 and VO2 respirometry with temporal lag analysis. We will report replication clearly (chambers as the unit, n per genotype) and acknowledge locomotor activity as a potential confound, pointing to the related manuscript (Malik et al) for independent activity/feeding measurements and experimental measures of mitochondrial stress as outlined above. We will also further note that only males were studied, outlining this as a limitation and a future direction.

    1. Author response:

      The following is the authors’ response to the original reviews.

      Reviewer #1 (Public review): 

      The authors present a substantial improvement to their existing tool, MorphoNet, intended to facilitate assessment of 3D+t cell segmentation and tracking results, and curation of high-quality analysis for scientific discovery and data sharing. These tools are provided through a user-friendly GUI, making them accessible to biologists who are not experienced coders. Further, the authors have re-developed this tool to be a locally installed piece of software instead of a web interface, making the analysis and rendering of large 3D+t datasets more computationally efficient. The authors evidence the value of this tool with a series of use cases, in which they apply different features of the software to existing datasets and show the improvement to the segmentation and tracking achieved. 

      While the computational tools packaged in this software are familiar to readers (e.g., cellpose), the novel contribution of this work is the focus on error correction. The MorphoNet 2.0 software helps users identify where their candidate segmentation and/or tracking may be incorrect. The authors then provide existing tools in a single user-friendly package, lowering the threshold of skill required for users to get maximal value from these existing tools. To help users apply these tools effectively, the authors introduce a number of unsupervised quality metrics that can be applied to a segmentation candidate to identify masks and regions where the segmentation results are noticeably different from the majority of the image. 

      This work is valuable to researchers who are working with cell microscopy data that requires high-quality segmentation and tracking, particularly if their data are 3D time-lapse and thus challenging to segment and assess. The MorphoNet 2.0 tool that the authors present is intended to make the iterative process of segmentation, quality assessment, and re-processing easier and more streamlined, combining commonly used tools into a single user interface.   

      We sincerely thank the reviewer for their thorough and encouraging evaluation of our work. We are grateful that they highlighted both the technical improvements of MorphoNet 2.0 and its potential impact for the broader community working with complex 3D+t microscopy datasets. We particularly appreciate the recognition of our efforts to make advanced segmentation and tracking tools accessible to non-expert users through a user-friendly and locally installable interface, and for pointing out the importance of error detection and correction in the iterative analysis workflow. The reviewer’s appreciation of the value of integrating unsupervised quality metrics to support this process is especially meaningful to us, as this was a central motivation behind the development of MorphoNet 2.0. We hope the tool will indeed facilitate more rigorous and reproducible analyses, and we are encouraged by the reviewer’s positive assessment of its utility for the community.

      One of the key contributions of the work is the unsupervised metrics that MorphoNet 2.0 offers for segmentation quality assessment. These metrics are used in the use cases to identify low-quality instances of segmentation in the provided datasets, so that they can be improved with plugins directly in MorphoNet 2.0. However, not enough consideration is given to demonstrating that optimizing these metrics leads to an improvement in segmentation quality. For example, in Use Case 1, the authors report their metrics of interest (Intensity offset, Intensity border variation, and Nuclei volume) for the uncurated silver truth, the partially curated and fully curated datasets, but this does not evidence an improvement in the results. Additional plotting of the distribution of these metrics on the Gold Truth data could help confirm that the distribution of these metrics now better matches the expected distribution. 

      Similarly, in Use Case 2, visual inspection leads us to believe that the segmentation generated by the Cellpose + Deli pipeline (shown in Figure 4d) is an improvement, but a direct comparison of agreement between segmented masks and masks in the published data (where the segmentations overlap) would further evidence this. 

      We agree that demonstrating the correlation between metric optimization and real segmentation improvement is essential. We have added new analysis comparing the distributions of the unsupervised metrics with the gold truth data before and after curation. Additionally, we provided overlap scores where ground truth annotations are available, confirming the improvement. We also explicitly discussed the limitation of relying solely on unsupervised metrics without complementary validation.
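
      For readers unfamiliar with overlap scores, the sketch below shows two common choices, intersection over union (IoU) and the Dice coefficient, computed for a single object represented as a set of voxel coordinates. The response does not state which overlap score was used, so both functions and the toy masks are assumptions made for illustration only.

```python
from typing import Set, Tuple

Voxel = Tuple[int, int, int]


def iou(mask_a: Set[Voxel], mask_b: Set[Voxel]) -> float:
    """Intersection over union; defined here as 0.0 when both masks are empty."""
    union = mask_a | mask_b
    if not union:
        return 0.0
    return len(mask_a & mask_b) / len(union)


def dice(mask_a: Set[Voxel], mask_b: Set[Voxel]) -> float:
    """Dice coefficient; defined here as 0.0 when both masks are empty."""
    total = len(mask_a) + len(mask_b)
    if total == 0:
        return 0.0
    return 2 * len(mask_a & mask_b) / total


# Toy example: one ground-truth nucleus and its mask before and after curation.
ground_truth = {(0, 0, 0), (0, 0, 1), (0, 1, 0), (0, 1, 1)}
before = {(0, 0, 1), (0, 1, 0), (1, 1, 1), (1, 1, 0)}
after = {(0, 0, 0), (0, 0, 1), (0, 1, 0)}

iou_before = iou(before, ground_truth)
iou_after = iou(after, ground_truth)
```

      A score of this kind is independent of the unsupervised metrics that guided the curation, which is why it can serve as the complementary validation described above.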

      We would appreciate the authors addressing the risk of decreasing the quality of the segmentations by applying circular logic with their tool; MorphoNet 2.0 uses unsupervised metrics to identify masks that do not fit the typical distribution. A model such as StarDist can be trained on the "good" masks to generate more masks that match the most common type. This leads to a more homogeneous segmentation quality, without consideration for whether these metrics actually optimize the segmentation.

      We thank the reviewer for this important and insightful comment. It raises a crucial point regarding the risk of circular logic in our segmentation pipeline. Indeed, relying on unsupervised metrics to select “good” masks and using them to train a model like StarDist could lead to reinforcing a particular distribution of shapes or sizes, potentially filtering out biologically relevant variability. This homogenization may improve consistency with the chosen metrics, but not necessarily with the true underlying structures.

      We fully agree that this is a key limitation to be aware of. We have revised the manuscript to explicitly discuss this risk, emphasizing that while our approach may help improve segmentation quality according to specific criteria, it should be complemented with biological validation and, when possible, expert input to ensure that important but rare phenotypes are not excluded.

      In Use case 5, the authors include details that the errors were corrected by "264 MorphoNet plugin actions ... in 8 hours actions [sic]". The work would benefit from explaining whether this is 8 hours of human work, trying plugins and iteratively improving, or 8 hours of compute time to apply the selected plugins. 

      We clarified that the “8 hours” refer to human interaction time, including exploration, testing, and iterative correction using plugins. 

      Reviewer #2 (Public review):

      Summary: 

      This article presents Morphonet 2.0, a software designed to visualise and curate segmentations of 3D and 3D+t data. The authors demonstrate their capabilities on five published datasets, showcasing how even small segmentation errors can be automatically detected, easily assessed, and corrected by the user. This allows for more reliable ground truths, which will in turn be very much valuable for analysis and training deep learning models. Morphonet 2.0 offers intuitive 3D inspection and functionalities accessible to a non-coding audience, thereby broadening its impact. 

      Strengths: 

      The work proposed in this article is expected to be of great interest to the community by enabling easy visualisation and correction of complex 3D(+t) datasets. Moreover, the article is clear and well written, making MorphoNet more likely to be used. The goals are clearly defined, addressing an undeniable need in the bioimage analysis community. The authors use a diverse range of datasets, successfully demonstrating the versatility of the software. 

      We would also like to highlight the great effort that was made to clearly explain which type of computer configurations are necessary to run the different datasets and how to find the appropriate documentation according to your needs. The authors clearly carefully thought about these two important problems and came up with very satisfactory solutions. 

      We would like to sincerely thank the reviewer for their positive and thoughtful feedback. We are especially grateful that they acknowledged the clarity of the manuscript and the potential value of MorphoNet 2.0 for the community, particularly in facilitating the visualization and correction of complex 3D(+t) datasets. We also appreciate the reviewer’s recognition of our efforts to provide detailed guidance on hardware requirements and access to documentation—two aspects we consider crucial to ensuring the tool is both usable and widely adopted. Their comments are very encouraging and reinforce our commitment to making MorphoNet 2.0 as accessible and practical as possible for a broad range of users in the bioimage analysis community.

      Weaknesses: 

      There is still one concern: the quantification of the improvement of the segmentations in the use cases and, therefore, the quantification of the potential impact of the software. While it appears hard to quantify the quality of the correction, the proposed work would be significantly improved if such metrics could be provided. 

      The authors show some distributions of metrics before and after segmentations to highlight the changes. This is a great start, but there seem to be two shortcomings: first, the comparison and interpretation of the different distributions does not appear to be trivial. It is therefore difficult to judge the quality of the improvement from these. Maybe an explanation in the text of how to interpret the differences between the distributions could help. A second shortcoming is that the before/after metrics displayed are the metrics used to guide the correction, so, by design, the scores will improve, but does that accurately represent the improvement of the segmentation? It seems to be the case, but it would be nice to maybe have a better assessment of the improvement of the quality. 

      We thank the reviewer for this constructive and important comment. We fully agree that assessing the true quality improvement of segmentation after correction is a central and challenging issue. While we initially focused on changes in the unsupervised quality metrics to illustrate the effect of the correction, we acknowledge that interpreting these distributions is not always straightforward, and that relying solely on the metrics used to guide the correction introduces an inherent bias in the evaluation.

      To address the first point, we revised the manuscript to provide clearer guidance on how to interpret the changes in metric distributions before and after correction, with additional examples to make this interpretation more intuitive.

      Regarding the second point, we agree that independent, external validation is necessary to confirm that the segmentation has genuinely improved. To this end, we included additional assessments using complementary evaluation strategies on selected datasets where ground truth was accessible, comparing pre- and post-correction segmentations with an independent reference. These results reinforce the idea that corrections guided by unsupervised metrics generally lead to more accurate segmentations, but we also emphasize their limitations and the need for biological validation in real-world cases.

      Reviewer #3 (Public review): 

      Summary: 

      A very thorough technical report of a new standalone, open-source software for microscopy image processing and analysis (MorphoNet 2.0), with a particular emphasis on automated segmentation and its curation to obtain accurate results even with very complex 3D stacks, including timelapse experiments. 

      Strengths: 

      The authors did a good job of explaining the advantages of MorphoNet 2.0, as compared to its previous web-based version and to other software with similar capabilities. What I particularly found more useful to actually envisage these claimed advantages is the five examples used to illustrate the power of the software (based on a combination of Python scripting and the 3D game engine Unity). These examples, from published research, are very varied in both types of information and image quality, and all have their complexities, making them inherently difficult to segment. I strongly recommend the readers to carefully watch the accompanying videos, which show (although not thoroughly) how the software is actually used in these examples. 

      We sincerely thank the reviewer for their thoughtful and encouraging feedback. We are particularly pleased that the reviewer appreciated the comparative analysis of MorphoNet 2.0 with both its earlier version and existing tools, as well as the relevance of the five diverse and complex use cases we selected. Demonstrating the software’s versatility and robustness across a variety of challenging datasets was a key goal of this work, and we are glad that this aspect came through clearly. We also appreciate the reviewer’s recommendation to watch the accompanying videos, which we designed to provide a practical sense of how the tool is used in real-world scenarios. Their positive assessment is highly motivating and reinforces the value of combining scripting flexibility with an interactive 3D interface.

      Weaknesses: 

      Being a technical article, the only possible comments are on how methods are presented, which is generally adequate, as mentioned above. In this regard, and in spite of the presented examples (chosen by the authors, who clearly gave them a deep thought before showing them), the only way in which the presented software will prove valuable is through its use by as many researchers as possible. This is not a weakness per se, of course, but just what is usual in this sort of report. Hence, I encourage readers to download the software and give it time to test it on their own data (which I will also do myself).   

      We fully agree that the true value of MorphoNet 2.0 will be demonstrated through its practical use by a wide range of researchers working with complex 3D and 3D+t datasets. In this regard, we improved the user documentation and provided a set of example datasets to help new users quickly familiarize themselves with the platform. We are also committed to maintaining and updating MorphoNet 2.0 based on user feedback to further support its usability and impact.

      In conclusion, I believe that this report is fundamental because it will be the major way of initially promoting the use of MorphoNet 2.0 by the objective public. The software itself holds the promise of being very impactful for the microscopists' community. 

      Reviewer #1 (Recommendations for the authors): 

      (1) In Use Case 1, when referring to Figure 3a, they describe features of 3b? 

      We corrected the mismatch between Figure 3a and 3b descriptions.

      (2) In Figure 3g-I, columns for Curated Nuclei and All Nuclei appear to be incorrectly labelled, and should be the other way around. 

      We corrected the swapped labels for “Curated Nuclei” and “All Nuclei.”

      (3) Some mention of how this will be supported in the future would be of interest. 

      We added a note on long-term support plans.

      (4) Could Morphonet be rolled into something like napari and integrated into its environment with access to its plugins and tools? 

      We thank the reviewer for this pertinent suggestion. We fully recognize the growing importance of interoperability within the bioimage analysis community, and we have been working on establishing a bridge between MorphoNet and napari to enable data exchange and complementary use of the two tools. As a platform, all new developments are first evaluated by our beta testers before being officially released to the user community and subsequently documented. The interoperability component is still under active development and will be announced shortly in a beta-testing phase. For this reason, we were not able to include it in the present manuscript, but we plan to document it in a future release.

      (5) Can meshes be extracted/saved in another format? 

      We agreed that the ability to extract and save meshes in standard formats was highly useful for interoperability with other tools. We implemented this feature in the new version of MorphoNet, allowing users to export meshes in commonly used formats such as OBJ or STL.

      Reviewer #2 (Recommendations for the authors): 

      As a comment, since the authors mentioned the recent progress in 3D segmentation of various biological components, including organelles, it could be interesting to have examples of Morphonet applied to investigate subcellular structures. These present different challenges in visualization and quantification due to their smaller scale.

      We thank the reviewer for this insightful suggestion. We fully agree that applying MorphoNet 2.0 to the analysis of sub-cellular structures is a promising direction, particularly given the specific challenges these datasets present in terms of resolution, visualization, and quantification. While our current use cases focus on cellular and tissue-level segmentation, we are actively interested in extending the applicability of the tool to finer scales. We are currently exploring plugins for spot detection and curation in single-molecule FISH data. However, this requires more time to properly validate relevant use cases, and we plan to include this functionality in the next release.

      Another comment is that the authors briefly mention two other state-of-the-art softwares (namely FIJI and napari) but do not really position MorphoNet against them. The text would likely benefit from such a comparison so the users can better decide which one to use or not. 

      We agree that providing a clearer comparison between MorphoNet 2.0 and other widely used tools such as FIJI and napari greatly benefits readers and potential users. In response, we included a new paragraph in the supplementary materials of the revised manuscript, highlighting the main features, strengths, and limitations of each tool in the context of 3D+t segmentation, visualization, and correction workflows. This addition helps users better understand the positioning of MorphoNet 2.0 and make informed choices based on their specific needs.

      Minor comments: 

      L 439: The Deli plugin is mentioned but not introduced in the main text; it could be helpful to have an idea of what it is without having to dive into the supplementary material. 

      We included a brief description in the main text and thoroughly revised the help pages to improve clarity.

      Figure 4: It is not clear how the potential holes created by the removal of objects are handled. Are the empty areas filled by neighboring cells, for example, are they left empty? 

      We clarified this in the legend of Figure 4.

      Please remove from the supplementary the use cases that are already in the main text. 

      We cleaned up redundant use case descriptions.

      Typos: 

      L 22: the end of the sentence is missing. 

      L 51: There are two "."   

      L 370: replace 'et' with 'and'.   

      L 407-408, Figure 3: panels g-i, the columns 'curated nuclei' and 'all nuclei' seem to be inverted. 

      L 549: "four 4". 

      Reviewer #3 (Recommendations for the authors): 

      Dear Authors, what follows are "minor comments" (the only sort of comment I have for this nice report): 

      Minor issues: 

      (1) Not being a user of MorphoNet, I found that reading the manuscript was a bit hard due to the several names of plugins or tools that are mentioned, many times without a clear explanation of what they do. One way of improving this could be to add a table, a sort of glossary, with those names, a brief explanation of what they are, and a link to their "help" page on the web. 

      We understand that the manuscript may be difficult to follow for readers unfamiliar with MorphoNet, especially due to the numerous plugin and tool names referenced. To address this, we carried out a complete overhaul of the help pages to make them clearer, more structured, and easier to navigate.

      (2) Figure 4d, orthogonal view: It is claimed that this segmentation is correct according to the original intensity image, but it is not clear why some cells in the border actually appear a lot bigger than other cells in the embryo. It does look like an incomplete segmentation due to the poor image quality at the border. Whether this is the case or if the authors consider the contrary, it should be somehow explained/discussed in the figure legend or the main text. 

      We revised the figure legend and main text to acknowledge the challenge of segmenting peripheral regions with low signal-to-noise ratios and discussed how this affects segmentation.

      Small writing issues I could spot:   

      Line 247: there is a double point after "Sup. Mat..". 

      Line 329: probably a diagrammation error of the pdf I use to review, there is a loose sentence apparently related to a figure: "Vegetal view ofwith smoothness". 

      Line 393 (and many other places): avoid using numbers when it is not a parameter you are talking about, and the number is smaller than 10. In this case, it should be: "The five steps...". 

      Line 459: Is "opposite" referring to "Vegetal", like in g? In addition, it starts with lower lowercase. 

      Lines 540-541: Check if redaction is correct in "...projected the values onto the meshed dual of the object..." (it sounds obscure to me). 

      Lines 548-549: Same thing for "...included two groups of four 4 nuclei and one group of 3 fused nuclei.". 

      Line 637: Should it be "Same view as b"? 

      Line 646: "The property highlights..."? 

      Line 651: In the text, I have seen a "propagation plugin" named as "Prope", "Propa", and now "Propi". Are they all different? Is it a mistake? Please, see my first "Minor issue", which might help readers navigate through this sort of confusing nomenclature. 

      Line 702: I personally find the use of the term "eco-system" inappropriate in this context. We scientists know what an ecosystem is, and the fact that it has now become a fashionable word for politicians does not make it correct in any context. 

      We thank the reviewer for their careful reading of the manuscript and for pointing out these writing and typographic issues. We corrected all the mentioned points in the revised version, including punctuation, sentence clarity, consistent naming of tools (e.g., the propagation plugin), and appropriate use of terms such as “ecosystem.” We also appreciate the suggestion to avoid numerals for numbers under ten when not referring to parameters, and we ensured consistency throughout the text. These corrections improve the clarity and readability of the manuscript, and we are grateful for the reviewer’s attention to detail.

    1. Author Response:

      Thank you for forwarding these helpful and thoughtful reviews - at a time when the review process in some journals can be a bit of a 'bloodsport', it is refreshing to receive such constructive and excellent comments. We essentially agree with the key points the reviewers have made, and as an interim response provide clarification of two areas:

      1) As the reviewers highlighted, genome-wide analysis of ChIP-seq data from Foxc1 over-expression is indeed very worthwhile, and may offer insights for diverse malignancies where FOXC1 is over-expressed. We have a manuscript in preparation integrating this data set with ATAC- and RNA-seq data to identify genes transcriptionally regulated by elevated levels of Foxc1. In the interim, our full ChIP-seq data are available via the GEO accession number listed in the manuscript.

      2) Analysis in neuroblastoma cell lines and then xenografts is equally important. Experiments manipulating ARHGAP36 levels in human neuroblastoma cell lines are underway; however, a detailed mechanistic understanding of how ARHGAP36 influences neuroblastoma prognosis will take time, and lies beyond the scope of the current manuscript.

    1. Author response:

      Reviewer #1:

      We thank the reviewer for their thoughtful summary of this manuscript. It is important to note that DHA-PPQ did show antagonism in RSAs. In this modified RSA, 200 nM PPQ alone inhibited growth of PPQ-sensitive parasites by approximately 20%. If DHA and PPQ were additive, then we would expect that addition of 200 nM PPQ would shift the DHA dose-response curve to the left and result in a lower DHA IC50. Please refer to Figure 4a and b as examples of additive relationships in dose-response assays. We observed no significant shift in IC50 values between DHA alone and DHA + PPQ. This suggests antagonism, albeit not to the extent seen with CQ. We will modify the manuscript to emphasize this point. As the reviewer pointed out, it is fortunate that despite being antagonistic, clinically used artemisinin-4-aminoquinoline combinations are effective, provided that parasites are sensitive to the 4-aminoquinoline. It is possible that superantagonism is required to observe a noticeable effect on treatment efficacy (Sutherland et al. 2003 and Kofoed et al. 2003), but that classical antagonism may still have silent consequences. For example, if PPQ blocks some DHA activation, this might result in DHA-PPQ acting more like a pseudo-monotherapy. However, as the reviewer pointed out, while our data suggest that DHA-PPQ and AS-ADQ are “non-optimal” combinations, the clinical consequences of these interactions are unclear. We will modify the manuscript to emphasize the latter point.
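
      To make the additivity argument concrete, the sketch below works through one possible null model. It assumes Bliss independence (the two drugs reduce survival independently), a Hill-type DHA survival curve, and survival normalized to untreated controls. None of these choices is stated in the response, and the numerical IC50 is invented purely for illustration.

```python
def hill_survival(dose: float, ic50: float, hill: float = 1.0) -> float:
    """Fraction surviving drug A alone under a Hill dose-response model."""
    return 1.0 / (1.0 + (dose / ic50) ** hill)


def apparent_ic50_bliss(ic50: float, partner_survival: float,
                        hill: float = 1.0) -> float:
    """Dose of drug A giving 50% combined survival under Bliss independence.

    Combined survival is modeled as S_A(dose) * partner_survival, so drug A
    alone only needs to bring survival down to 0.5 / partner_survival.
    """
    target = 0.5 / partner_survival
    if not 0.0 < target < 1.0:
        raise ValueError("partner drug alone already reaches 50% inhibition")
    return ic50 * (1.0 / target - 1.0) ** (1.0 / hill)


# Hypothetical numbers: DHA IC50 of 10 (arbitrary units) and a partner drug
# that alone leaves 80% survival (about 20% inhibition, as for 200 nM PPQ).
expected_ic50 = apparent_ic50_bliss(ic50=10.0, partner_survival=0.8)
combined_at_expected = hill_survival(expected_ic50, ic50=10.0) * 0.8
```

      Under these assumptions the expected combination IC50 is 0.6 times the IC50 of DHA alone, that is, a leftward shift. An unchanged IC50 would therefore fall short of this expectation, which is the sense in which the absence of a shift is read as antagonism. A different null model (for example Loewe additivity) or a different normalization would change the size of the expected shift.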

      While the Ac-H-FluNox and ubiquitin data point to a likely mechanism for DHA-quinoline antagonism, we agree that there are other possible mechanisms to explain this interaction. We will temper the title and manuscript to reflect these limitations. Though we tried to measure DHA activation in parasites directly, these attempts were unsuccessful. We acknowledge that the chemistry of DHA and Ac-H-FluNox activation is not identical and that caution should be taken when interpreting these data. Nevertheless, we believe that Ac-H-FluNox is the best currently available tool to measure “active heme” in live parasites and is the best available proxy to assess DHA activation in live parasites. Both in vitro and in parasite studies point to a role for CQ in modulating heme, though an exact mechanism will require further examination. Similar to the reviewer, we were perplexed by the differences observed between in vitro and in parasite assays with PPQ and MFQ. We proposed possible hypotheses to explain these discrepancies in the discussion section. Interestingly, our data correlate well with hemozoin inhibition assays in which all three antimalarials inhibit hemozoin formation in solution, but only CQ and PPQ inhibit hemozoin formation in parasites. In both assays, in-parasite experiments are likely to be more informative for mechanistic assessment.

      It remains unclear why K13 genotype influences RSA values, but not early ring DHA IC50 values. In K13<sup>WT</sup> parasites, both RSA values and DHA IC50 values were increased 3-5 fold upon addition of CQ. This suggests that CQ-mediated resistance is more robust than that conferred by K13 genotype. However, this does not necessarily suggest a different resistance mechanism. We acknowledge that in addition to modulating heme, it is possible that CQ may enhance DHA survival by promoting parasite stress responses. Future studies will be needed to test this alternative hypothesis. This limitation will be acknowledged in the manuscript. We will also address the reviewer’s point that other factors, including poor pharmacokinetic exposure, contributed to OZ439-PPQ treatment failure.

      Reviewer #2:

      We appreciate the positive feedback. We agree that there have been previous studies, many of which we cited, assessing interactions of these antimalarials. We also acknowledge that previous work, including our own, has shown that parasite genetics can alter drug-drug interactions. We will add the reviewer’s recommended citations to the list of references that we cited. Importantly, our work was unique not only for utilizing a pulsing format, but also for revealing a superantagonistic phenotype, assessing interactions in an RSA format, and investigating a mechanism to explain these interactions. We agree with the reviewer that implications drawn from this in vitro work should be stated cautiously, but hope that this work contributes another dimension to critical thinking about drug-drug interactions for future combination therapies. We will modify the manuscript to temper any unintended recommendations or implications.

      The reviewer notes that we conclude “artemisinins are predominantly activated in the cytoplasm”. We recognize that the site of artemisinin activation is contentious. We were very clear to state that our data combined with others suggest that artemisinins can be activated in the parasite cytoplasm. We did not state that this is the primary site of activation. We were clear to point out that technical limitations may prevent Ac-H-FluNox signal in the digestive vacuole, but determined that low pH alone could not explain the absence of a digestive vacuole signal.

      With regard to the “reproducibility” and “mechanistic definition” of superantagonism, we observed what we defined as a one-sided superantagonistic relationship for three different parasites (Dd2, Dd2 PfCRT<sup>Dd2</sup>, and Dd2 K13<sup>R539T</sup>) for a total of nine independent replicates. In the text, we define that these isoboles are unique in that they had mean ΣFIC50 values > 2.4 and peak ΣFIC50 values >4 with points extending upward instead of curving back to the axis. As further evidence of the reproducibility of this relationship, we show that CQ has a significant rescuing effect on parasite survival to DHA as assessed by RSAs and IC50 values in early rings.
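      The thresholds quoted above can be illustrated with a minimal sketch. It assumes the standard isobologram definition of ΣFIC50 (the sum, over both drugs, of the IC50 in combination divided by the IC50 alone), for which an additive pair gives a value near 1. The function names and all numbers are hypothetical, and the sketch does not capture the shape criterion (points extending upward rather than curving back to the axis).

```python
def sum_fic50(ic50_a_combo, ic50_a_alone, ic50_b_combo, ic50_b_alone):
    """Sum of fractional IC50 values for one fixed-ratio mixture of drugs A and B."""
    return ic50_a_combo / ic50_a_alone + ic50_b_combo / ic50_b_alone


def meets_superantagonism_cutoffs(sum_fics, mean_cutoff=2.4, peak_cutoff=4.0):
    """Apply the two numeric cutoffs quoted in the response (mean > 2.4, peak > 4)."""
    mean_value = sum(sum_fics) / len(sum_fics)
    return mean_value > mean_cutoff and max(sum_fics) > peak_cutoff


# Hypothetical isobole: one sum-FIC50 value per mixing ratio (illustrative only).
example_isobole = [1.8, 3.1, 4.6, 2.9]
additive_point = sum_fic50(5.0, 10.0, 50.0, 100.0)  # 0.5 + 0.5

print(additive_point)
print(meets_superantagonism_cutoffs(example_isobole))
```

      Only the two cutoffs come from the response; everything else is an illustration of how such a classification could be scripted.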

      Reviewer #3:

      We thank the reviewer for their positive feedback. We acknowledge that no combinations tested in this manuscript were synergistic. However, two combinations, DHA-MFQ and DHA-LM, were additive, which provides context for interpreting antagonistic relationships. We have previously reported synergistic and additive isobolograms for peroxide-proteasome inhibitor combinations using this same pulsing format (Rosenthal and Ng 2021). These published results will be cited in the manuscript.

      We believe that these findings are specific to 4-aminoquinoline-peroxide combinations, and that these findings cannot be generalized to antimalarials with different mechanisms of action. Note that the aryl amino alcohols, MFQ and LM, were additive with DHA. Since the mechanisms of action of MFQ and LM are poorly understood, it is difficult to speculate on a mechanism underlying these interactions.

      We agree with the reviewer that while the heme probe may provide some mechanistic insight to explain DHA-quinoline interactions, there is much more to learn about CQ-heme chemistry, particularly within parasites.

      The focus of this manuscript was to add a new dimension to considerations about pairings for combination therapies. It is outside the scope of this manuscript to suggest alternative combinations. However, we agree that synergistic combinations would likely be more strategic clinically.

      An in vitro setup allows us to eliminate many confounding variables in order to directly assess the impact of partner drugs on DHA activity. However, we agree that in vivo conditions are incredibly more complex, and explicitly state this.

      We agree that in the future, modeling studies could provide insight into how antagonism may contribute to real-world efficacy. This is outside the scope of our studies.

    1. Author response:

      We thank the reviewers and editors for their assessment and for identifying the main issues of our framework for automated classification of social interactions in animal groups. Based on the reviewers’ feedback, we would like to briefly summarize three areas in which we aim to improve both our manuscript and the software package.

      Firstly, we will revise our manuscript to better define the scope of our classification pipeline. As reviewer #1 correctly points out, our framework is built around the scoring and analysis of dyadic interactions within groups, rather than emergent group-level or collective behavior. This structure more faithfully reflects the way that researchers score social behaviors within groups, following focal individuals while logging all directed interactions of interest (e.g., grooming, aggression or courtship), and with whom these interactions are performed. Indeed, animal groups are often described as social networks of interconnected nodes (individuals), in which the connections between these nodes are derived from pairwise metrics, for example proximity or interaction frequency. For this reason, vassi does not aim to classify higher-level group behavior (i.e., the emergent, collective state of all group members) but rather the pair-wise interactions typically measured. Our classification pipeline replicates this structure, and therefore produces raw data that is familiar to researchers that study social animal groups with a focus on pairwise interactions. Since this may be seen as a limitation when studying group-level behavior (with more than two individuals involved, usually undirected), we will make this distinction between different forms of social interaction more clear in the introduction.

      Secondly, we acknowledge the low performance of our classification pipeline on the cichlid group dataset. We included analyses in the first version of our manuscript that, in our opinion, can justify the use of our pipeline in such cases (comparison to proximity networks), but we understand the reviewers' concerns. Based on their comments, we will perform additional analyses to further assess whether the use of vassi on this dataset results in valid behavioral metrics. This may, for example, include a comparison of per-individual SNA metrics between pipeline results and ground truth, or equivalent comparisons on the level of group structure (e.g., hierarchy derived from aggression counts). We thank reviewer #2 for these suggestions. As the reviewers further point out, there is no consensus yet on when the performance of behavioral classifiers is sufficient for reliable downstream analyses, and although this manuscript does not have the scope to discuss this for the field, it may help to substantiate discussion in future research.

      Finally, we appreciate the reviewers' feedback on vassi as a methodological framework and will address the remaining software-related issues by improving the documentation and accessibility of our example scripts. This will lower the technical hurdle to using vassi in further research. Additionally, we aim to incorporate a third dataset to demonstrate how our framework can be used for iterative training on a sparsely annotated dataset of groups, while broadening the taxonomic scope of our manuscript.

  2. Oct 2025
    1. Author response:

      eLife Assessment

      This study provides useful insights into the ways in which germinal center B cell metabolism, particularly lipid metabolism, affects cellular responses. The authors use sophisticated mouse models to demonstrate that ether lipids are relevant for B cell homeostasis and efficient humoral responses. Although the data were collected from in vitro and in vivo experiments and analyzed using solid and validated methodology, more careful experiments and extensive revision of the manuscript will be required to strengthen the authors' conclusions.

      In addition to praise for the eLife system and transparency (public posting of the reviews; along with an opportunity to address them), we are grateful for the decision of the Editors to select this submission for in-depth peer review and to the referees for the thoughtful and constructive comments.

      In overview, we mostly agree with the specific comments and evaluation of strengths of what the work adds as well as with indications of limitations and caveats that apply to the breadth of conclusions. One can view these as a combination of weaknesses, of instances of reading more into the work than what it says, and of important future directions opened up by the findings we report. Regarding the positives, we appreciate the reviewers' appraisal that our work unveils a novel mechanism in which the peroxisomal enzyme PexRAP mediates B cell intrinsic ether lipid synthesis and promotes a humoral immune response. We are gratified by a recognition that a main contribution of the work is to show that a spatial lipidomic analysis can set the stage for discovery of new molecular processes in biology that are supported by using 2-dimensional imaging mass spectrometry techniques and cell type specific conditional knockout mouse models.

      By and large, the technical issues are items we will strive to improve. Ultimately, over-arching issues in research publications in this epoch are the questions "when is enough enough?" and "what, or how much, advance will be broadly important in moving biological and biomedical research forward?" It appears that one limitation troubling the reviews centers on whether the mechanism of increased ROS and multi-modal death - supported most by the in vitro evidence - applies to germinal center B cells in situ, versus a mechanism for decreased GC that mostly applies to the pre-GC clonal amplification (or recruitment into GC). Overall, we agree that this leap could benefit from additional evidence, but as resources ended we instead leave that question for the future, other than the findings with S1pr2-CreERT2-driven deletion leading to fewer GC B cells. While we strove to be very careful in framing such a connection as an inference in the posted manuscript, we will revisit the matter by rechecking the wording when revising the text after trying to get some specific evidence.

      In the more granular part of this provisional response (below), we will outline our plan prompted by the reviewers but also comment on a few points of disagreement or refinement (longer and more detailed explanation). The plan includes more detailed analysis of B cell compartments, surface level of immunoglobulin, Tfh cell population, a refinement of GC B cell markers, and the ex vivo GC B cell analysis for ROS, proliferation, and cell death. We will also edit the text to provide more detailed information and clarify our interpretation to prevent confusion about our results. At a practical level, some evidence likely is technologically impractical, and an unfortunate determinant is the lack of further sponsored funding for further work. The detailed point-by-point response to the reviewers’ comments is below.

      Public Reviews:

      Reviewer #1 (Public review):

      In this manuscript, Sung Hoon Cho et al. present a novel investigation into the role of PexRAP, an intermediary in ether lipid biosynthesis, in B cell function, particularly during the Germinal Center (GC) reaction. The authors profile lipid composition in activated B cells both in vitro and in vivo, revealing the significance of PexRAP. Using a combination of animal models and imaging mass spectrometry, they demonstrate that PexRAP is specifically required in B cells. They further establish that its activity is critical upon antigen encounter, shaping B cell survival during the GC reaction.

      Mechanistically, they show that ether lipid synthesis is necessary to modulate reactive oxygen species (ROS) levels and prevent membrane peroxidation.

      Highlights of the Manuscript:

      The authors perform exhaustive imaging mass spectrometry (IMS) analyses of B cells, including GC B cells, to explore ether lipid metabolism during the humoral response. This approach is particularly noteworthy given the challenge of limited cell availability in GC reactions, which often hampers metabolomic studies. IMS proves to be a valuable tool in overcoming this limitation, allowing detailed exploration of GC metabolism.

      The data presented is highly relevant, especially in light of recent studies suggesting a pivotal role for lipid metabolism in GC B cells. While these studies primarily focus on mitochondrial function, this manuscript uniquely investigates peroxisomes, which are linked to mitochondria and contribute to fatty acid oxidation (FAO). By extending the study of lipid metabolism beyond mitochondria to include peroxisomes, the authors add a critical dimension to our understanding of B cell biology.

      Additionally, the metabolic plasticity of B cells poses challenges for studying metabolism, as genetic deletions from the beginning of B cell development often result in compensatory adaptations. To address this, the authors employ an acute loss-of-function approach using two conditional, cell-type-specific gene inactivation mouse models: one targeting B cells after the establishment of a pre-immune B cell population (Dhrs7b^f/f, huCD20-CreERT2) and the other during the GC reaction (Dhrs7b^f/f; S1pr2-CreERT2). This strategy is elegant and well-suited to studying the role of metabolism in B cell activation.

      Overall, this manuscript is a significant contribution to the field, providing robust evidence for the fundamental role of lipid metabolism during the GC reaction and unveiling a novel function for peroxisomes in B cells.

      We appreciate these positive reactions and response, and agree with the overview and summary of the paper's approaches and strengths.

      However, several major points need to be addressed:

      Major Comments:

      Figures 1 and 2

      The authors conclude, based on the results from these two figures, that PexRAP promotes the homeostatic maintenance and proliferation of B cells. In this section, the authors first use a tamoxifen-inducible full Dhrs7b knockout (KO) and afterwards Dhrs7bΔ/Δ-B model to specifically characterize the role of this molecule in B cells. They characterize the B and T cell compartments using flow cytometry (FACS) and examine the establishment of the GC reaction using FACS and immunofluorescence. They conclude that B cell numbers are reduced, and the GC reaction is defective upon stimulation, showing a reduction in the total percentage of GC cells, particularly in the light zone (LZ).

      The analysis of the steady-state B cell compartment should also be improved. This includes a more detailed characterization of MZ and B1 populations, given the role of lipid metabolism and lipid peroxidation in these subtypes.

      Suggestions for Improvement:

      B Cell compartment characterization: A deeper characterization of the B cell compartment in non-immunized mice is needed, including analysis of Marginal Zone (MZ) maturation and a more detailed examination of the B1 compartment. This is especially important given the role of specific lipid metabolism in these cell types. The phenotyping of the B cell compartment should also include an analysis of immunoglobulin levels on the membrane, considering the impact of lipids on membrane composition.

      Although the manuscript is focused on post-ontogenic B cell regulation in Ab responses, we believe we will be able to polish a revised manuscript through addition of results of analyses suggested by this point in the review: measurement of surface IgM on and phenotyping of various B cell subsets, including MZB and B1 B cells, to extend the data in Supplemental Fig 1H and I. Depending on the level of support, new immunization experiments to score Tfh and analyze a few of their functional molecules as part of a B cell paper may be feasible.  

      - GC Response Analysis Upon Immunization: The GC response characterization should include additional data on the T cell compartment, specifically the presence and function of Tfh cells. In Fig. 1H, the distribution of the LZ appears strikingly different. However, the authors have not addressed this in the text. A more thorough characterization of centroblasts and centrocytes using CXCR4 and CD86 markers is needed.

      The gating strategy used to characterize GC cells (GL7+CD95+ in IgD− cells) is suboptimal. A more robust analysis of GC cells should be performed in total B220+CD138− cells.

      We first want to apologize for the mislabeling of LZ and DZ in Fig 1H. The greenish-yellow colored region (GL7<sup>+</sup> CD35<sup>+</sup>) indicates the DZ and the cyan-colored region (GL7<sup>+</sup> CD35<sup>+</sup>) indicates the LZ.

      As a technical note, we experienced high background noise with GL7 staining uniquely with PexRAP deficient (Dhrs7b<sup>f/f</sup>; Rosa26-CreER<sup>T2</sup>) mice (i.e., not WT control mice). The high background noise of GL7 staining was not observed in B cell specific KO of PexRAP (Dhrs7b<sup>f/f</sup>; huCD20-CreER<sup>T2</sup>). Two formal possibilities to account for this staining issue would be if either the expression of the GL7 epitope were repressed by PexRAP or the proper positioning of GL7<sup>+</sup> cells in germinal center region were defective in PexRAP-deficient mice (e.g., due to an effect on positioning cues from cell types other than B cells). In a revised manuscript, we will fix the labeling error and further discuss the GL7 issue, while taking care not to be thought to conclude that there is a positioning problem or derepression of GL7 (an activation antigen on T cells as well as B cells).

      While the gating strategy for an overall population of GC B cells is fairly standard even in the current literature, the question about using CD138 staining to exclude early plasmablasts (i.e., analyze B220<sup>+</sup> CD138<sup>neg</sup> vs B220<sup>+</sup> CD138<sup>+</sup>) is interesting. In addition, some papers like to use GL7<sup>+</sup> CD38<sup>neg</sup> for GC B cells instead of GL7<sup>+</sup> Fas (CD95)<sup>+</sup>, and we thank the reviewer for suggesting the analysis of centroblasts and centrocytes. For the revision, we will try to secure resources to revisit the immunizations and analyze them for these other facets of GC B cells (including CXCR4/CD86) and for their GL7<sup>+</sup> CD38<sup>neg</sup>, B220<sup>+</sup> CD138<sup>-</sup> and B220<sup>+</sup> CD138<sup>+</sup> cell populations.

      We agree that comparison of the Rosa26-CreERT2 results to those with B cell-specific loss-of-function raises a tantalizing possibility that Tfh cells also are influenced by PexRAP. Although the manuscript is focused on post-ontogenic B cell regulation in Ab responses, we hope to add to this B cell paper a new immunization experiment that scores Tfh cells and analyzes a few of their functional molecules, depending on the ability to wheedle enough support / fiscal resources.

      - The authors claim that Dhrs7b supports the homeostatic maintenance of quiescent B cells in vivo and promotes effective proliferation. This conclusion is primarily based on experiments where CTV-labeled PexRAP-deficient B cells were adoptively transferred into μMT mice (Fig. 2D-F). However, we recommend reviewing the flow plots of CTV in Fig. 2E, as they appear out of scale. More importantly, the low recovery of PexRAP-deficient B cells post-adoptive transfer weakens the robustness of the results and is insufficient to conclusively support the role of PexRAP in B cell proliferation in vivo.

      In the revision, we will edit the text and try to adjust the digitized cytometry data to allow more dynamic range to the right side of the upper panels in Fig. 2E, and otherwise to improve the presentation of the in vivo CTV result. However, we feel impelled to push back respectfully on some of the concern raised here. First, it seems to gloss over the presentation of multiple facets of evidence. The conclusion about maintenance derives primarily from Fig. 2C, which shows a rapid, statistically significant decrease in B cell numbers (extending the finding of Fig. 1D, a more substantial decrease after a bit longer a period). As noted in the text, the rate of de novo B cell production does not suffice to explain the magnitude of the decrease.

      In terms of proliferation, we will improve presentation of the Methods, but the bottom line is that the recovery efficiency is not bad (compared to prior published work) inasmuch as transferred B cells do not uniformly home to spleen. In a setting where BAFF is in ample supply in vivo, we transferred equal numbers of cells that were equally labeled with CTV and counted B cells. The CTV result might be affected by the lower recovery of PexRAP-deficient B cells, but in general the frequencies of the CTV<sup>low</sup> divided population are not changed very much. However, it is precisely because of the pitfalls of in vivo analyses that we included complementary data with survival and proliferation in vitro. The proliferation was attenuated in PexRAP-deficient B cells in vitro; this evidence supports the conclusion that proliferation of PexRAP knockout B cells is reduced. It is likely that PexRAP-deficient B cells also have a defect in viability in vivo, as we observed reduced B cell numbers in PexRAP-deficient mice. As the reviewer noticed, the presence of a defect in cycling does, in the transfer experiments, limit the ability to interpret a lower yield of the B cell population after adoptive transfer into µMT recipient mice as evidence pertaining to death rates. We will edit the text of the revision with these points in mind.

      - In vitro stimulation experiments: These experiments need improvement. The authors have used anti-CD40 and BAFF for B cell stimulation; however, it would be beneficial to also include anti-IgM in the stimulation cocktail. In Fig. 2G, CTV plots do not show clear defects in proliferation, yet the authors quantify the percentage of cells with more than three divisions. These plots should clearly display the gating strategy. Additionally, details about histogram normalization and potential defects in cell numbers are missing. A more in-depth analysis of apoptosis is also required to determine whether the observed defects are due to impaired proliferation or reduced survival.

      As suggested by the reviewer, testing additional forms of B cell activation can help explore the generality (or lack thereof) of findings. We plan to test anti-IgM stimulation together with anti-CD40 + BAFF as well as anti-IgM + TLR7/8, and add the data to a revised and final manuscript.

      With regards to Fig. 2G (and 2H), in the revised manuscript we will refine the presentation (add a demonstration of the gating, and explicate histogram normalization of FlowJo).

      It is an interesting issue in bioscience, but in our presentation 'representative data' really are pretty representative, so a senior author is reminded of a comment Tak Mak made about a reduction (of proliferation, if memory serves) to 0.7 x control. [His point in a comment to referees at a symposium related that to a salary reduction by 30% :) A mathematical alternative is to point out that across four rounds of division for WT cells, a reduction to 0.7x efficiency at each cycle means about 1/4 as many progeny.] 
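      The arithmetic behind the "about 1/4 as many progeny" remark can be checked with a short sketch. It assumes a deliberately simple model in which the relative yield per division cycle is a constant 0.7 of wild type and compounds multiplicatively over four rounds; the function name is hypothetical.

```python
def relative_progeny(per_cycle_efficiency, rounds):
    """Progeny relative to control when each cycle yields a fixed fraction of control."""
    return per_cycle_efficiency ** rounds


ratio = relative_progeny(0.7, 4)
print(f"{ratio:.4f}")  # 0.2401, i.e. roughly one quarter of the control progeny
```

      A modest per-cycle deficit therefore compounds into a roughly fourfold difference in cell yield after four divisions, before any contribution from cell death is considered.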

      We will try to edit the revision (Methods, Legends, Results, Discussion) to better address the points of the last two sentences of the comment, and improve the details that could assist in replication or comparisons (e.g., if someone develops a PexRAP inhibitor as a potential therapeutic).

      For the present, please note that the cell numbers at the end of the cultures are currently shown in Fig 2, panel I. Analogous culture results are shown in Fig 8, panels I, J, albeit with harvesting at day 5 instead of day 4. So, a difference of ≥ 3x needs to be explained. As noted above, a division efficiency reduced to 0.7x normal might account for such a decrease, but in practice the data of Fig. 2I show that the number of PexRAP-deficient B cells at day 4 is similar to the number plated before activation, and yet there has been a reasonable number of divisions. So cell numbers in the culture of mutant B cells are constant because cycling is active but decreased and insufficient to allow increased numbers ("proliferation" in the true sense) as programmed death is increased. In line with this evidence, Fig 8G-H document higher death rates [i.e., frequencies of cleaved caspase3<sup>+</sup> cells and Annexin V<sup>+</sup> cells] of PexRAP-deficient B cells compared to controls. Thus, the in vitro data lead to the conclusion that both decreased division rates and increased death operate after this form of stimulation.

      An inference is that this is the case in vivo as well - note that recoveries differed by ~3x (Fig. 2D), and the decrease in divisions (presentation of which will be improved) was meaningful but of lesser magnitude (Fig. 2E, F).  

      Reviewer #2 (Public review):

      Summary:

      In this study, Cho et al. investigate the role of ether lipid biosynthesis in B cell biology, particularly focusing on GC B cell, by inducible deletion of PexRAP, an enzyme responsible for the synthesis of ether lipids.

      Strengths:

      Overall, the data are well-presented, the paper is well-written and provides valuable mechanistic insights into the importance of PexRAP enzyme in GC B cell proliferation.

      We appreciate this positive response and agree with the overview and summary of the paper's approaches and strengths.

      Weaknesses:

      More detailed mechanisms of the impaired GC B cell proliferation by PexRAP deficiency remain to be further investigated. In the minor part, there are issues with the interpretation of the data which might cause confusion for the readers.

      Issues about contributions of cell cycling and divisions on the one hand, and susceptibility to death on the other, were discussed above, amplifying on the current manuscript text. The aggregate data support a model in which both processes are impacted for mature B cells in general, and mechanistically the evidence and work focus on the increased ROS and modes of death. Although the data in Fig. 7 do provide evidence that GC B cells themselves are affected, we agree that resource limitations had militated against developing further evidence about cycling specifically for GC B cells. We hope to be able to obtain sufficient data from some specific analysis of proliferation in vivo (e.g., Ki67 or BrdU) as well as ROS and death ex vivo when harvesting new samples from mice immunized to analyze GC B cells for CXCR4/CD86, CD38, CD138 as indicated by Reviewer 1. As suggested by Reviewer 2, we will further discuss the possible mechanism(s) by which proliferation of PexRAP-deficient B cells is impaired. We also will edit the text of a revision where needed to enhance clarity of data interpretation - at a minimum, to be very clear that caution is warranted in assuming that GC B cells will exhibit the same mechanisms as cultured, in vitro-stimulated B cells.

    1. Author response:

      The following is the authors’ response to the original reviews

      Reviewer #1:

      Summary

      The authors develop a set of biophysical models to investigate whether a constant area hypothesis or a constant curvature hypothesis explains the mechanics of membrane vesiculation during clathrin-mediated endocytosis.

      Strengths

      The models that the authors choose are fairly well-described in the field and the manuscript is well-written.

      Thank you for your positive comments on our work.

      Weaknesses

      One thing that is unclear is what is new with this work. If the main finding is that the differences are in the early stages of endocytosis, then one wonders if that should be tested experimentally. Also, the role of clathrin assembly and adhesion are treated as mechanical equilibrium but perhaps the process should not be described as equilibria but rather a time-dependent process. Ultimately, there are so many models that address this question that without direct experimental comparison, it's hard to place value on the model prediction.

      Thank you for your insightful questions. We fully agree that distinguishing between the two models should ultimately be guided by experimental tests. This is precisely the motivation for including Fig. 5 in our manuscript, where we compare our theoretical predictions with experimental data. In the middle panel of Fig. 5, we observe that the predicted tip radius as a function of 𝜓<sub>𝑚𝑎𝑥</sub> from the constant curvature model (magenta curve) deviates significantly from both the experimental data points and the rolling median, highlighting the inconsistency of this model with the data.

      Regarding our treatment of clathrin assembly and membrane adhesion as mechanical equilibrium processes, our reasoning is based on a timescale separation argument. Clathrin assembly typically occurs over approximately 1 minute. In contrast, the characteristic relaxation time for a lipid membrane to reach mechanical equilibrium is given by 𝜏 = 𝜇𝑅<sub>0</sub><sup>2</sup>/𝜅, where 𝜇∼5 × 10<sup>-9</sup> 𝑁𝑠𝑚<sup>-1</sup> is the membrane viscosity, 𝑅<sub>0</sub> =50𝑛𝑚 is the vesicle size, and 𝜅=20 𝑘<sub>𝐵</sub>𝑇 is the bending rigidity. This yields a relaxation time of 𝜏≈1.5 × 10<sup>−4</sup>𝑠, which is several orders of magnitude shorter than the timescale of clathrin assembly. Therefore, it is reasonable to treat the membrane shape as being in mechanical equilibrium throughout the assembly process.
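      The timescale estimate can be reproduced numerically. The sketch below assumes the relaxation time takes the form 𝜏 = 𝜇𝑅<sub>0</sub><sup>2</sup>/𝜅, which is dimensionally consistent and matches the quoted value, and it assumes a temperature of 298 K to convert 𝑘<sub>𝐵</sub>𝑇 to joules (the response does not state a temperature). Variable names are illustrative.

```python
K_B = 1.380649e-23       # Boltzmann constant, J/K
T = 298.0                # assumed temperature in K (not stated in the response)

mu = 5e-9                # membrane viscosity, N s / m
r0 = 50e-9               # vesicle size, m
kappa = 20 * K_B * T     # bending rigidity, 20 k_B T expressed in J

tau = mu * r0**2 / kappa     # assumed form of the membrane relaxation time, s
assembly_time = 60.0         # clathrin assembly, roughly 1 minute, in s

print(f"tau = {tau:.2e} s")
print(f"assembly time / tau = {assembly_time / tau:.1e}")
```

      With these inputs the relaxation time comes out near 1.5 × 10<sup>−4</sup> s, more than five orders of magnitude below the assembly time, which is the separation the equilibrium argument relies on.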

      We believe the value of our model lies in the following key novelties:

      (1) Model novelty: We introduce an energy term associated with curvature generation, a contribution that is typically neglected in previous models.

      (2) Methodological novelty: We perform a quantitative comparison between theoretical predictions and experimental data, whereas most earlier studies rely on qualitative comparisons.

      (3) Results novelty: Our quantitative analysis enables us to unambiguously exclude the constant curvature hypothesis based on time-independent electron microscopy data.

      In the revised manuscript (line 141), we have added a statement about why we treat the clathrin assembly as in mechanical equilibrium.

      While an attempt is made to do so with prior published EM images, there is excessive uncertainty both in the data itself, as is usually the case, and in the methods that are used to symmetrize the data. This reviewer wonders about any goodness of fit when such uncertainty is taken into account.

      Author response: We thank the reviewer for raising this important point. We agree that there is uncertainty in the experimental data. Our decision to symmetrize the data is based on the following considerations:

      (1) The experimental data provide a one-dimensional membrane profile corresponding to a cross-sectional view. To reconstruct the full two-dimensional membrane surface, we must assume rotational symmetry.

      (2) In addition to symmetrization, we also average membrane profiles within a certain range of 𝜓<sub>𝑚𝑎𝑥</sub> values (see Fig. 5d). This averaging helps reduce the uncertainty (due to biological and experimental variability) inherent to individual measurements.

      (3) To further address the noise in the experimental data, we compare our theoretical predictions not only with individual data points but also with a rolling median, which provides a smoothed representation of the experimental trends.

      These steps are taken to ensure a more robust and meaningful comparison between theory and experiments.
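
      For readers unfamiliar with the rolling median mentioned in point (3), a minimal sketch is given below. It is not the analysis code used for Fig. 5: the window definition (a fixed number of consecutive points sorted by the x-value) and the toy numbers are assumptions chosen only to show how a single outlier is suppressed.

```python
from statistics import median

def rolling_median(xs, ys, window):
    """Median of y over sliding windows of `window` consecutive points sorted by x."""
    pairs = sorted(zip(xs, ys))
    centers, medians = [], []
    for i in range(len(pairs) - window + 1):
        chunk = pairs[i:i + window]
        centers.append(median(x for x, _ in chunk))
        medians.append(median(y for _, y in chunk))
    return centers, medians

# Hypothetical data: an angle-like x-value and a radius-like y-value with one outlier.
xs = [10, 20, 30, 40, 50]
ys = [5.0, 100.0, 6.0, 7.0, 8.0]

centers, medians = rolling_median(xs, ys, window=3)
print(centers, medians)
```

      The outlier at x = 20 does not appear in the smoothed curve, which is why a rolling median is a more robust reference than a rolling mean for noisy measurements.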

      In the revised manuscript (line 338), we have explained why we have to symmetrize the data:

      “To facilitate comparison between the axisymmetric membrane shapes predicted by the model and the non-axisymmetric profiles obtained from electron microscopy, we apply a symmetrization procedure to the experimental data, which consist of one-dimensional membrane profiles extracted from cross-sectional views, as detailed in Appendix 3 (see also Appendix 3--Fig. 1).”

      Reviewer #2:

      Summary

      In this manuscript, the authors employ theoretical analysis of an elastic membrane model to explore membrane vesiculation pathways in clathrin-mediated endocytosis. A complete understanding of clathrin-mediated endocytosis requires detailed insight into the process of membrane remodeling, as the underlying mechanisms of membrane shape transformation remain controversial, particularly regarding membrane curvature generation. The authors compare constant area and constant membrane curvature as key scenarios by which clathrins induce membrane wrapping around the cargo to accomplish endocytosis. First, they characterize the geometrical aspects of the two scenarios and highlight their differences by imposing coating area and membrane spontaneous curvature. They then examine the energetics of the process to understand the driving mechanisms behind membrane shape transformations in each model. In the latter part, they introduce two energy terms: clathrin assembly or binding energy, and curvature generation energy, with two distinct approaches for the latter. Finally, they identify the energetically favorable pathway in the combined scenario and compare their results with experiments, showing that the constant-area pathway better fits the experimental data.

      Thank you for your clear and comprehensive summary of our work.

      Strengths

      The manuscript is well-written, well-organized, and presents the details of the theoretical analysis with sufficient clarity. The calculations are valid, and the elastic membrane model is an appropriate choice for addressing the differences between the constant curvature and constant area models.

      The authors' approach of distinguishing two distinct free energy terms-clathrin assembly and curvature generation-and then combining them to identify the favorable pathway is both innovative and effective in addressing the problem.

      Notably, their identification of the energetically favorable pathways, and how these pathways either lead to full endocytosis or fail to proceed due to insufficient energetic drives, is particularly insightful.

      Thank you for your positive remarks regarding the innovative aspects of our work.

      Weaknesses and Recommendations

      Weakness: Membrane remodeling in cellular processes is typically studied in either a constant area or constant tension ensemble. While total membrane area is preserved in the constant area ensemble, membrane area varies in the constant tension ensemble. In this manuscript, the authors use the constant tension ensemble with a fixed membrane tension, σe. However, they also use a constant area scenario, where 'area' refers to the surface area of the clathrin-coated membrane segment. This distinction between the constant membrane area ensemble and the constant area of the coated membrane segment may cause confusion.

      Recommendation: I suggest the authors clarify this by clearly distinguishing between the two concepts by discussing the constant tension ensemble employed in their theoretical analysis.

      Thank you for raising this question.

      In the revised manuscript (line 136), we have added a sentence, emphasizing the implication of the term “constant area model”:

      “We emphasize that the constant area model refers to the assumption that the clathrin-coated area 𝑎<sub>0</sub> remains fixed. Meanwhile, the membrane tension 𝜎<sub>𝑒</sub> at the base is held constant, allowing the total membrane area 𝐴 to vary in response to deformations induced by the clathrin coat.”

      Weakness: As mentioned earlier, the theoretical analysis is performed in the constant membrane tension ensemble at a fixed membrane tension. The total free energy E_tot of the system consists of membrane bending energy E_b and tensile energy E_t, which depends on membrane tension, σe. Although the authors mention the importance of both E_b and E_t, they do not present their individual contributions to the total energy changes. Comparing these contributions would enable readers to cross-check the results with existing literature, which primarily focuses on the role of membrane bending rigidity and membrane tension.

      Recommendation: While a detailed discussion of how membrane tension affects their results may fall outside the scope of this manuscript, I suggest the authors at least discuss the total membrane area variation and the contribution of tensile energy E_t for the singular value of membrane tension used in their analysis.

      Thank you for the insightful suggestion. In the revised manuscript (line 916), we have added Appendix 6 and a supplementary figure to compare the bending energy 𝐸<sub>𝑏</sub> and the tension energy 𝐸<sub>𝑡</sub>. Our analysis shows that both energy components exhibit an energy barrier between the flat and vesiculated membrane states, with the tension energy contributing more significantly than the bending energy.

      In the revised manuscript (line 151), we have also added one paragraph explaining why we set the dimensionless tension . This choice is motivated by our use of the characteristic length as the length scale, and as the energy scale. In this way, the dimensionless tension energy is written as

      Where is the dimensionless area.

      Weakness: The authors introduce two different models, (1,1) and (1,2), for generating membrane curvature. Model 1 assumes a constant curvature growth, corresponding to linear curvature growth, while Model 2 relates curvature growth to its current value, resembling exponential curvature growth. Although both models make physical sense in general, I am concerned that Model 2 may lead to artificial membrane bending at high curvatures. Normally, for intermediate bending, ψ > 90, the bending process is energetically downhill and thus proceeds rapidly. However, Model 2's assumption would accelerate curvature growth even further. This is reflected in the endocytic pathways represented by the green curves in the two rightmost panels of Fig. 4a, where the energy steeply increases at large ψ. I believe a more realistic version of Model 2 would require a saturation mechanism to limit curvature growth at high curvatures.

      Recommendation 1: I suggest the authors discuss this point and highlight the pros and cons of Model 2. Specifically, addressing the potential issue of artificial membrane bending at high curvatures and considering the need for a saturation mechanism to limit excessive curvature growth. A discussion on how Model 2 compares to Model 1 in terms of physical relevance, especially in the context of high curvature scenarios, would provide valuable insights for the reader.

      Thank you for raising the question of excessive curvature growth in our models and the constructive suggestion of introducing a saturation mechanism. In the revised manuscript (line 405), following your recommendation, we have added a subsection “Saturation effect at high membrane curvatures” in the discussion to clarify the excessive curvature issue and a possible way to introduce a saturation mechanism:

      “Note that our model involves two distinct concepts of curvature growth. The first is the growth of imposed curvature — referred to here as intrinsic curvature and denoted by the parameter 𝑐<sub>0</sub> — which is driven by the reorganization of bonds between clathrin molecules within the coat. The second is the growth of the actual membrane curvature, reflected by the increasing value of 𝜓<sub>𝑚𝑎𝑥</sub>.

      The latter process is driven by the former.

      Models (1,1) and (1,2) incorporate energy terms (Equation 6) that promote the increase of intrinsic curvature 𝑐<sub>0</sub>, which in turn drives the membrane to adopt a more curved shape (increasing 𝜓<sub>𝑚𝑎𝑥</sub>). In the absence of these energy contributions, the system faces an energy barrier separating a weakly curved membrane state (low 𝜓<sub>𝑚𝑎𝑥</sub>) from a highly curved state (high 𝜓<sub>𝑚𝑎𝑥</sub>). This barrier can be observed, for example, in the red curves of Figure 3(a–c) and in Appendix 6—Figure 1. As a result, membrane bending cannot proceed spontaneously and requires additional energy input from clathrin assembly.

      The energy terms described in Equation 6 serve to eliminate this energy barrier by lowering the energy difference between the uphill and downhill regions of the energy landscape. However, these same terms also steepen the downhill slope, which may lead to overly aggressive curvature growth.

      To mitigate this effect, one could introduce a saturation-like energy term of the form:

      where 𝑐<sub>𝑠</sub> represents a saturation curvature. Importantly, adding such a term would not alter the conclusions of our study, since the energy landscape already favors high membrane curvature (i.e., it is downward sloping) even without the additional energy terms.”

      Recommendation 2: Referring to the previous point, the green curves in the two rightmost panels of Fig. 4a seem to reflect a comparison between slow and fast bending regimes. The initial slow vesiculation (with small curvature growth) in the left half of the green curves is followed by much more rapid curvature growth beyond a certain threshold. A similar behavior is observed in Model 1, as shown by the green curves in the two rightmost panels of Fig. 4b. I believe this transition between slow and fast bending warrants a brief discussion in the manuscript, as it could provide further insight into the dynamic nature of vesiculation.

      Thank you for your constructive suggestion regarding the transition between slow and fast membrane bending. As you pointed out, in both Fig. 4a (model (1,2)) and Fig. 4b (model (1,1)), the green curves tend to extend vertically at the late stage. This suggests a significant increase in 𝑐<sub>0</sub> on the free energy landscape. However, we remain cautious about directly interpreting this vertical trend as indicative of fast endocytic dynamics, since our model is purely energetic and does not explicitly incorporate kinetic details. Meanwhile, we agree with your observation that the steep decrease in free energy along the green curve could correspond to an acceleration in dynamics. To address this point, we have added a paragraph in the revised manuscript (in Subsection “Cooperativity in the curvature generation process”) discussing this potential transition and its consistency with experimental observations (line 395):

      “Furthermore, although our model is purely energetic and does not explicitly incorporate dynamics, we observe in Figure 3(a) that along the green curve, which represents the trajectory predicted by model (1,2), the total free energy (𝐸<sub>𝑡𝑜𝑡</sub>) exhibits a much sharper decrease at the late stage (near the vesiculation line) compared to the early stage (near the origin). This suggests a transition from slow to fast dynamics during endocytosis. Such a transition is consistent with experimental observations, where significantly fewer images with large 𝜓<sub>𝑚𝑎𝑥</sub> are captured compared to those with small 𝜓<sub>𝑚𝑎𝑥</sub> (Mund et al., 2023).”

      The geometrical properties of both the constant-area and constant-curvature scenarios, as well depicted in Fig. 1, are somewhat straightforward. I wonder what additional value is presented in Fig. 2. Specifically, the authors solve differential shape equations to show how Rt and Rcoat vary with the angle ψ, but this behavior seems predictable from the simple schematics in Fig. 1. Using a more complex model for an intuitively understandable process may introduce counter-intuitive results and unnecessary complications, as seen with the constant-curvature model where Rt varies (the tip radius is not constant, as noted in the text) despite being assumed constant. One could easily assume a constant-curvature model and plot Rt versus ψ. I wonder what the added value is of solving shape equations to measure geometrical properties, compared to a simpler schematic approach (without solving shape equations) similar to what they do in App. 5 for the ratio of Rt at ψ=30 and 150.

      Thank you for raising this important question. While simple and intuitive theoretical models are indeed convenient to use, their validity must be carefully assessed. The approximate model becomes inaccurate when the clathrin shell significantly deviates from its intrinsic shape, namely a spherical cap characterized by intrinsic curvature 𝑐<sub>0</sub>. As shown in the insets of Fig. 2b and 2c (red line and black points), our comparison between the simplified model and the full model demonstrates that the simple model provides a good approximation under the constant-area constraint. However, it performs poorly under the constant-curvature constraint, and the deviation between the full model and the simplified model becomes more pronounced as 𝑐<sub>0</sub> increases.

      In the revised manuscript, we have added a sentence emphasizing the discrepancy between the exact calculation and the idealized picture for the constant curvature model (line 181):

      “For the constant-curvature model, the ratio remains close to 1 only at small values of 𝑐<sub>0</sub>, as expected from the schematic representation of the model in Figure 1. However, as 𝑐<sub>0</sub> increases, the deviation from this idealized picture becomes increasingly pronounced.”

      Recommendation: Clathrin-mediated endocytosis aims at wrapping cellular cargos such as viruses, which are typically spherical objects that perfectly match the constant-curvature scenario. In this context, wrapping nanoparticles by vesicles resembles constant-curvature membrane bending in endocytosis. In particular, analogous shape transitions and energy barriers have been reported (similar to Fig. 3 of the manuscript) using similar theoretical frameworks by varying the membrane-particle binding energy acting against membrane bending:

      DOI: 10.1021/la063522m

      DOI: 10.1039/C5SM01793A

      I think a short comparison to particle wrapping by vesicles is warranted.

      Thank you for your constructive suggestion to compare our model with particle wrapping. In the revised manuscript (line 475), we have added a subsection “Comparison with particle wrapping” in the discussion:

      “The purpose of the clathrin-mediated endocytosis studied in our work is the recycling of membrane and membrane proteins, and the cellular uptake of small molecules from the environment, that is, molecules that are sufficiently small to bind to the membrane or be encapsulated within a vesicle. In contrast, the uptake of larger particles typically involves membrane wrapping driven by adhesion between the membrane and the particle, a process that has also been studied previously (Góźdź, 2007; Bahrami et al., 2016). In our model, membrane bending is driven by clathrin assembly, which induces curvature. In particle wrapping, by comparison, the driving force is the adhesion between the membrane and a rigid particle. In the absence of adhesion, wrapping increases both bending and tension energies, creating an energy barrier that separates the flat membrane state from the fully wrapped state. This barrier can hinder complete wrapping, resulting in partial or no engulfment of the particle. Only when the adhesion energy is sufficiently strong can the process proceed to full wrapping. In this context, adhesion plays a role analogous to curvature generation in our model, as both serve to overcome the energy barrier. If the particle is spherical, it imposes a constant-curvature pathway during wrapping. However, the role of clathrin molecules in this process remains unclear and will be the subject of future investigation.”

      Minor points:

      Line 20, abstract, "....a continuum spectrum ..." reads better.

      Line 46 "...clathrin results in the formation of pentagons ...." seems to be grammatically correct.

      Line 106, proper citation of the relevant literature is warranted here.

      Line 111, the authors compare features (plural) between experiments and calculations. I would write "....compare geometric features calculated by theory with those ....".

      Line 124, "Here, we choose a ..." (with comma after Here).

      Line 134, "The membrane tension \sigma_e and bending rigidity \kappa define a ...."

      Line 295, "....tip radius, and invagination ...." (with comma before and).

      Line 337, "abortive tips, and ..." (with comma before and).

      We thank you for your thorough review of our manuscript and have corrected all the issues raised.

    1. Author response:

      The following is the authors’ response to the original reviews.

      Recommendations for the Authors:

      (1) Clarify Mechanistic Interpretations

      (a) Provide stronger evidence or a more cautious interpretation regarding whether intracellular BK-CaV1.3 ensembles are precursors to plasma membrane complexes.

      This is an important point. We adjusted the interpretation regarding intracellular BK-Ca<sub>V</sub>1.3 hetero-clusters as precursors to plasma membrane complexes to reflect a more cautious stance, acknowledging the limitations of available data. We added the following to the manuscript.

      “Our findings suggest that BK and Ca<sub>V</sub>1.3 channels begin assembling intracellularly before reaching the plasma membrane, shaping their spatial organization and potentially facilitating functional coupling. While this suggests a coordinated process that may contribute to functional coupling, further investigation is needed to determine the extent to which these hetero-clusters persist upon membrane insertion.”

      (b) Discuss the limitations of current data in establishing the proportion of intracellular complexes that persist on the cell surface.

      We appreciate the suggestion. We expanded the discussion to address the limitations of current data in determining the proportion of intracellular complexes that persist on the cell surface. We added the following to the manuscript.

      “Our findings highlight the intracellular assembly of BK-Ca<sub>V</sub>1.3 hetero-clusters, though limitations in resolution and organelle-specific analysis prevent precise quantification of the proportion of intracellular complexes that ultimately persist on the cell surface. While our data confirm that hetero-clusters form before reaching the plasma membrane, it remains unclear whether all intracellular hetero-clusters transition intact to the membrane or undergo rearrangement or disassembly upon insertion. Future studies utilizing live-cell tracking and high-resolution imaging will be valuable in elucidating the fate and stability of these complexes after membrane insertion.”

      (2) Refine mRNA Co-localization Analysis

      (a) Include appropriate controls using additional transmembrane mRNAs to better assess the specificity of BK and CaV1.3 mRNA co-localization.

      We agree with the reviewers that these controls are essential. We now explain more clearly the controls used to address this concern. We added the following to the manuscript.

      “To explore the origins of the initial association, we hypothesized that the two proteins are translated near each other, which could be detected as the colocalization of their mRNAs (Figure 5A and B). The experiment was designed to detect single mRNA molecules from INS-1 cells in culture. We performed multiplex in situ hybridization experiments using an RNAScope fluorescence detection kit to be able to image three mRNAs simultaneously in the same cell and acquired the images in a confocal microscope with high resolution. To rigorously assess the specificity of this potential mRNA-level organization, we used multiple internal controls. GAPDH mRNA, a highly expressed housekeeping gene with no known spatial coordination with channel mRNAs, served as a baseline control for nonspecific colocalization due to transcript abundance. To evaluate whether the spatial proximity between BK mRNA (KCNMA1) and Ca<sub>V</sub>1.3 mRNA (CACNA1D) was unique to functionally coupled channels, we also tested for Na<sub>V</sub>1.7 mRNA (SCN9A), a transmembrane sodium channel expressed in INS-1 cells but not functionally associated with BK. This allowed us to determine whether the observed colocalization reflected a specific biological relationship rather than shared expression context. Finally, to test whether this proximity might extend to other calcium sources relevant to BK activation, we probed the mRNA of ryanodine receptor 2 (RyR2), another Ca<sup>2+</sup> channel known to interact structurally with BK channels [32]. Together, these controls were chosen to distinguish specific mRNA colocalization patterns from random spatial proximity, shared subcellular distribution, or gene expression level artifacts.”

      (b) Quantify mRNA co-localization in both directions (e.g., BK with CaV1.3 and vice versa) and account for differences in expression levels.

      We thank the reviewer for this suggestion. We chose to quantify mRNA co-localization in the direction most relevant to the formation of functionally coupled hetero-clusters, namely, the proximity of BK (KCNMA1) mRNA to Ca<sub>V</sub>1.3 (CACNA1D) mRNA. Since BK channel activation depends on calcium influx provided by nearby Ca<sub>V</sub>1.3 channels, this directional analysis more directly informs the hypothesis of spatially coordinated translation and channel assembly. To address potential confounding effects of transcript abundance, we implemented a scrambled control approach in which the spatial coordinates of KCNMA1 mRNAs were randomized while preserving transcript count. This control resulted in significantly lower colocalization with CACNA1D mRNA, indicating that the observed proximity reflects a specific spatial association rather than expression-driven overlap. We also assessed colocalization of CACNA1D with KCNMA1, GAPDH, and SCN9A (Na<sub>V</sub>1.7) mRNAs; as shown in the graph below, these data support the same conclusion but were not included in the manuscript.

      Author response image 1.
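
      The logic of the scrambled control can be illustrated with the sketch below. It is not the analysis pipeline used in the study: the nearest-neighbour distance criterion, the uniform randomization over a rectangular field of view, the distance threshold, and the synthetic puncta coordinates are all assumptions made for illustration.

```python
import math
import random

def fraction_colocalized(points_a, points_b, max_dist):
    """Fraction of points in A with at least one point of B within max_dist."""
    if not points_a:
        return 0.0
    hits = 0
    for ax, ay in points_a:
        if any(math.hypot(ax - bx, ay - by) <= max_dist for bx, by in points_b):
            hits += 1
    return hits / len(points_a)

def scramble(points, width, height, rng):
    """Randomize positions uniformly in the field of view, preserving the count."""
    return [(rng.uniform(0, width), rng.uniform(0, height)) for _ in points]

rng = random.Random(0)
width = height = 100.0   # field of view, arbitrary units (hypothetical)
max_dist = 1.0           # colocalization threshold, arbitrary units (hypothetical)

# Synthetic puncta: each KCNMA1 spot is placed close to one CACNA1D spot.
cacna1d = [(rng.uniform(0, width), rng.uniform(0, height)) for _ in range(30)]
kcnma1 = [(x + rng.uniform(-0.5, 0.5), y + rng.uniform(-0.5, 0.5)) for x, y in cacna1d]

observed = fraction_colocalized(kcnma1, cacna1d, max_dist)
scrambled = [
    fraction_colocalized(scramble(kcnma1, width, height, rng), cacna1d, max_dist)
    for _ in range(200)
]
scrambled_mean = sum(scrambled) / len(scrambled)

print(f"observed: {observed:.2f}, scrambled mean: {scrambled_mean:.3f}")
```

      Because the scrambled sets keep the same number of puncta, any drop in colocalization relative to the observed value cannot be attributed to transcript abundance alone.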

      (c) Consider using ER labeling as a spatial reference when analyzing mRNA localization

      We thank the reviewers for this suggestion. Rather than using ER labeling as a spatial reference, we assess BK and CaV1.3 mRNA localization using single-molecule fluorescence in situ hybridization (smFISH) alongside BK protein immunostaining. This approach directly identifies BK-associated translation sites, ensuring that observed mRNA localization corresponds to active BK synthesis rather than general ER association. By evaluating BK protein alongside its mRNA, we provide a more functionally relevant measure of spatial organization, allowing us to assess whether BK is synthesized in proximity to CaV1.3 mRNA within micro-translational complexes. The text added to the manuscript is as follows.

      “To further investigate whether KCNMA1 and CACNA1D are localized in regions of active translation (Figure 7A), we performed RNAScope targeting KCNMA1 and CACNA1D alongside immunostaining for BK protein. This strategy enabled us to visualize transcript-protein colocalization in INS-1 cells with subcellular resolution. By directly evaluating sites of active BK translation, we aimed to determine whether newly synthesized BK protein colocalized with CACNA1D mRNA signals (Figure 7A). Confocal imaging revealed distinct micro-translational complexes in which KCNMA1 mRNA puncta overlapped with BK protein signals and were located adjacent to CACNA1D mRNA (Figure 7B). Quantitative analysis showed that 71 ± 3% of all KCNMA1 puncta colocalized with BK protein signal, indicating that they are undergoing active translation. Interestingly, 69 ± 3% of the KCNMA1 in active translation colocalized with CACNA1D (Figure 7C), supporting the existence of functional micro-translational complexes between BK and Ca<sub>V</sub>1.3 channels.”

      (3) Improve Terminology and Definitions

      (a) Clarify and consistently use terms like "ensemble," "cluster," and "complex," especially in quantitative analyses.

      We agree with the reviewers, and we clarified terminology such as 'ensemble,' 'cluster,' and 'complex' and used them consistently throughout the manuscript, particularly in quantitative analyses, to enhance precision and avoid ambiguity.  

      (b) Consider adopting standard nomenclature (e.g., "hetero-clusters") to avoid ambiguity.

      We agree with the reviewers, and we adopted standard nomenclature, such as 'hetero-clusters,' in the manuscript to improve clarity and reduce ambiguity.

      (4) Enhance Quantitative and Image Analysis

      (a) Clearly describe how colocalization and clustering were measured in super-resolution data.

      We thank the reviewers for this suggestion. We have modified the Methods section to provide a clearer description of how colocalization and clustering were measured in our super-resolution data. Specifically, we now detail the image processing steps, including binary conversion, channel multiplication for colocalization assessment, and density-based segmentation for clustering analysis. These updates ensure transparency in our approach and improve accessibility for readers, and we added the following to the manuscript.

      “Super-resolution imaging: 

      Direct stochastic optical reconstruction microscopy (dSTORM) images of BK and Ca<sub>V</sub>1.3 overexpressed in tsA-201 cells were acquired using an ONI Nanoimager microscope equipped with a 100X oil immersion objective (1.4 NA), an XYZ closed-loop piezo stage, and triple emission channels split at 488, 555, and 640 nm. Samples were imaged at 35°C. For single-molecule localization microscopy, fixed and stained cells were imaged in GLOX imaging buffer containing 10 mM β-mercaptoethylamine (MEA), 0.56 mg/ml glucose oxidase, 34 μg/ml catalase, and 10% w/v glucose in Tris-HCl buffer. Single-molecule localizations were filtered using NImOS software (v.1.18.3, ONI). Localization maps were exported as TIFF images with a pixel size of 5 nm. Maps were further processed in ImageJ (NIH) by thresholding and binarization to isolate labeled structures. To assess colocalization between the signal from two proteins, binary images were multiplied. Particles smaller than 400 nm<sup>2</sup> were excluded from the analysis to reflect the spatial resolution limit of STORM imaging (20 nm) and the average size of BK channels. To examine spatial localization preference, binary images of BK were progressively dilated to 20 nm, 40 nm, 60 nm, 80 nm, 100 nm, and 200 nm to expand their spatial representation. These modified images were then multiplied with the Ca<sub>V</sub>1.3 channel to quantify colocalization and determine BK occupancy at increasing distances from Ca<sub>V</sub>1.3. To ensure consistent comparisons across distance thresholds, data were normalized using the 200 nm measurement as the highest reference value, set to 1.”
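
      The dilation-and-multiplication step described above can be sketched in plain Python as follows. This is an illustration, not the ImageJ workflow used in the study: binary images are represented as sets of (row, col) pixels, multiplication of two binary images becomes a set intersection, the disk-shaped dilation element and the use of overlap area as the measured quantity are assumptions, and the size filter for particles below 400 nm<sup>2</sup> is omitted for brevity.

```python
PIXEL_NM = 5  # pixel size of the exported localization maps, from the Methods text

def dilate(pixels, radius_nm, pixel_nm=PIXEL_NM):
    """Grow a binary mask (a set of (row, col) pixels) by radius_nm using a disk."""
    r = int(round(radius_nm / pixel_nm))
    offsets = [
        (dr, dc)
        for dr in range(-r, r + 1)
        for dc in range(-r, r + 1)
        if dr * dr + dc * dc <= r * r
    ]
    return {(row + dr, col + dc) for row, col in pixels for dr, dc in offsets}

def overlap_profile(bk, cav, distances_nm=(20, 40, 60, 80, 100, 200)):
    """Overlap of dilated BK with CaV1.3, normalized to the largest distance."""
    # Intersecting two pixel sets is equivalent to multiplying two binary images.
    overlaps = [len(dilate(bk, d) & cav) for d in distances_nm]
    reference = overlaps[-1]
    return [o / reference if reference else 0.0 for o in overlaps]

# Hypothetical masks: two 4 x 4 pixel particles (400 nm^2 each) separated by a 35 nm gap.
bk = {(row, col) for row in range(4) for col in range(4)}
cav = {(row, col) for row in range(4) for col in range(10, 14)}

profile = overlap_profile(bk, cav)
print(profile)
```

      In this toy case the normalized overlap rises from zero at 20 nm to its maximum by 60 nm, which is the kind of distance dependence the normalization to the 200 nm value is meant to make comparable across cells.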

      (b) Where appropriate, quantify the proportion of total channels involved in ensembles within each compartment.

      We thank the reviewers for this comment. However, our method does not allow for direct quantification of the total number of BK and Ca<sub>V</sub>1.3 channels expressed within the ER or ER exit sites, as we rely on proximity-based detection rather than absolute fluorescence intensity measurements of individual channels. Traditional methods for counting total channel populations, such as immunostaining or single-molecule tracking, are not applicable to our approach due to the hetero-cluster formation process. Instead, we focused on the relative proportion of BK and Ca<sub>V</sub>1.3 hetero-clusters within these compartments, as this provides meaningful insights into trafficking dynamics and spatial organization. By assessing where hetero-clusters preferentially localize rather than attempting to count total channel numbers, we can infer whether their assembly occurs before plasma membrane insertion. While this approach does not yield absolute quantification of ER-localized BK and Ca<sub>V</sub>1.3 channels, it remains a robust method for investigating hetero-cluster formation and intracellular trafficking pathways. To reflect this limitation, we added the following to the manuscript.

      “Finally, a key limitation of this approach is that we cannot quantify the proportion of total BK or Ca<sub>V</sub>1.3 channels engaged in hetero-clusters within each compartment. The PLA method provides proximity-based detection, which reflects relative localization rather than absolute channel abundance within individual organelles”.

      (5) Temper Overstated Claims

      (a) Revise language that suggests the findings introduce a "new paradigm," instead emphasizing how this study extends existing models.

      We agree with the reviewers, and we have revised the language to avoid implying a 'new paradigm.' The following is the significance statement.

      “This work examines the proximity between BK and Ca<sub>V</sub>1.3 molecules at the level of their mRNAs and newly synthesized proteins to reveal that these channels interact early in their biogenesis. Two cell models were used: a heterologous expression system to investigate the steps of protein trafficking and a pancreatic beta cell line to study the localization of endogenous channel mRNAs. Our findings show that BK and Ca<sub>V</sub>1.3 channels begin assembling intracellularly before reaching the plasma membrane, revealing new aspects of their spatial organization. This intracellular assembly suggests a coordinated process that contributes to functional coupling.”

      (b) Moderate conclusions where the supporting data are preliminary or correlative.

      We agree with the reviewers, and we have moderated conclusions in instances where the supporting data are preliminary or correlative, ensuring a balanced interpretation. We added the following to the manuscript. 

      “This study provides novel insights into the organization of BK and Ca<sub>V</sub>1.3 channels in heteroclusters, emphasizing their assembly within the ER, at ER exit sites, and within the Golgi. Our findings suggest that BK and Ca<sub>V</sub>1.3 channels begin assembling intracellularly before reaching the plasma membrane, shaping their spatial organization, and potentially facilitating functional coupling. While this suggests a coordinated process that may contribute to functional coupling, further investigation is needed to determine the extent to which these hetero-clusters persist upon membrane insertion. While our study advances the understanding of BK and Ca<sub>V</sub>1.3 heterocluster assembly, several key questions remain unanswered. What molecular machinery drives this colocalization at the mRNA and protein level? How do disruptions to complex assembly contribute to channelopathies and related diseases? Additionally, a deeper investigation into the role of RNA binding proteins in facilitating transcript association and localized translation is warranted”.

      (6) Address Additional Technical and Presentation Issues

      (a) Include clearer figure annotations, especially for identifying PLA puncta localization (e.g., membrane vs. intracellular).

      We agree with the reviewers, and we have updated the figures to include clearer annotations that distinguish PLA puncta localized at the membrane versus those within intracellular compartments.

      (b) Reconsider the scale and arrangement of image panels to better showcase the data.

      We agree with the reviewers, and we have adjusted the scale and layout of the image panels to enhance data visualization and readability. Enlarged key regions now provide better clarity of critical features.

      (c) Provide precise clone/variant information for BK and CaV1.3 channels used.

      We thank the reviewers for their suggestion, and we now provide precise information regarding the BK and Ca<sub>V</sub>1.3 channel constructs used in our experiments, including their Addgene plasmid numbers and relevant variant details. These have been incorporated into the Methods section to ensure reproducibility and transparency. We added the following to the manuscript. 

      “The Ca<sub>V</sub>1.3 α subunit construct used in our study corresponds to the rat Ca<sub>V</sub>1.3e splice variant containing exons 8a, 11, 31b, and 42a, with a deletion of exon 32. The BK channel construct used in this study corresponds to the VYR splice variant of the mouse BKα subunit (KCNMA1)”.

      (d) Correct typographical errors and ensure proper figure/supplementary labeling throughout.

      Typographical errors have been corrected, and figure/supplementary labeling has been reviewed for accuracy throughout the manuscript.

      (7) Expand the Discussion

      (a) Include a brief discussion of findings such as BK surface expression in the absence of CaV1.3.

      We thank the reviewers for their suggestion. We expanded the Discussion to include a brief analysis of BK surface expression in the absence of Ca<sub>V</sub>1.3. We included the following in the manuscript. 

      “BK Surface Expression and Independent Trafficking Pathways

      BK surface expression in the absence of Ca<sub>V</sub>1.3 indicates that its trafficking does not strictly rely on Ca<sub>V</sub>1.3-mediated interactions. Since BK channels can be activated by multiple calcium sources, their presence in intracellular compartments suggests that their surface expression is governed by intrinsic trafficking mechanisms rather than direct calcium-dependent regulation. While some BK and Ca<sub>V</sub>1.3 hetero-clusters assemble into signaling complexes intracellularly, other BK channels follow independent trafficking pathways, demonstrating that complex formation is not obligatory for all BK channels. Differences in their transport kinetics further reinforce the idea that their intracellular trafficking is regulated through distinct mechanisms. Studies have shown that BK channels can traffic independently of Ca<sub>V</sub>1.3, relying on alternative calcium sources for activation [13, 41]. Additionally, Ca<sub>V</sub>1.3 exhibits slower synthesis and trafficking kinetics than BK, emphasizing that their intracellular transport may not always be coordinated. These findings suggest that BK and Ca<sub>V</sub>1.3 exhibit both independent and coordinated trafficking behaviors, influencing their spatial organization and functional interactions”.

      (b) Clarify why certain colocalization comparisons (e.g., ER vs. ER exit sites) are not directly interpretable.

      We thank the reviewer for their suggestion. A clarification has been added to the Results and Discussion sections of the manuscript explaining why colocalization comparisons, such as ER versus ER exit sites, are not directly interpretable. We included the following in the manuscript.

      “Results:

      To determine whether the observed colocalization between BK–Ca<sub>V</sub>1.3 hetero-clusters and the ER was not simply due to the extensive spatial coverage of ER labeling, we labeled ER exit sites using Sec16-GFP and probed for hetero-clusters with PLA. This approach enabled us to test whether the hetero-clusters were preferentially localized to ER exit sites, which are specialized trafficking hubs that mediate cargo selection and direct proteins from the ER into the secretory pathway. In contrast to the more expansive ER network, which supports protein synthesis and folding, ER exit sites ensure efficient and selective export of proteins to their target destinations”.

      “By quantifying the proportion of BK and Ca<sub>V</sub>1.3 hetero-clusters relative to total channel expression at ER exit sites, we found 28 ± 3% colocalization in tsA-201 cells and 11 ± 2% in INS-1 cells (Figure 3F). While the percentage of colocalization between hetero-clusters and the ER or ER exit sites alone cannot be directly compared to infer trafficking dynamics, these findings reinforce the conclusion that hetero-clusters reside within the ER and suggest that BK and Ca<sub>V</sub>1.3 channels traffic together through the ER and exit in coordination”.

      “Colocalization and Trafficking Dynamics

      The colocalization of BK and Ca<sub>V</sub>1.3 channels in the ER and at ER exit sites before reaching the Golgi suggests a coordinated trafficking mechanism that facilitates the formation of multi-channel complexes crucial for calcium signaling and membrane excitability [37, 38]. Given the distinct roles of these compartments, colocalization at the ER and ER exit sites may reflect transient proximity rather than stable interactions. Their presence in the Golgi further suggests that posttranslational modifications and additional assembly steps occur before plasma membrane transport, providing further insight into hetero-cluster maturation and sorting events. By examining BK-Ca<sub>V</sub>1.3 hetero-cluster distribution across these trafficking compartments, we ensure that observed colocalization patterns are considered within a broader framework of intracellular transport mechanisms [39]. Previous studies indicate that ER exit sites exhibit variability in cargo retention and sorting efficiency [40], emphasizing the need for careful evaluation of colocalization data. Accounting for these complexities allows for a robust assessment of signaling complex formation and trafficking pathways”.

      Reviewer #1 (Recommendations for the authors):

      In addition to the general aspects described in the public review, I list below a few points with the hope that they will help to improve the manuscript: 

      (1) Page 3: "they bind calcium delimited to the point of entry at calcium channels", better use "sources" 

      We agree with the reviewer. The phrasing on Page 3 has been updated to use 'sources' instead of 'the point of entry at calcium channels' for clarity.

      (2) Page 3 "localized supplies of intracellular calcium", I do not like this term, but maybe this is just silly.

      We agree with the reviewer. The term 'localized supplies of intracellular calcium' on Page 3 has been revised to 'localized calcium sources'.

      (3) Regarding the definitions stated by the authors: How do you distinguish between "ensembles" corresponding to "coordinated collection of BK and Cav channels" and "assembly of BK clusters with Cav clusters"? I believe that hetero-clusters is more adequate. The nomenclature does not respond to any consensus in the protein biology field, and I find that it introduces bias more than it helps. I would stick to heteroclusters nomenclature that has been used previously in the field. Moreover, in some discussion sections, the term "ensemble" is used in ways that border on vague, especially when talking about "functional signaling complexes" or "ensembles forming early." It's still acceptable within context but could benefit from clearer language to distinguish ensemble (structural proximity) from complex (functional consequence).

      We agree with the reviewer. We recognize the importance of precise nomenclature and have adopted 'hetero-clusters' instead of 'ensembles' to align with established conventions in the field. This term refers specifically to the spatial organization of BK and Ca<sub>V</sub>1.3 channels, whereas 'functional complexes' denotes mechanistic interactions. We have revised sections where 'ensemble' was used ambiguously to ensure a clear distinction between structure and function.

      The definition of "cluster" is clearly stated early but less emphasized in later quantitative analyses (e.g., particle size discussions in Figure 7). Figure 8 is equally confusing, graphs D and E referring to "BK ensembles" and "Cav ensembles", but "ensembles" should refer to combinations of both channels, whereas these seem to be "clusters". In fact, the Figure legend mentions "clusters".

      We agree with the reviewer. Terminology has been revised throughout the manuscript to ensure consistency, with 'clusters' used appropriately in quantitative analyses and figure descriptions.

      (4) Methods: how are clusters ("ensembles") analysed from the STORM data? What is the logarithm used for? More info about this is required. Equally, more information and discussion about how colocalization is measured and interpreted in superresolution microscopy are required.

      We thank the reviewer for their suggestion. Additional details have been incorporated into the Methods section to clarify how clusters ('ensembles') are analyzed from STORM data, including the role of the logarithm in processing. Furthermore, we have expanded the discussion to provide more information on how colocalization is measured and interpreted in super-resolution microscopy. We included the following in the manuscript.

      “Direct stochastic optical reconstruction microscopy (dSTORM) images of BK and Ca<sub>V</sub>1.3 overexpressed in tsA-201 cells were acquired using an ONI Nanoimager microscope equipped with a 100X oil immersion objective (1.4 NA), an XYZ closed-loop piezo stage, and triple emission channels split at 488, 555, and 640 nm. Samples were imaged at 35°C. For single-molecule localization microscopy, fixed and stained cells were imaged in GLOX imaging buffer containing 10 mM β-mercaptoethylamine (MEA), 0.56 mg/ml glucose oxidase, 34 μg/ml catalase, and 10% w/v glucose in Tris-HCl buffer. Single-molecule localizations were filtered using NImOS software (v.1.18.3, ONI). Localization maps were exported as TIFF images with a pixel size of 5 nm. Maps were further processed in ImageJ (NIH) by thresholding and binarization to isolate labeled structures. To assess colocalization between the signal from two proteins, binary images were multiplied. Particles smaller than 400 nm<sup>2</sup> were excluded from the analysis to reflect the spatial resolution limit of STORM imaging (20 nm) and the average size of BK channels. To examine spatial localization preference, binary images of BK were progressively dilated to 20 nm, 40 nm, 60 nm, 80 nm, 100 nm, and 200 nm to expand their spatial representation. These modified images were then multiplied with the Ca<sub>V</sub>1.3 channel to quantify colocalization and determine BK occupancy at increasing distances from Ca<sub>V</sub>1.3. To ensure consistent comparisons across distance thresholds, data were normalized using the 200 nm measurement as the highest reference value, set to 1”.
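      The dilation-and-multiplication analysis described in the quoted Methods text can be illustrated with a minimal sketch. This is not the authors' ImageJ workflow: the function names (`dilate_once`, `dilate_nm`, `overlap`, `occupancy_curve`, `min_particle_pixels`) are hypothetical, the cross-shaped (4-connected) structuring element is an assumption, and "dilated to 20 nm" is read here as growing each BK particle by 20 nm in every direction. Only the 5 nm pixel size, the 400 nm<sup>2</sup> size cutoff, the distance steps, and the normalization to the 200 nm value come from the text.

```python
PIXEL_NM = 5  # pixel size of the exported localization maps (from the Methods text)


def min_particle_pixels(area_nm2=400, pixel_nm=PIXEL_NM):
    """Convert the 400 nm^2 size cutoff into a pixel count (16 px at 5 nm/px)."""
    return area_nm2 / (pixel_nm ** 2)


def dilate_once(img):
    """Dilate a binary image (list of rows of 0/1) by one pixel.

    Assumption: a 4-connected, cross-shaped structuring element.
    """
    h, w = len(img), len(img[0])
    out = [row[:] for row in img]
    for y in range(h):
        for x in range(w):
            if img[y][x]:
                for dy, dx in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                    ny, nx = y + dy, x + dx
                    if 0 <= ny < h and 0 <= nx < w:
                        out[ny][nx] = 1
    return out


def dilate_nm(img, distance_nm, pixel_nm=PIXEL_NM):
    """Dilate by a physical distance, applied as repeated one-pixel dilations."""
    for _ in range(round(distance_nm / pixel_nm)):
        img = dilate_once(img)
    return img


def overlap(a, b):
    """Multiply two binary images pixel by pixel and count the overlapping pixels."""
    return sum(pa * pb for ra, rb in zip(a, b) for pa, pb in zip(ra, rb))


def occupancy_curve(bk, cav, distances_nm=(20, 40, 60, 80, 100, 200)):
    """Overlap of dilated BK with Ca_V1.3, normalized to the largest distance (set to 1)."""
    raw = {d: overlap(dilate_nm(bk, d), cav) for d in distances_nm}
    ref = raw[max(distances_nm)]
    return {d: (v / ref if ref else 0.0) for d, v in raw.items()}
```

      In this sketch, a Ca<sub>V</sub>1.3 pixel contributes to the curve only once the dilated BK mask reaches it, so the normalized values rise toward 1 as the distance threshold increases.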

      (5) Related to Figure 2:

      (a) Why use an antibody to label GFP when PH-PLCdelta should be a membrane marker? Where is the GFP in PH-PLCdelta (intracellular, extracellular)? Images in Figure 2E are confusing; there is a green intracellular signal.

      We thank the reviewer for their feedback. To clarify, GFP is fused to the N-terminus of PH-PLCδ and primarily localizes to the inner plasma membrane via PIP2 binding. Residual intracellular GFP signal may reflect non-membrane-bound fractions or background from anti-GFP immunostaining. We added a paragraph explaining the use of the anti-GFP antibody to the Proximity ligation assay subsection of the Methods section.

      (b) The images in Figure 2 do not help to understand how the authors select the PLA puncta located at the plasma membrane. How do the authors do this? A useful solution would be to indicate in Figure 2 an example of the PLA signals that are considered "membrane signals" compared to another example with "intracellular signals". Perhaps this was intended with the current Figure, but it is not clear.

      We agree with the reviewer. We have added a sentence to explain how the number of PLA puncta at the plasma membrane was calculated. 

      “We visualized the plasma membrane with a biological sensor tagged with GFP (PH-PLCδ-GFP) and then probed it with an antibody against GFP (Figure 2E). By analyzing the GFP signal, we created a mask that represented the plasma membrane. The mask served to distinguish between the PLA puncta located inside the cell and those at the plasma membrane, allowing us to calculate the number of PLA puncta at the plasma membrane”.
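      The mask-based sorting of PLA puncta can be sketched as follows. This is an illustration under stated assumptions rather than the authors' macro: the intensity threshold used to build the mask, the representation of puncta as (row, column) centroid coordinates, and the function names `membrane_mask` and `classify_puncta` are all hypothetical.

```python
def membrane_mask(gfp, threshold):
    """Build a boolean plasma-membrane mask from the anti-GFP signal.

    Assumption: a single global intensity threshold defines the membrane.
    """
    return [[px >= threshold for px in row] for row in gfp]


def classify_puncta(puncta, mask):
    """Count puncta whose centroid (row, col) falls on or off the membrane mask."""
    at_membrane = sum(1 for y, x in puncta if mask[y][x])
    return {"membrane": at_membrane, "intracellular": len(puncta) - at_membrane}
```

      Dividing either count by the total number of puncta gives the percentage of puncta at the plasma membrane versus inside the cell.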

      (c) Figure 2C: What is the negative control? Apologies if it is described somewhere, but I seem not to find it in the manuscript.

      We thank the reviewer for their question. For the negative control in Figure 2C, BK was probed using the primary antibody without co-staining for Ca<sub>V</sub>1.3 or other proteins, which rules out non-specific antibody binding and background fluorescence as sources of the signal. A sentence clarifying this control has been added to the Results section.

      “To confirm specificity, a negative control was performed by probing only for BK using the primary antibody, ensuring that detected signals were not due to non-specific binding or background fluorescence”.

      (d) What is the resolution in z of the images shown in Figure 2? This is relevant for the interpretation of signal localization.

      The z-resolution of the images shown in Figure 2 was approximately 270–300 nm, based on the Zeiss Airyscan system’s axial resolution capabilities. Imaging was performed with a step size of 300 nm, ensuring adequate sampling for signal localization while maintaining optimal axial resolution.

      “In a different experiment, we analyzed the puncta density for each focal plane of the cell (step size of 300 nm) and compared the puncta at the plasma membrane to the rest of the cell”.

      (e) % of total puncta in PM vs inside cell are shown for transfected cells, what is this proportion in INS-1 cells?

      This quantification was performed for transfected cells; however, we have not conducted the same analysis in INS-1 cells. Future experiments could address this to determine potential differences in puncta distribution between endogenous and overexpressed conditions.

      (6) Related to Figure 3:

      (a) Figure 3B: is this antibody labelling or GFP fluorescence? Why do they use GFP antibody labelling, if the marker already has its own fluorescence? This should at least be commented on in the manuscript.

      We thank the reviewer for their concern. In Figure 3B, GFP was labeled using an antibody rather than relying on its intrinsic fluorescence. This approach was necessary because GFP fluorescence does not withstand the PLA protocol, resulting in significant fading. Antibody labeling provided stronger signal intensity and improved resolution, ensuring optimal signal-to-noise ratio for accurate analysis.

      A clarification regarding the use of GFP antibody labeling in Figure 3B has been added to the Methods section, explaining that intrinsic GFP fluorescence does not endure the PLA protocol, necessitating antibody-based detection for improved signal and resolution. We added the following to the manuscript.

      “For PLA combined with immunostaining, PLA was followed by a secondary antibody incubation with Alexa Fluor-488 at 2 μg/ml for 1 hour at 21˚C. Since GFP fluorescence fades significantly during the PLA protocol, resulting in reduced signal intensity and poor image resolution, GFP was labeled using an antibody rather than relying on its intrinsic fluorescence”.

      (b) Why is it relevant to study the ER exit sites? Some explanation should be included in the main text (page 11) for clarification to non-specialized readers. Again, the quantification should be performed on the proportion of clusters/ensembles out of the total number of channels expressed at the ER (or ER exit sites).

      We thank the reviewer for their feedback. We have modified this section to include a more detailed explanation of the relevance of ER exit sites to protein trafficking. ER exit sites serve as specialized sorting hubs that regulate the transition of proteins from the ER to the secretory pathway, distinguishing them from the broader ER network, which primarily facilitates protein synthesis and folding. This additional context clarifies why studying ER exit sites provides valuable insights into hetero-cluster trafficking dynamics.

      Regarding quantification, our method does not allow for direct measurement of the total number of BK and Ca<sub>V</sub>1.3 channels expressed at the ER or ER exit sites. Instead, we focused on the proportion of hetero-clusters localized within these compartments, which provides insight into trafficking pathways despite the limitation in absolute channel quantification. We included the following in the manuscript in the Results section. 

      “To determine whether the observed colocalization between BK–Ca<sub>V</sub>1.3 hetero-clusters and the ER was not simply due to the extensive spatial coverage of ER labeling, we labeled ER exit sites using Sec16-GFP and probed for hetero-clusters with PLA. This approach enabled us to test whether the hetero-clusters were preferentially localized to ER exit sites, which are specialized trafficking hubs that mediate cargo selection and direct proteins from the ER into the secretory pathway. In contrast to the more expansive ER network, which supports protein synthesis and folding, ER exit sites ensure efficient and selective export of proteins to their target destinations”.

      “By quantifying the proportion of BK and Ca<sub>V</sub>1.3 hetero-clusters relative to total channel expression at ER exit sites, we found 28 ± 3% colocalization in tsA-201 cells and 11 ± 2% in INS-1 cells (Figure 3F). While the percentage of colocalization between hetero-clusters and the ER or ER exit sites alone cannot be directly compared to infer trafficking dynamics, these findings reinforce the conclusion that hetero-clusters reside within the ER and suggest that BK and Ca<sub>V</sub>1.3 channels traffic together through the ER and exit in coordination”.

      (7) Related to Figure 4:

      A control is included to confirm that the formation of BK-Cav1.3 ensembles is not unspecific. Association with a protein from the Golgi (58K) is tested. Why is this control only done for Golgi? No similar experiment has been performed in the ER. This aspect should be commented on.

      We thank the reviewer for their suggestion. We selected the Golgi as a control because it represents the final stage of protein trafficking before proteins reach their functional destinations. If BK and Ca<sub>V</sub>1.3 hetero-cluster formation is specific at the Golgi, this suggests that their interaction is maintained throughout earlier trafficking steps, including within the ER. While we did not perform an equivalent control experiment in the ER, the Golgi serves as an effective checkpoint for evaluating specificity within the broader protein transport pathway. We included the following in the manuscript.

      “We selected the Golgi as a control because it represents the final stage of protein trafficking, ensuring that hetero-cluster interactions observed at this point reflect specificity maintained throughout earlier trafficking steps, including within the ER”.

      (8) How is colocalization measured, eg, in Figure 6? Are the images shown in Figure 6 representative? This aspect would benefit from a clearer description.

      We thank the reviewer for their suggestion. A section clarifying colocalization measurement and the representativeness of Figure 6 images has been added to the Methods under Data Analysis. We included the following in the manuscript.

      “For PLA and RNAscope experiments, we used custom-made macros written in ImageJ. Processing of PLA data included background subtraction. To assess colocalization, fluorescent signals were converted into binary images, and channels were multiplied to identify spatial overlap”.
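      The three steps named above (background subtraction, binarization, channel multiplication) can be sketched in a few lines. The authors used ImageJ macros, so this Python version is only an illustration: the constant background value, the threshold, the choice to report overlap as a fraction of channel A, and the function names are all assumptions.

```python
def subtract_background(img, background):
    """Subtract a constant background (assumed uniform) and clip at zero."""
    return [[max(px - background, 0) for px in row] for row in img]


def binarize(img, threshold):
    """Convert an intensity image into a binary (0/1) image."""
    return [[1 if px > threshold else 0 for px in row] for row in img]


def colocalized_fraction(ch_a, ch_b, background, threshold):
    """Fraction of channel-A positive pixels that are also positive in channel B."""
    a = binarize(subtract_background(ch_a, background), threshold)
    b = binarize(subtract_background(ch_b, background), threshold)
    # Multiplying binary images keeps only pixels that are positive in both channels.
    product = [[pa * pb for pa, pb in zip(ra, rb)] for ra, rb in zip(a, b)]
    n_a = sum(map(sum, a))
    return sum(map(sum, product)) / n_a if n_a else 0.0
```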

      (9) The text should be revised for typographical errors, for example:

      (a) Summary "evidence of" (CHECK THIS ONE)

      We agree with the reviewer, and we corrected the typographical errors.

      (b) Table 1, row 3: "enriches" should be "enrich"

      We agree with the reviewer. The term 'enriches' in Table 1, row 3 has been corrected to 'enrich'.

      (c) Figure 2B "priximity"

      We agree with the reviewer. The typographical error in Figure 2B has been corrected from 'priximity' to 'proximity'.

      (d) Legend of Figure 7 (C) "size of BK and Cav1.3 channels". Does this correspond to individual channels or clusters?

      We agree with the reviewer. The legend of Figure 7C has been clarified to indicate that 'size of BK and Cav1.3 channels' refers to clusters rather than individual channels.

      (e) Methods: In the RNASCOPE section, "Fig.4-supp1" should be "Fig. 5-supp1"

      (f) Page 15, Figure 5B is cited, should be Figure 6B

      We agree with the reviewer. The reference in the RNASCOPE section has been updated from 'Fig.4-supp1' to 'Fig. 5-supp1,' and the citation on Page 15 has been corrected from Figure 5B to Figure 6B.

      Reviewer #2 (Recommendations for the authors):

      (1) The abstract could be more accessible for a wider readership with improved flow.

      We thank the reviewer for their suggestion. We modified the summary as follows to provide a more coherent flow for a wider readership. 

      “Calcium binding to BK channels lowers BK activation threshold, substantiating functional coupling with calcium-permeable channels. This coupling requires close proximity between different channel types, and the formation of BK–Ca<sub>V</sub>1.3 hetero-clusters at nanometer distances exemplifies this unique organization. To investigate the structural basis of this interaction, we tested the hypothesis that BK and Ca<sub>V</sub>1.3 channels assemble before their insertion into the plasma membrane. Our approach incorporated four strategies: (1) detecting interactions between BK and Ca<sub>V</sub>1.3 proteins inside the cell, (2) identifying membrane compartments where intracellular hetero-clusters reside, (3) measuring the proximity of their mRNAs, and (4) assessing protein interactions at the plasma membrane during early translation. These analyses revealed that a subset of BK and Ca<sub>V</sub>1.3 transcripts are spatially close in micro-translational complexes, and their newly synthesized proteins associate within the endoplasmic reticulum (ER) and Golgi. Comparisons with other proteins, transcripts, and randomized localization models support the conclusion that BK and Ca<sub>V</sub>1.3 hetero-clusters form before their insertion at the plasma membrane”.

      (2) Figure 2B - spelling of proximity.

      We agree with the reviewer. The typographical error in Figure 2B has been corrected from 'priximity' to 'proximity'.

      Reviewer #3 (Recommendations for the authors):

      Minor issues to improve the manuscript:

      (1) For completeness, the authors should include a few sentences and appropriate references in the Introduction to mention that BK channels are regulated by auxiliary subunits.

      We agree with the reviewer. We have revised the Introduction to include a brief discussion of how BK channel function is modulated by auxiliary subunits and provided appropriate references to ensure completeness. These additions highlight the broader regulatory mechanisms governing BK channel activity, complementing the focus of our study. We included the following in the manuscript. 

      “Additionally, BK channels are modulated by auxiliary subunits, which fine-tune BK channel gating properties to adapt to different physiological conditions. β and γ subunits regulate BK channel kinetics, altering voltage sensitivity and calcium responsiveness [18]. These interactions ensure precise control over channel activity, allowing BK channels to integrate voltage and calcium signals dynamically in various cell types. Here, we focus on the selective assembly of BK channels with Ca<sub>V</sub>1.3 and do not evaluate the contributions of auxiliary subunits to BK channel organization.”

      (2) Insert a space between 'homeostasis' and the square bracket at the end of the Introduction's second paragraph.

      We agree with the reviewer. A space has been inserted between 'homeostasis' and the square bracket in the second paragraph of the Introduction for clarity.

      (3) The images presented in Figures 2-5 should be increased in size (if permitted by the Journal) to allow the reader to clearly see the puncta in the fluorescent images. This would necessitate reconfiguring the figures into perhaps a full A4 page per figure, but I think the quality of the images presented really do deserve to "be seen". For example, Panels A & B could be at the top of Figure 2, with C & D presented below them. However, I'll leave it up to the authors to decide on the most aesthetically pleasing way to show these.

      We agree with the reviewer. We have increased the size of Figures 2–8 to enhance the visibility of fluorescent puncta, as suggested. To accommodate this, we reorganized the panel layout for each figure—for example, in Figure 2, Panels A and B are now placed above Panels C and D to support a more intuitive and aesthetically coherent presentation. We believe this revised configuration highlights the image quality and improves readability while conforming to journal layout constraints.

      (4) I think that some of the sentences could be "toned down"

      (a) eg, in the first paragraph below Figure 2, the authors state "that 46(plus minus)3% of the puncta were localised on intracellular membranes" when, at that stage, no data had been presented to confirm this. I think changing it to "that 46(plus minus)3% of the puncta were localised intracellularly" would be more precise.

      (b) Similarly, please consider replacing the wording of "get together at membranes inside the cell" to "co-localise intracellularly".

      (c) In the paragraph just before Figure 5, the authors mention that "the abundance of KCNMA1 correlated more with the abundance of CACNA1D than ... with GAPDH." Although this is technically correct, the R2 value was 0.22, which is exceptionally poor. I don't think that the paper is strengthened by sentences such as this, and perhaps the authors might tone this down to reflect this.

      (d) The authors clearly demonstrate in Figure 8 that a significant number of BK channels can traffic to the membrane in the absence of Cav1.3. Irrespective of the differences in transcription/trafficking time between the two channel types, the authors should insert a few lines into their discussion to take this finding into account.

      We appreciate the reviewer’s feedback regarding the clarity and precision of our phrasing.

      Our responses for each point are below.

      (a) We have modified the statement in the first paragraph below Figure 2, changing '46 ± 3% of the puncta were localized on intracellular membranes' to '46 ± 3% of the puncta were localized intracellularly', to ensure accuracy in the absence of explicit data confirming membrane association.

      (b) Similarly, we have replaced 'get together at membranes inside the cell' with 'colocalize intracellularly' to maintain clarity and avoid unintended implications. 

      (c) Regarding the correlation between KCNMA1 and CACNA1D abundance, we recognize that the R² value of 0.22 is relatively low. To reflect this appropriately, we have revised the phrasing to indicate that while a correlation exists, it is modest. We added the following to the manuscript. 

      “Interestingly, the abundance of KCNMA1 transcripts correlated more with the abundance of CACNA1D transcripts than with the abundance of GAPDH, a standard housekeeping gene, though with a modest R² value.”

      (d) To incorporate the findings from Figure 8, we have added discussion acknowledging that a substantial number of BK channels traffic to the membrane independently of Ca<sub>V</sub>1.3. This addition provides context for potential trafficking mechanisms that operate separately from hetero-cluster formation.

      (5) For clarity, please insert the word "total" in the paragraph after Figure 3 "..."63{plus minus}3% versus 50%{plus minus}6% of total PLA puncta were localised at the ER". I know this is explicitly stated later in the manuscript, but I think it needs to be clarified earlier.

      We agree with the reviewer. The word 'total' has been inserted in the paragraph following Figure 3 to clarify the percentage of PLA puncta localized at the ER earlier in the manuscript.

      (6) In the discussion, I think an additional (short) paragraph needs to be included to clarify to the reader why the % "colocalization between ensembles and the ER or the ER exit sites can't be compared or used to understand the dynamics of the ensembles". This may permit the authors to remove the last sentence of the paragraph just before the results section, "BK and Cav1.3 ensembles go through the Golgi."

      We thank the reviewer for their suggestion. We have added a short paragraph in the discussion to clarify why colocalization percentages between hetero-clusters and the ER or ER exit sites cannot be compared to infer hetero-cluster dynamics. This allowed us to remove the final sentence of the paragraph preceding the results section ('BK and Cav1.3 ensembles go through the Golgi').

      (7) In the paragraph after Figure 6, Figure 5B is inadvertently referred to. Please correct this to Figure 6B.

      We agree with the reviewer. The reference to Figure 5B in the paragraph after Figure 6 has been corrected to Figure 6B.

      (8) In the discussion under "mRNA co-localisation and Protein Trafficking", please insert a relevant reference illustrating that "disruption in mRNA localization... can lead to ion channel mislocalization".

      We agree with the reviewer. We have inserted a relevant reference under 'mRNA Colocalization and Protein Trafficking' to illustrate that disruption in mRNA localization can lead to ion channel mislocalization.

      (9) The supplementary Figures appear to be incorrectly numbered. Please correct and also ensure that they are correctly referred to in the text.

      We agree with the reviewer. The numbering of the supplementary figures has been corrected, and all references to them in the text have been updated accordingly.

      (10) The final panels of the currently labelled Figure 5-Supplementary 2 need to have labels A-F included on the image.

      We agree with the reviewer. Labels A-F have been added to the final panels of Figure 5-Supplementary 2.

      References

      (1) Shah, K.R., X. Guan, and J. Yan, Structural and Functional Coupling of Calcium-Activated BK Channels and Calcium-Permeable Channels Within Nanodomain Signaling Complexes. Frontiers in Physiology, 2022. Volume 12 - 2021.

      (2) Chen, A.L., et al., Calcium-Activated Big-Conductance (BK) Potassium Channels Traffic through Nuclear Envelopes into Kinocilia in Ray Electrosensory Cells. Cells, 2023. 12(17): p. 2125.

      (3) Berkefeld, H., B. Fakler, and U. Schulte, Ca2+-activated K+ channels: from protein complexes to function. Physiol Rev, 2010. 90(4): p. 1437-59.

      (4) Loane, D.J., P.A. Lima, and N.V. Marrion, Co-assembly of N-type Ca2+ and BK channels underlies functional coupling in rat brain. J Cell Sci, 2007. 120(Pt 6): p. 985-95.

      (5) Boncompain, G. and F. Perez, The many routes of Golgi-dependent trafficking. Histochemistry and Cell Biology, 2013. 140(3): p. 251-260.

      (6) Kurokawa, K. and A. Nakano, The ER exit sites are specialized ER zones for the transport of cargo proteins from the ER to the Golgi apparatus. The Journal of Biochemistry, 2019. 165(2): p. 109-114.

      (7) Chen, G., et al., BK channel modulation by positively charged peptides and auxiliary γ subunits mediated by the Ca2+-bowl site. Journal of General Physiology, 2023. 155(6).

    1. Author response:

      The following is the authors’ response to the original reviews.

      Reviewer #1 (Public review):

      Summary

      Lysine acetoacetylation (Kacac) is a recently discovered histone post-translational modification (PTM) connected to ketone body metabolism. This research outlines a chemo-immunological method for detecting Kacac, eliminating the requirement for creating new antibodies. The study demonstrates that acetoacetate acts as the precursor for Kacac, which is catalyzed by the acyltransferases GCN5, p300, and PCAF, and removed by the deacetylase HDAC3. Acetoacetyl-CoA synthetase (AACS) is identified as a central regulator of Kacac levels in cells. A proteomic analysis revealed 139 Kacac sites across 85 human proteins, showing the modification's extensive influence on various cellular functions. Additional bioinformatics and RNA sequencing data suggest a relationship between Kacac and other PTMs, such as lysine β-hydroxybutyrylation (Kbhb), in regulating biological pathways. The findings underscore Kacac's role in histone and non-histone protein regulation, providing a foundation for future research into the roles of ketone bodies in metabolic regulation and disease processes.

      Strengths 

      (1) The study developed an innovative method by using a novel chemo-immunological approach to the detection of lysine acetoacetylation. This provides a reliable method for the detection of specific Kacac using commercially available antibodies.

      (2) The research has done a comprehensive proteome analysis to identify unique Kacac sites on 85 human proteins by using proteomic profiling. This detailed landscape of lysine acetoacetylation provides a possible role in cellular processes.

      (3) The functional characterization of enzymes explores the activity of acetoacetyltransferase of key enzymes like GCN5, p300, and PCAF. This provides a deeper understanding of their function in cellular regulation and histone modifications.

      (4) The impact of acetyl-CoA and acetoacetyl-CoA on histone acetylation provides the differential regulation of acylations in mammalian cells, which contributes to the understanding of metabolic-epigenetic crosstalk.

      (5) The study examined acetoacetylation levels and patterns, which involve experiments using treatment with acetohydroxamic acid or lovastatin in combination with lithium acetoacetate, providing insights into the regulation of SCOT and HMGCR activities.

      We thank all the reviewers for their positive, insightful comments which have helped us improve our manuscript. We have revised the manuscript as suggested by the reviewers.

      Weakness 

      (1) There is a limitation to functional validation, related to the work on the biological relevance of identified acetoacetylation sites. Hence, the study requires certain functional validation experiments to provide robust conclusions regarding the functional implications of these modifications on cellular processes and protein function. For example, functional implications of the identified acetoacetylation sites on histone proteins would aid the interpretation of the results.

      We agree with the reviewer that investigating the functional role of individual histone Kacac sites is essential for understanding the epigenetic impact of Kacac marks on gene expression, signaling pathways, and disease mechanisms. This topic is out of the scope of this paper which focuses on biochemical studies and proteomics. Functional elucidation in specific pathways will be a critical direction for future investigation, ideally with the development of site-specific anti-Kacac antibodies.

      (2) The authors could have studied acetoacetylation patterns between healthy cells and disease models like cancer cells to investigate potential dysregulation of acetoacetylation in pathological conditions, which could provide insights into their PTM function in disease progression and pathogenesis.

      We appreciate the reviewer’s valuable suggestion. In our study, we measured Kacac levels in several types of cancer cell lines, including HCT116 (Fig. 2B), HepG2 (Supplementary Fig. S2), and HeLa cells (data not shown in the manuscript), and found that acetoacetate-mediated Kacac is broadly present in all these cancer cell lines. Our proteomics analysis linked Kacac to critical cellular functions, e.g. DNA repair, RNA metabolism, cell cycle regulation, and apoptosis, and identified promising targets that are actively involved in cancer progression, such as p53, HDAC1, HMGA2, MTA2, and LDHA. These findings suggest that Kacac has significant, non-negligible effects on cancer pathogenesis. We concur that exploring acetoacetylation patterns in cancer patient samples in comparison with normal cells represents a promising direction for next-step research. We plan to investigate these questions in future studies.

      (3) The time-course experiments could be performed following acetoacetate treatment to understand temporal dynamics, which can capture the acetoacetylation kinetic change, thereby providing a mechanistic understanding of the PTM changes and their regulatory mechanisms.

      As suggested, time-course experiments were performed, and the data have been included in the revised manuscript (Supplementary Fig. S2A).

      (4) The discussion section provides critical analysis of the results in the context of existing literature and offers insights into acetoacetylation's broader implications in histone modification. However, the study could also discuss the overlap of other post-translational modifications with Kacac sites and its implications for protein function.

      We appreciate the reviewer’s helpful suggestion. We have added more discussions on the impact of the Kacac overlap with other post-translational modifications in the discussion section of the revised manuscript.

      Impact

      The authors successfully identified novel acetoacetylation sites on proteins, expanding the understanding of this post-translational modification. The authors conducted experiments to validate the functional significance of acetoacetylation by studying its impact on histone modifications and cellular functions.

      We appreciate the reviewer’s comments.

      Reviewer #2 (Public review):

      In the manuscript by Fu et al., the authors developed a chemo-immunological method for the reliable detection of Kacac, a novel post-translational modification, and demonstrated that acetoacetate and AACS serve as key regulators of cellular Kacac levels. Furthermore, the authors identified the enzymatic addition of the Kacac mark by acyltransferases GCN5, p300, and PCAF, as well as its removal by deacetylase HDAC3. These findings indicate that AACS utilizes acetoacetate to generate acetoacetyl-CoA in the cytosol, which is subsequently transferred into the nucleus for histone Kacac modification. A comprehensive proteomic analysis has identified 139 Kacac sites on 85 human proteins. Bioinformatics analysis of Kacac substrates and RNA-seq data reveals the broad impacts of Kacac on diverse cellular processes and various pathophysiological conditions. This study provides valuable additional insights into the investigation of Kacac and would serve as a helpful resource for future physiological or pathological research.

      The following concerns should be addressed:

      (1) A detailed explanation is needed for selecting H2B (1-26) K15 sites over other acetylation sites when evaluating the feasibility of the chemo-immunological method.

      The primary reason for selecting the H2B (1–26) K15acac peptide to evaluate the feasibility of our chemo-immunological method is that H2BK15acac was one of the early discovered modification sites in our preliminary proteomic screening data. The panKbhb antibody used herein is independent of peptide sequence so different modification sites on histones can all be recognized. We have added the explanation to the manuscript.

      (2) In Figure 2(B), the addition of acetoacetate and NaBH4 resulted in an increase in Kbhb levels. Specifically, please investigate whether acetoacetylation is primarily mediated by acetoacetyl-CoA and whether acetoacetate can be converted into a precursor of β-hydroxybutyryl (bhb-CoA) within cells. Additional experiments should be included to support these conclusions.

      We appreciate the reviewer’s valuable comments. Our data show that acetoacetate treatment had very little effect on histone Kbhb levels in HEK293T cells, as observed in lanes 1–4 of Fig. 2A, demonstrating that acetoacetate contributes minimally to Kbhb generation. We concluded that histone Kacac is primarily mediated by acetoacetyl-CoA based on multiple pieces of evidence. First, we observed robust Kacac formation from acetoacetyl-CoA upon incubation with HATs and histone proteins or peptides, as confirmed by both western blotting (Figs. 3A, 3B; Supplementary Figs. S3C–S3F) and MALDI-MS analysis (Supplementary Fig. S4A). Second, treatment with hymeglusin, a specific inhibitor of hydroxymethylglutaryl-CoA synthase, which catalyzes the conversion of acetoacetyl-CoA to HMG-CoA, led to increased Kacac levels in HepG2 cells (PMID: 37382194). Third, we demonstrated that overexpression of AACS, whose function is to convert acetoacetate into acetoacetyl-CoA, leads to marked histone Kacac upregulation (Fig. 2E). Collectively, these findings strongly support the conclusion that acetoacetate promotes Kacac formation primarily via acetoacetyl-CoA.

      (3) In Figure 2(E), the amount of pan-Kbhb decreased upon acetoacetate treatment when SCOT or AACS was added, whereas this decrease was not observed with NaBH4 treatment. What could be the underlying reason for this phenomenon?

      In the groups without NaBH₄ treatment (lanes 5–8, Figure 2E), the Kbhb signal decreased upon the transient overexpression of SCOT or AACS, owing to protein loading variation in these two groups (lanes 7 and 8). Both Ponceau staining and anti-H3 results showed a lower amount of histones in the AACS- or SCOT-treated samples. On the other hand, no decrease in the Kbhb signal was observed in the NaBH₄-treated groups (lanes 1–4), because NaBH₄ treatment elevated Kacac levels, thereby compensating for the reduced histone loading. The most important conclusion from this experiment is that AACS overexpression increased Kacac levels, whereas SCOT overexpression had little or no effect on histone Kacac levels in HEK293T cells.

      (4) The paper demonstrates that p300, PCAF, and GCN5 exhibit significant acetoacetyltransferase activity and discusses the predicted binding modes of HATs (primarily PCAF and GCN5) with acetoacetyl-CoA. To validate the accuracy of these predicted binding models, it is recommended that the authors design experiments such as constructing and expressing protein mutants, to assess changes in enzymatic activity through western blot analysis.

      We appreciate the reviewer’s valuable suggestion. Our computational modeling shows that acetoacetyl-CoA adopts a binding mode similar to that of acetyl-CoA in the tested HATs. This conclusion is supported by experimental results showing that the addition of acetyl-CoA significantly competed for the binding of acetoacetyl-CoA to HATs, leading to reduced enzymatic activity in mediating Kacac (Fig. 3C). Further structural biology studies investigating the key amino acid residues involved in Kacac binding within the GCN5/PCAF binding pocket, in comparison to Kac binding, will be a key direction of future work.

      (5) HDAC3 shows strong de-acetoacetylation activity compared to its de-acetylation activity. Specific experiments should be added to verify the molecular docking results. The use of HPLC is recommended, in order to demonstrate that HDAC3 acts as an eraser of acetoacetylation and to support the above conclusions. If feasible, mutating critical amino acids on HDAC3 (e.g., His134, Cys145) and subsequently analyzing the HDAC3 mutants via HPLC and western blot can further substantiate the findings.

      We appreciate the reviewer’s helpful suggestion. In-depth characterization of HDAC3 and other HDACs is beyond the scope of this manuscript. In the future, we plan to investigate the enzymatic activity of recombinant HDAC3, including the roles of key amino acid residues and the catalytic mechanism underlying Kacac removal, and to compare its activity with that involved in Kac removal.

      (6) The resolution of the figures needs to be addressed in order to ensure clarity and readability.

      Edits have been made to enhance figure resolutions in the revised manuscript.

      Reviewer #3 (Public review):

      Summary:

      This paper presents a timely and significant contribution to the study of lysine acetoacetylation (Kacac). The authors successfully demonstrate a novel and practical chemo-immunological method using the reducing reagent NaBH4 to transform Kacac into lysine β-hydroxybutyrylation (Kbhb).

      Strengths:

      This innovative approach enables simultaneous investigation of Kacac and Kbhb, showcasing their potential in advancing our understanding of post-translational modifications and their roles in cellular metabolism and disease.

      Weaknesses:

      The paper's main weaknesses are the lack of SDS-PAGE analysis to confirm HATs purity and loading consistency, and the absence of cellular validation for the in vitro findings through knockdown experiments. These gaps weaken the evidence supporting the conclusions.

      We appreciate the reviewer’s positive comments on the quality of this work and its importance to the field. The SDS-PAGE results for the HAT proteins (Supplementary Fig. S3A) were added to the revised manuscript. The cellular roles of p300 and GCN5 as acetoacetyltransferases were confirmed in a recent study (PMID: 37382194). Their data are consistent with our studies herein and provide further support for our conclusion. We agree that knockdown experiments are essential to further validate the activities of these enzymes and plan to address this in future studies.

      Reviewer #1 (Recommendations for the authors):

      This study conducted the first comprehensive analysis of lysine acetoacetylation (Kacac) in human cells, identifying 139 acetoacetylated sites across 85 proteins in HEK293T cells. Kacac was primarily localized to the nucleus and associated with critical processes like chromatin organization, DNA repair, and gene regulation. Several previously unknown Kacac sites on histones were discovered, indicating its widespread regulatory role. Key enzymes responsible for adding and removing Kacac marks were identified: p300, GCN5, and PCAF act as acetoacetyltransferases, while HDAC3 serves as a remover. The modification depends on acetoacetate, with AACS playing a significant role in its regulation. Unlike Kbhb, Kacac showed unique cellular distribution and functional roles, particularly in gene expression pathways and metabolic regulation. Acetoacetate demonstrated distinct biological effects compared to β-hydroxybutyrate, influencing lipid synthesis, metabolic pathways, and cancer cell signaling. The findings suggest that Kacac is an important post-translational modification with potential implications for disease, metabolism, and cellular regulation.

      Major Concerns

      (1) The authors could expand the study by including different cell lines and also provide a comparative study using cell lines, such as normal vs disease (e.g., cancer cell lines), to compare and to increase the variability of acetoacetylation patterns across cell types. This could broaden the understanding of the regulation of PTMs in pathological conditions.

      We sincerely appreciate the reviewer’s valuable suggestions. We concur that a deeper investigation into Kacac patterns in cancer cell lines would significantly enhance understanding of Kacac in the human proteome. However, due to constraints such as limited resource availability, we are currently unable to conduct explorations as extensive as those proposed. Nonetheless, as shown in Fig. 2A, Fig. 2B, and Supplementary Fig. S2, our present data provide strong evidence for the widespread occurrence of acetoacetate-mediated Kacac in both normal and cancer cell lines. Notably, our proteomic profiling identified several promising targets implicated in cancer progression, including p53, HDAC1, HMGA2, MTA2, and LDHA. We plan to conduct more comprehensive explorations of acetoacetylation patterns in cancer samples in future studies.

      (2) The paper lacks inhibition studies silencing the enzyme genes or inhibiting the enzyme using available inhibitors involved in acetoacetylation or using aceto-acetate analogues to selectively modulate acetoacetylation levels. This can validate their impact on downstream cellular pathways in cellular regulation.

      We appreciate the reviewer’s valuable suggestions. Our study, along with the previous research, has conducted initial investigations into the inhibition of key enzymes involved in the Kacac pathway. For example, inhibition of HMGCS, which catalyzes the conversion of acetoacetyl-CoA to HMG-CoA, was shown to enhance histone Kacac levels (PMID: 37382194). In our study, we examined the inhibitory effects of SCOT and HMGCR, both of which potentially influence cellular acetoacetyl-CoA levels. However, their respective inhibitors did not significantly affect histone Kacac levels. We also investigated the role of acetyl-CoA, which competes with acetoacetyl-CoA for binding to HAT enzymes and can function as a competitive inhibitor in histone Kacac generation. Furthermore, inhibition of HDAC activity by SAHA led to increased histone Kacac levels in HepG2 cells (PMID: 37382194), supporting our conclusion that HDAC3 functions as the eraser responsible for Kacac removal. These inhibition studies confirmed the functions of these enzymes and provided insights into their regulatory roles in modulating Kacac and its downstream pathways. Further in-depth investigations will explore the specific roles of these enzymes in regulating Kacac within cellular pathways.

      (3) The authors could validate the functional impact of pathways using various markers through IHC/IFC or western blot to confirm their RNA-seq analysis, since pathways could be differentially regulated at the RNA vs protein level.

      We agree that pathways can be differentially regulated at the RNA and protein levels. It is our future plan to select and fully characterize one or two gene targets to elaborate the presence and impact of Kacac marks on their functional regulation at both the gene expression and protein level.

      (4) Utilize in vitro reconstitution assays to confirm the direct effect of acetoacetylation on histone modifications and nucleosome assembly, establishing a causal relationship between acetoacetylation and chromatin regulation.

      We appreciate this suggestion; this will be a valuable biophysics project for us and other researchers as a next step. We plan to do this and related work in a future paper to characterize the impact of lysine acetoacetylation on chromatin structure and gene expression. Site-specific labelling techniques will be required. We also hope to obtain monoclonal antibodies that directly recognize Kacac in histones to allow for ChIP-seq assays in cells.

      (5) The authors could provide a site-directed mutagenesis experiment by mutating a particular site, which can validate and address concerns regarding the specificity of a particular site involved in the mechanism.

      We agree that validating and characterizing the specificity of individual Kacac sites and understanding their functional implications are important for elucidating the mechanisms by which Kacac affects these substrate proteins. Such work will involve extensive biochemical and cellular studies. Our future goal is to select one or two gene targets and characterize them in depth, elaborating the presence and impact of Kacac on their functional regulation using comprehensive techniques (transfection, mutation, pulldown, pathway analysis, etc.).

      (6) If possible, the authors could use an in vivo model system, such as mice, to validate the physiological relevance of acetoacetylation in a more complex system.  

      We currently do not have access to resources of relevant animal models. We will conduct in vivo screening and characterization of protein acetoacetylation in animal models and clinical samples in collaboration with prospective collaborators.

      Minor Concerns

      (1) The authors could discuss the overlap of Kacac sites with other post-translational modifications and their implications on protein functions. They could provide comparative studies with other PTMs, which can improvise a comprehensive understanding of acetoacetylation function in epigenetic regulation.

      We have expanded the discussion in the revised manuscript to address the overlap between Kacac and other post-translational modifications, along with their potential functional implications.

      (2) The authors could provide detailed information on the implications of their data, which would enhance the impact of the research and its relevance to the scientific community. Specifically, they could clarify the acetoacetylation (Kacac) significance in nucleosome assembly and its correlation with RNA processing.

      In the revised manuscript, we have added more elaborations on the implication and significance of Kacac in nucleosome assembly and RNA processing.

      Reviewer #3 (Recommendations for the authors):

      Major Comments:

      (1) Figures 3A, 3B, Supplementary Figures S3A-D

      I could not find the SDS-PAGE analysis results for the purified HATs used in the in vitro assay. It is imperative to display these results to confirm consistent loading amounts and sufficient purity of the HATs across experimental groups. Additionally, I did not observe any data on CBP, even though it was mentioned in the results section. If CBP-related experiments were not conducted, please remove the corresponding descriptions.

      We appreciate the reviewer’s valuable suggestion. The SDS-PAGE results for the HAT proteins have been included, and the part in the results section discussing CBP has been updated according to the reviewer’s suggestion in the revised manuscript.

      (2) Knockdown of Selected HATs and HDAC3 in cells

      The authors should perform gene knockdown experiments in cells, targeting the identified HATs and HDAC3, followed by Western blot and mass spectrometry analysis of Kacac expression levels. This would validate whether the findings from the in vitro assays are biologically relevant in cellular contexts.

      We appreciate the reviewer’s valuable suggestion. Our identified HATs, including p300 and GCN5, were reported as acetoacetyltransferases in cellular contexts by a recent study (PMID: 37382194). Their findings are precisely consistent with our biochemical results, providing additional evidence that p300 and GCN5 mediate Kacac both in vitro and in vivo. In addition, inhibition of HDAC activity by SAHA greatly increased histone Kacac levels in HepG2 cells (PMID: 37382194), supporting the role of HDAC3 as an eraser responsible for Kacac removal. We plan to further study these enzymes’ contributions to Kacac through gene knockdown experiments and investigate the specific functions of enzyme-mediated Kacac under some pathological contexts.

      Minor Comments:

      (1) Abstract accuracy

      In the Abstract, the authors state, "However, regulatory elements, substrate proteins, and epigenetic functions of Kacac remain unknown." Please revise this statement to align with the findings in Reference 22 and describe these elements more appropriately. If similar issues exist in other parts of the manuscript, please address them as well.

      The issues have been addressed in the revised manuscript based on the reviewer's comments.

      (2) Terminology issue

      GCN5 and PCAF are both members of the GNAT family. It is not accurate to describe "GCN5/PCAF/HAT1" as one family. Please refine the terminology to reflect the classification accurately.

      The description has been refined in the revised manuscript to accurately reflect the classification, in accordance with the reviewer's suggestion.

      (3) Discussion on HBO1

      Reference 22 has already established HBO1 as an acetoacetyltransferase. This paper should include a discussion of HBO1 alongside the screened p300, PCAF, and GCN5 to provide a more comprehensive perspective.

      More discussion on HBO1 alongside the other screened HATs has been added in the revised manuscript.

    1. Author response:

      The following is the authors’ response to the original reviews.

      Reviewer #1 (Public review):

      Summary:

      In this manuscript, the authors analyze electrophysiological data recorded bilaterally from the rat hippocampus to investigate the coupling of ripple oscillations across the hemispheres. Commensurate with the majority of previous research, the authors report that ripples tend to co-occur across both hemispheres. Specifically, the amplitude of ripples across hemispheres is correlated but their phase is not. These data corroborate existing models of ripple generation suggesting that CA3 inputs (coordinated across hemispheres via the commissural fibers) drive the sharp-wave component while the individual ripple waves are the result of local interactions between pyramidal cells and interneurons in CA1.

      Strengths:

      The manuscript is well-written, the analyses well-executed and the claims are supported by the data.

      Weaknesses:

      One question left unanswered by this study is whether information encoded by the right and left hippocampi is correlated.

      Thank you for raising this important point. While our study demonstrates ripple co-occurrence across hemispheres, we did not directly assess whether the information encoded in each hippocampus is correlated. Addressing this question would require analyses of coordinated activity patterns, such as neuronal assemblies formed during novelty exposure, which fall beyond the scope of the present study. However, we agree this is an important avenue for future work, and we have now acknowledged this limitation and outlined it as a future direction in the Conclusion section (lines 796–802).

      Reviewer #2 (Public review):

      Summary:

      The authors completed a statistically rigorous analysis of the synchronization of sharp-wave ripples in the hippocampal CA1 across and within hemispheres. They used a publicly available dataset (collected in the Buzsaki lab) from 4 rats (8 sessions) recorded with silicon probes in both hemispheres. Each session contained approximately 8 hours of activity recorded during rest. The authors found that the characteristics of ripples did not differ between hemispheres, and that most ripples occurred almost simultaneously on all probe shanks within a hemisphere as well as across hemispheres. The differences in amplitude and exact timing of ripples between recording sites increased slightly with the distance between recording sites. However, the phase coupling of ripples (in the 100-250 Hz range), changed dramatically with the distance between recording sites. Ripples in opposite hemispheres were about 90% less coupled than ripples on nearby tetrodes in the same hemisphere. Phase coupling also decreased with distance within the hemisphere. Finally, pyramidal cell and interneuron spikes were coupled to the local ripple phase and less so to ripples at distant sites or the opposite hemisphere.

      Strengths:

      The analysis was well-designed and rigorous. The authors used statistical tests well suited to the hypotheses being tested, and clearly explained these tests. The paper is very clearly written, making it easy to understand and reproduce the analysis. The authors included an excellent review of the literature to explain the motivation for their study.

      Weaknesses:

      The authors state that their findings (highly coincident ripples between hemispheres), contradict other findings in the literature (in particular the study by Villalobos, Maldonado, and Valdes, 2017), but fail to explain why this large difference exists. They seem to imply that the previous study was flawed, without examining the differences between the studies.

      The paper fails to mention the context in which the data was collected (the behavior the animals performed before and after the analyzed data), which may in fact have a large impact on the results and explain the differences between the current study and that by Villalobos et al. The Buzsaki lab data includes mice running laps in a novel environment in the middle of two rest sessions. Given that ripple occurrence is influenced by behavior, and that the neurons spiking during ripples are highly related to the prior behavioral task, it is likely that exposure to novelty changed the statistics of ripples. Thus, the authors should analyze the pre-behavior rest and post-behavior rest sessions separately. The Villalobos et al. data, in contrast, was collected without any intervening behavioral task or novelty (to my knowledge). Therefore, I predict that the opposing results are a result of the difference in recent experiences of the studied rats, and can actually give us insight into the memory function of ripples.

      We appreciate this thoughtful hypothesis and have now addressed it explicitly. Our main analysis was conducted on 1-hour concatenated SWS epochs recorded before any novel environment exposure (baseline sleep). This was not clearly stated in the original manuscript, so we have now added a clarifying paragraph (lines 131–143). The main findings therefore remain unchanged.

      To directly test the reviewer’s hypothesis, we performed the suggested comparison between pre- and post-maze rest sessions, including maze type as a factor. These new analyses are now presented in a dedicated Results subsection (lines 475–493) and in Supplementary Figure 5.1. While we observed a modest increase in ripple abundance after the maze sessions, consistent with known experience-dependent changes in ripple occurrence, the key findings of interhemispheric synchrony remained unchanged. Both pre- and post-maze sleep sessions showed robust bilateral time-locking of ripple events and similar dissociations between phase and amplitude coupling across hemispheres.

      In one figure (5), the authors show data separated by session, rather than pooled. They should do this for other figures as well. There is a wide spread between sessions, which further suggests that the results are not as widely applicable as the authors seem to think. Do the sessions with small differences between phase coupling and amplitude coupling have low inter-hemispheric amplitude coupling, or high phase coupling? What is the difference between the sessions with low and high differences in phase vs. amplitude coupling? I noticed that the Buzsaki dataset contains data from rats running either on linear tracks (back and forth), or on circular tracks (unidirectionally). This could create a difference in inter-hemisphere coupling, because rats running on linear tracks would have the same sensory inputs to both hemispheres (when running in opposite directions), while rats running on a circular track would have different sensory inputs coming from the right and left (one side would include stimuli in the middle of the track, and the other would include closer views of the walls of the room). The synchronization between hemispheres might be impacted by how much overlap there was in sensory stimuli processed during the behavior epoch.

      Thank you for this insightful suggestion. In our new analyses comparing pre- and post-maze sessions, we have also addressed this question. Supplementary Figures 4.1 and 5.1 (E-F) present coupling metrics averaged per session and include coding for maze type. Additionally, we have incorporated the reviewer’s hypothesis regarding sensory input differences and their potential impact on inter-hemispheric synchronization into a new Results subsection (lines 475–493).

      The paper would be a lot stronger if the authors analyzed some of the differences between datasets, sessions, and epochs based on the task design, and wrote more about these issues. There may be more publicly available bi-hemispheric datasets to validate their results.

      To further validate our findings, we have analyzed another publicly available dataset that includes bilateral CA1 recordings (https://crcns.org/data-sets/hc/hc-18). We have added a description of this dataset and our analysis approach in the Methods section (lines 119–125 and 144-145), and present the corresponding results in a new Supplementary Figure (Supplementary Figure 4.2). These new analyses replicated our main findings, confirming robust interhemispheric time-locking of ripple events and a greater dissociation between phase and amplitude coupling in ipsilateral versus contralateral recordings.

      Reviewer #1 (Recommendations for the authors):

      My only suggestion is that the introduction can be shortened. The authors discuss in great length literature linking ripples and memory, although the findings in the paper are not linked to memory. In addition, ripples have been implicated in non-mnemonic functions such as sleep and metabolic homeostasis.

      The reviewer’s suggestion is valid and aligns with the main message of our paper. However, we believe that the relationship between ripples and memory has been extensively discussed in the literature, sometimes overshadowing other important functional roles (based on the reviewer’s comment, we now also refer to non-mnemonic functions of ripples in the revised introduction [lines 87–89]). Thus, we find it important to retain this context because highlighting the publication bias towards mnemonic interpretations helps frame the need for studies like ours that revisit still incompletely understood basic ripple mechanisms.

      We also note that, based on a suggestion from reviewer 2, we have supplemented our manuscript with a new figure demonstrating ripple abundance increases during SWS following novel environment exposure (Supplementary Figure 5.1), linking it to memory and replicating the findings of Eschenko et al. (2008), though we present this result as a covariate, aimed at controlling for potential sources of variation in ripple synchronization.

      Reviewer #2 (Recommendations for the authors):

      It would be useful to include more information about the analyzed dataset in the methods section, e.g. how long were the recordings, how many datasets per rat, did the authors analyze the entire recording epoch or sub-divide it in any way, how many ripples were detected per recording (approximately).

      We have now included more detailed information in the Methods section (lines 104 - 145).

      A few of the references to sub-figures are mislabeled (e.g. lines 327-328).

      Thank you for noticing these inconsistencies. We have carefully reviewed and corrected all figure sub-panel labels and references throughout the manuscript.

      In Figure 7 C&D, are the neurons on the left sorted by contralateral ripple phase? It doesn't look like it. It would be easier to compare to ipsilateral if they were.

      In Figures 7C and 7D, neurons are sorted by their ipsilateral peak ripple phase, with the contralateral data plotted using the same ordering to facilitate comparison. To avoid confusion, we have clarified this explicitly in the figure legend and corresponding main text (lines 544–550).

      In Figure 6, using both bin sizes 50 and 100 doesn't contribute much.

      We used both 50 ms and 100 ms bin sizes to directly compare with previous studies (Villalobos et al. 2017 used 5 ms and 100 ms; Csicsvari et al. 2000 used 5–50 ms). Because the proportion of coincident ripples is a non-decreasing function of the window size, larger bins can inflate coincidence measures. Including a mid-range bin of 50 ms allowed us to show that high coincidence levels are reached well before the 100 ms upper bound, supporting that the 100 ms window is not an overshoot. We have added clarification on this point in the Methods section on ripple coincidence (lines 204–212).
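
      The window-size argument above can be illustrated with a minimal sketch. It assumes a ripple counts as coincident when the other hemisphere has a ripple within plus or minus the window. The function name `coincidence_fraction` and the timestamps are hypothetical and are not taken from the manuscript or from either dataset.

```python
from bisect import bisect_left


def coincidence_fraction(ref_times, other_times, window_s):
    """Fraction of reference ripples that have at least one ripple in the
    other hemisphere within +/- window_s seconds (assumed definition)."""
    other = sorted(other_times)
    if len(ref_times) == 0 or len(other) == 0:
        return 0.0
    hits = 0
    for t in ref_times:
        # Index of the first contralateral event at or after t.
        i = bisect_left(other, t)
        gaps = []
        if i > 0:
            gaps.append(t - other[i - 1])   # nearest earlier event
        if i < len(other):
            gaps.append(other[i] - t)       # nearest later (or equal) event
        if min(gaps) <= window_s:
            hits += 1
    return hits / len(ref_times)


# Hypothetical ripple times in seconds, for illustration only.
left_ripples = [1.0, 2.0, 3.0]
right_ripples = [1.004, 2.03, 3.2]

windows_s = (0.005, 0.05, 0.1)
fractions = [coincidence_fraction(left_ripples, right_ripples, w)
             for w in windows_s]
print(dict(zip(windows_s, fractions)))
```

      Because every ripple that is coincident at a narrow window is also coincident at any wider one, the returned fraction can only stay flat or grow as the window increases. This is why an intermediate window is informative about where the measure saturates.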

    1. Author response:

      The following is the authors’ response to the original reviews.

      Reviewer #1 (Public Review):

      Lu & Golomb combined EEG, artificial neural networks, and multivariate pattern analyses to examine how different visual variables are processed in the brain. The conclusions of the paper are mostly well supported, but some aspects of methods and data analysis would benefit from clarification and potential extensions.

      The authors find that not only real-world size is represented in the brain (which was known), but both retinal size and real-world depth are represented, at different time points or latencies, which may reflect different stages of processing. Prior work has not been able to answer the question of real-world depth due to the stimuli used. The authors made this possible by assessing real-world depth and testing it with appropriate methodology, accounting for retinal and real-world size. The methodological approach combining behavior, RSA, and ANNs is creative and well thought out to appropriately assess the research questions, and the findings may be very compelling if backed up with some clarifications and further analyses.

      The work will be of interest to experimental and computational vision scientists, as well as the broader computational cognitive neuroscience community as the methodology is of interest and the code is or will be made available. The work is important as it is currently not clear what the correspondence between many deep neural network models and the brain is, and this work pushes our knowledge forward on this front. Furthermore, the availability of methods and data will be useful for the scientific community.

      Reviewer #2 (Public Review):

      Summary:

      This paper aims to test if neural representations of images of objects in the human brain contain a 'pure' dimension of real-world size that is independent of retinal size or perceived depth. To this end, they apply representational similarity analysis on EEG responses in 10 human subjects to a set of 200 images from a publicly available database (THINGS-EEG2), correlating pairwise distinctions in evoked activity between images with pairwise differences in human ratings of real-world size (from THINGS+). By partialling out correlations with metrics of retinal size and perceived depth from the resulting EEG correlation time courses, the paper claims to identify an independent representation of real-world size starting at 170 ms in the EEG signal. Further comparisons with artificial neural networks and language embeddings lead the authors to claim this correlation reflects a relatively 'high-level' and 'stable' neural representation.

      Strengths:

      The paper features insightful illustrations and clear figures.

      The limitations of prior work motivating the current study are clearly explained and seem reasonable (although the rationale for why using 'ecological' stimuli with backgrounds matters when studying real-world size could be made clearer; one could also argue the opposite, that to get a 'pure' representation of the real-world size of an 'object concept', one should actually show objects in isolation).

      The partial correlation analysis convincingly demonstrates how correlations between feature spaces can affect their correlations with EEG responses (and how taking into account these correlations can disentangle them better).

      The RSA analysis and associated statistical methods appear solid.

      Weaknesses:

      The claim of methodological novelty is overblown. Comparing image metrics, behavioral measurements, and ANN activations against EEG using RSA is a commonly used approach to study neural object representations. The dataset size (200 test images from THINGS) is not particularly large, and neither comparing pre-trained DNNs and language models nor using partial correlations is new.

      Thanks for your feedback. We agree that the methods used in our study – such as RSA, partial correlations, and the use of pretrained ANN and language models – are indeed well-established in the literature. We therefore revised the manuscript to more carefully frame our contribution: rather than emphasizing methodological novelty in isolation, we now highlight the combination of techniques, the application to human EEG data with naturalistic images, and the explicit dissociation of real-world size, retinal size, and depth representations as the primary strengths of our approach. Corresponding language in the Abstract, Introduction, and Discussion has been adjusted to reflect this more precise positioning:

      (Abstract, line 34 to 37) “our study combines human EEG and representational similarity analysis to disentangle neural representations of object real-world size from retinal size and perceived depth, leveraging recent datasets and modeling approaches to address challenges not fully resolved in previous work.”

      (Introduction, line 104 to 106) “we overcome these challenges by combining human EEG recordings, naturalistic stimulus images, artificial neural networks, and computational modeling approaches including representational similarity analysis (RSA) and partial correlation analysis …”

      (Introduction, line 108) “We applied our integrated computational approach to an open EEG dataset…”

      (Introduction, line 142 to 143) “The integrated computational approach by cross-modal representational comparisons we take with the current study…”

      (Discussion, line 550 to 552) “our study goes beyond the contributions of prior studies in several key ways, offering both theoretical and methodological advances: …”

      The claims also seem too broad given the fairly small set of RDMs that are used here (3 size metrics, 4 ANN layers, 1 Word2Vec RDM): there are many aspects of object processing not studied here, so it's not correct to say this study provides a 'detailed and clear characterization of the object processing process'.

      Thanks for pointing this out. We softened language in our manuscript to reflect that our findings provide a temporally resolved characterization of selected object features, rather than a comprehensive account of object processing:

      (line 34 to 37) “our study combines human EEG and representational similarity analysis to disentangle neural representations of object real-world size from retinal size and perceived depth, leveraging recent datasets and modeling approaches to address challenges not fully resolved in previous work.”

      (line 46 to 48) “Our research provides a temporally resolved characterization of how certain key object properties – such as object real-world size, depth, and retinal size – are represented in the brain, …”

      The paper lacks an analysis demonstrating the validity of the real-world depth measure, which is here computed from the other two metrics by simply dividing them. The rationale and logic of this metric is not clearly explained. Is it intended to reflect the hypothesized egocentric distance to the object in the image if the person had in fact been 'inside' the image? How do we know this is valid? It would be helpful if the authors provided a validation of this metric.

      We appreciate the comment regarding the real-world depth metric. Specifically, this metric was computed as the ratio of real-world size (obtained via behavioral ratings) to measured retinal size. The rationale behind this computation is grounded in the basic principles of perspective projection: for two objects subtending the same retinal size, the physically larger object is presumed to be farther away. This ratio thus serves as a proxy for perceived egocentric depth under the simplifying assumption of consistent viewing geometry across images.

      We acknowledge that this is a derived estimate and not a direct measurement of perceived depth. While it provides a useful approximation that allows us to analytically dissociate the contributions of real-world size and depth in our RSA framework, we agree that future work would benefit from independent perceptual depth ratings to validate or refine this metric. We added more discussions about this to our revised manuscript:

      (line 652 to 657) “Additionally, we acknowledge that our metric for real-world depth was derived indirectly as the ratio of perceived real-world size to retinal size. While this formulation is grounded in geometric principles of perspective projection and served the purpose of analytically dissociating depth from size in our RSA framework, it remains a proxy rather than a direct measure of perceived egocentric distance. Future work incorporating behavioral or psychophysical depth ratings would be valuable for validating and refining this metric.”
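
      The geometric reasoning behind the depth proxy can be sketched as follows. Under perspective projection, retinal size scales roughly with physical size divided by viewing distance, so the ratio of physical size to retinal size scales with distance. The helper names, the numbers, and the use of absolute differences to build an RDM are assumptions made for illustration, not the authors' code.

```python
def depth_proxy(real_world_size, retinal_size):
    """Ratio of rated real-world size to measured retinal size.

    Larger values mean 'presumed farther away'. Units are arbitrary, so only
    comparisons between images are meaningful.
    """
    if retinal_size <= 0:
        raise ValueError("retinal_size must be positive")
    return real_world_size / retinal_size


def difference_rdm(values):
    """Pairwise absolute-difference matrix (an assumed RDM construction)."""
    return [[abs(a - b) for b in values] for a in values]


# Hypothetical (real-world size, retinal size) pairs for three images.
images = [(100.0, 0.5), (200.0, 0.5), (100.0, 0.25)]
depths = [depth_proxy(size, retinal) for size, retinal in images]
depth_rdm = difference_rdm(depths)

for row in depth_rdm:
    print(row)
```

      In this toy example the second image has the same retinal size as the first but twice the real-world size, so it receives a larger depth value. The third image reaches the same depth value as the second by halving the retinal size instead.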

      Given that there is only 1 image/concept here, the factor of real-world size may be confounded with other things, such as semantic category (e.g. buildings vs. tools). While the comparison of the real-world size metric appears to be effectively disentangled from retinal size and (the author's metric of) depth here, there are still many other object properties that are likely correlated with real-world size and therefore will confound identifying a 'pure' representation of real-world size in EEG. This could be addressed by adding more hypothesis RDMs reflecting different aspects of the images that may correlate with real-world size.

      We thank the reviewer for this thoughtful and important point. We agree that semantic category and real-world size may be correlated, and that semantic structure is one of the plausible sources of variance contributing to real-world size representations. However, we would like to clarify that our original goal was to isolate real-world size from two key physical image features — retinal size and inferred real-world depth — which have been major confounds in prior work on this topic. We acknowledge that although our analysis disentangled real-world size from depth and retinal size, this does not imply a fully “pure” representation; therefore, we now refer to the real-world size representations as “partially disentangled” throughout the manuscript to reflect this nuance.

      Interestingly, after controlling for these physical features, we still found a robust and statistically isolated representation of real-world size in the EEG signal. This motivated the idea that real-world size may be more than a purely perceptual or image-based property: it may be at least partially semantic. Supporting this interpretation, both the late layers of ANN models and the non-visual semantic model (Word2Vec) also captured real-world size structure. Rather than treating semantic information as an unwanted confound, we propose that semantic structure may be an inherent component of how the brain encodes real-world size.

      To directly address your concern, we conducted an additional variance partitioning analysis, in which we decomposed the variance in EEG RDMs explained by four RDMs: real-world depth, retinal size, real-world size, and semantic information (from Word2Vec). Specifically, for each EEG timepoint, we quantified (1) the unique variance of real-world size, after controlling for semantic similarity, depth, and retinal size; (2) the unique variance of semantic information, after controlling for real-world size, depth, and retinal size; (3) the shared variance jointly explained by real-world size and semantic similarity, controlling for depth and retinal size. This analysis revealed that real-world size explained unique variance in EEG even after accounting for semantic similarity. There was also substantial shared variance, indicating partial overlap between semantic structure and size. Semantic information also contributed unique explanatory power, as expected. These results suggest that real-world size is indeed partially semantic in nature, but also has independent neural representation not fully explained by general semantic similarity. This strengthens our conclusion that real-world size functions as a meaningful, higher-level dimension in object representation space.

      We now include this new analysis and a corresponding figure (Figure S9) in the revised manuscript:

      (line 532 to 539) “Second, we conducted a variance partitioning analysis, in which we decomposed the variance in EEG RDMs explained by three hypothesis-based RDMs and the semantic RDM (Word2Vec RDM), and we still found that real-world size explained unique variance in EEG even after accounting for semantic similarity (Figure S9). And we also observed a substantial shared variance jointly explained by real-world size and semantic similarity and a unique variance of semantic information. These results suggest that real-world size is indeed partially semantic in nature, but also has independent neural representation not fully explained by general semantic similarity.”
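
      The three quantities described in this response can be written as differences between R² values of nested regressions. The sketch below is one common way to do this with ordinary least squares on vectorized RDMs. It is not the authors' implementation, whose details (for example rank transforms or per-timepoint looping) are not given here. All variable names and the simulated data are hypothetical.

```python
import numpy as np


def r_squared(y, predictors):
    """R^2 of an ordinary least-squares fit of y on the predictors."""
    y = np.asarray(y, dtype=float)
    columns = [np.ones(y.size)] + [np.asarray(p, dtype=float) for p in predictors]
    X = np.column_stack(columns)
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    residual = y - X @ beta
    return 1.0 - (residual @ residual) / np.sum((y - y.mean()) ** 2)


def partition_size_semantic(eeg, depth, retinal, size, semantic):
    """Unique and shared variance of size and semantics, given depth and retinal size."""
    r2_full = r_squared(eeg, [depth, retinal, size, semantic])
    r2_no_size = r_squared(eeg, [depth, retinal, semantic])
    r2_no_semantic = r_squared(eeg, [depth, retinal, size])
    r2_base = r_squared(eeg, [depth, retinal])
    return {
        "unique_size": r2_full - r2_no_size,
        "unique_semantic": r2_full - r2_no_semantic,
        # What size and semantics explain jointly beyond depth and retinal size.
        "shared": r2_no_size + r2_no_semantic - r2_full - r2_base,
    }


# Simulated vectorized RDMs (upper triangles) for a single EEG timepoint.
rng = np.random.default_rng(0)
n_pairs = 500
depth, retinal, latent = rng.normal(size=(3, n_pairs))
size = latent + 0.5 * rng.normal(size=n_pairs)       # shares 'latent' with semantics
semantic = latent + 0.5 * rng.normal(size=n_pairs)
eeg = size + semantic + 0.3 * depth + rng.normal(size=n_pairs)

parts = partition_size_semantic(eeg, depth, retinal, size, semantic)
print(parts)
```

      By construction, the three returned terms sum to the gain in R² obtained by adding both real-world size and semantics to a model that already contains depth and retinal size. In a time-resolved analysis the same function would be called once per EEG timepoint.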

      The choice of ANNs lacks a clear motivation. Why these two particular networks? Why pick only 2 somewhat arbitrary layers? If the goal is to identify more semantic representations using CLIP, the comparison between CLIP and vision-only ResNet should be done with models trained on the same training datasets (to exclude the effect of training dataset size & quality; cf Wang et al., 2023). This is necessary to substantiate the claims on page 19 which attributed the differences between models in terms of their EEG correlations to one of them being a 'visual model' vs. 'visual-semantic model'.

      We agree that the choice and comparison of models should be better contextualized.

      First, our motivation for selecting ResNet-50 and CLIP ResNet-50 was not to make a definitive comparison between model classes, but rather to include two widely used representatives of their respective categories—one trained purely on visual information (ResNet-50 on ImageNet) and one trained with joint visual and linguistic supervision (CLIP ResNet-50 on image–text pairs). These models are both highly influential and commonly used in computational and cognitive neuroscience, allowing for relevant comparisons with existing work (line 181-187).

      Second, we recognize that limiting the EEG × ANN correlation analyses to only early and late layers may be viewed as insufficiently comprehensive. To address this point, we have computed the EEG correlations with multiple layers in both ResNet and CLIP models (ResNet: ResNet.maxpool, ResNet.layer1, ResNet.layer2, ResNet.layer3, ResNet.layer4, ResNet.avgpool; CLIP: CLIP.visual.avgpool, CLIP.visual.layer1, CLIP.visual.layer2, CLIP.visual.layer3, CLIP.visual.layer4, CLIP.visual.attnpool). The results, now included in Figure S4, show a consistent trend: early layers exhibit higher similarity to early EEG time points, and deeper layers show increased similarity to later EEG stages. We chose to highlight early and late layers in the main text to simplify interpretation.

      Third, we appreciate the reviewer’s point that differences in training datasets (ImageNet vs. CLIP's dataset) may confound any attribution of differences in brain alignment to the models' architectural or learning differences. We agree that the comparisons between models trained on matched datasets (e.g., vision-only vs. multimodal models trained on the same image–text corpus) would allow for more rigorous conclusions. Thus, we explicitly acknowledged this limitation in the text:

      (line 443 to 445) “However, it is also possible that these differences between ResNet and CLIP reflect differences in training data scale and domain.”

      The first part of the claim on page 22 based on Figure 4 'The above results reveal that real-world size emerges with later peak neural latencies and in the later layers of ANNs, regardless of image background information' is not valid since no EEG results for images without backgrounds are shown (only ANNs).

      We revised the sentence to clarify that this is a hypothesis based on the ANN results, not an empirical EEG finding:

      (line 491 to 495) “These results show that real-world size emerges in the later layers of ANNs regardless of image background information, and – based on our prior EEG results – although we could not test object-only images in the EEG data, we hypothesize that a similar temporal profile would be observed in the brain, even for object-only images.”

      While we only had the EEG data of human subjects viewing naturalistic images, the ANN results suggest that real-world size representations may still emerge at later processing stages even in the absence of background, consistent with what we observed in EEG under with-background conditions.

      The paper is likely to impact the field by showcasing how using partial correlations in RSA is useful, rather than providing conclusive evidence regarding neural representations of objects and their sizes.

      Additional context important to consider when interpreting this work:

      Page 20, the authors point out similarities of peak correlations between models ('Interestingly, the peaks of significant time windows for the EEG × HYP RSA also correspond with the peaks of the EEG × ANN RSA timecourse (Figure 3D,F)'. Although not explicitly stated, this seems to imply that they infer from this that the ANN-EEG correlation might be driven by their representation of the hypothesized feature spaces. However this does not follow: in EEG-image metric model comparisons it is very typical to see multiple peaks, for any type of model, this simply reflects specific time points in EEG at which visual inputs (images) yield distinctive EEG amplitudes (perhaps due to stereotypical waves of neural processing?), but one cannot infer the information being processed is the same. To investigate this, one could for example conduct variance partitioning or commonality analysis to see if there is variance at these specific timepoints that is shared by a specific combination of the hypothesis and ANN feature spaces.

      Thanks for your thoughtful observation! Upon reflection, we agree that the sentence – "Interestingly, the peaks of significant time windows for the EEG × HYP RSA also correspond with the peaks of the EEG × ANN RSA timecourse" – was speculative and risked implying a causal link that our data do not warrant. As you rightly point out, observing coincident peak latencies across different models does not necessarily imply shared representational content, given the stereotypical dynamics of evoked EEG responses. We also think that even a variance partitioning analysis would not suffice to infer that ANN-EEG correlations are driven specifically by hypothesized feature spaces. Accordingly, we have removed this sentence from the manuscript to avoid overinterpretation.

      Page 22 mentions 'The significant time-window (90-300ms) of similarity between Word2Vec RDM and EEG RDMs (Figure 5B) contained the significant time-window of EEG x real-world size representational similarity (Figure 3B)'. This is not particularly meaningful given that the Word2Vec correlation is significant for the entire EEG epoch (from the time-point of the signal 'arriving' in visual cortex around ~90 ms) and is thus much less temporally specific than the real-world size EEG correlation. Again a stronger test of whether Word2Vec indeed captures neural representations of real-world size could be to identify EEG time-points at which there are unique Word2Vec correlations that are not explained by either ResNet or CLIP, and see if those timepoints share variance with the real-world size hypothesized RDM.

      We appreciate your insightful comment. Upon reflection, we agree that the sentence – "The significant time-window (90-300ms) of similarity between Word2Vec RDM and EEG RDMs (Figure 5B) contained the significant time-window of EEG x real-world size representational similarity (Figure 3B)" – was speculative, and we have removed it from the manuscript to avoid overinterpretation.

      Additionally, as you suggested, we conducted two analyses, which are reported in the supplement. First, we calculated the partial correlation between EEG RDMs and the Word2Vec RDM while controlling for four ANN RDMs (ResNet early/late and CLIP early/late) (Figure S8). Even after regressing out these ANN-derived features, we observed significant correlations between Word2Vec and EEG RDMs in the 100–190 ms and 250–300 ms time windows. This result suggests that Word2Vec captures semantic structure in the neural signal that is not accounted for by ResNet or CLIP. Second, we conducted an additional variance partitioning analysis, in which we decomposed the variance in EEG RDMs explained by four RDMs: real-world depth, retinal size, real-world size, and semantic information (from Word2Vec) (Figure S9). We found significant shared variance between Word2Vec and real-world size at 130–150 ms and 180–250 ms. These results indicate a partially overlapping representational structure between semantic content and real-world size in the brain.

      We also added these in our revised manuscript:

      (line 525 to 539) “To further probe the relationship between real-world size and semantic information, and to examine whether Word2Vec captures variances in EEG signals beyond that explained by visual models, we conducted two additional analyses. First, we performed a partial correlation between EEG RDMs and the Word2Vec RDM, while regressing out four ANN RDMs (early and late layers of both ResNet and CLIP) (Figure S8). We found that semantic similarity remained significantly correlated with EEG signals across sustained time windows (100-190ms and 250-300ms), indicating that Word2Vec captures neural variance not fully explained by visual or visual-language models. Second, we conducted a variance partitioning analysis, in which we decomposed the variance in EEG RDMs explained by three hypothesis-based RDMs and the semantic RDM (Word2Vec RDM), and we still found that real-world size explained unique variance in EEG even after accounting for semantic similarity (Figure S9). And we also observed a substantial shared variance jointly explained by real-world size and semantic similarity and a unique variance of semantic information. These results suggest that real-world size is indeed partially semantic in nature, but also has independent neural representation not fully explained by general semantic similarity.”
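
      The first of these two analyses, a partial correlation that regresses out model RDMs, can be sketched as a rank-based partial correlation. This assumes Spearman-style ranks followed by linear residualization, which is a common choice in RSA but is not confirmed here as the authors' exact procedure. The function names and simulated vectors are hypothetical, and a single covariate stands in for the four ANN RDMs.

```python
import numpy as np


def average_ranks(values):
    """Ranks starting at 1, with tied values given their average rank."""
    v = np.asarray(values, dtype=float).ravel()
    _, inverse, counts = np.unique(v, return_inverse=True, return_counts=True)
    ends = np.cumsum(counts)
    starts = ends - counts
    mean_rank = (starts + ends + 1) / 2.0
    return mean_rank[inverse]


def partial_spearman(x, y, covariates):
    """Correlation of rank(x) and rank(y) after regressing out ranked covariates."""
    xr, yr = average_ranks(x), average_ranks(y)
    Z = np.column_stack([np.ones(xr.size)] + [average_ranks(c) for c in covariates])

    def residual(v):
        beta, *_ = np.linalg.lstsq(Z, v, rcond=None)
        return v - Z @ beta

    rx, ry = residual(xr), residual(yr)
    return float(rx @ ry / np.sqrt((rx @ rx) * (ry @ ry)))


# Simulated vectorized RDMs in which the EEG-Word2Vec relationship is carried
# entirely by a shared ANN-like component.
rng = np.random.default_rng(0)
n_pairs = 5000
ann = rng.normal(size=n_pairs)
word2vec = ann + rng.normal(size=n_pairs)
eeg = ann + rng.normal(size=n_pairs)

raw = partial_spearman(eeg, word2vec, [])            # plain Spearman correlation
controlled = partial_spearman(eeg, word2vec, [ann])  # ANN component removed
print(round(raw, 3), round(controlled, 3))
```

      In this simulation the raw correlation is clearly positive while the controlled one is close to zero, because nothing links the two vectors beyond the covariate. A correlation that survives the control, as reported for Word2Vec above, therefore points to structure the covariates do not capture. Several covariates can be passed in the list at once.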

      Reviewer #3 (Public Review):

      The authors used an open EEG dataset of observers viewing real-world objects. Each object had a real-world size value (from human rankings), a retinal size value (measured from each image), and a scene depth value (inferred from the above). The authors combined the EEG and object measurements with extant, pre-trained models (a deep convolutional neural network, a multimodal ANN, and Word2vec) to assess the time course of processing object size (retinal and real-world) and depth. They found that depth was processed first, followed by retinal size, and then real-world size. The depth time course roughly corresponded to the visual ANNs, while the real-world size time course roughly corresponded to the more semantic models.

      The time course result for the three object attributes is very clear and a novel contribution to the literature. However, the motivations for the ANNs could be better developed, the manuscript could better link to existing theories and literature, and the ANN analysis could be modernized. I have some suggestions for improving specific methods.

      (1) Manuscript motivations

      The authors motivate the paper in several places by asking "whether biological and artificial systems represent object real-world size". This seems odd for a couple of reasons. First, the brain must represent real-world size somehow, given that we can reason about this question. Second, given the large behavioral and fMRI literature on the topic, combined with the growing ANN literature, this seems like a foregone conclusion and undermines the novelty of this contribution.

      Thanks for your helpful comment. We agree that asking whether the brain represents real-world size is not a novel question, given the existing behavioral and neuroimaging evidence supporting this. Our intended focus was not on the existence of real-world size representations per se, but on the nature of these representations, particularly the relationship between the temporal dynamics and potential mechanisms of representations of real-world size versus other related perceptual properties (e.g., retinal size and real-world depth). We revised the relevant sentence to better reflect our focus, shifting from a binary framing (“whether or not size is represented”) to a more mechanistic and time-resolved inquiry (“how and when such representations emerge”):

      (line 144 to 149) “Unraveling the internal representations of object size and depth features in both human brains and ANNs enables us to investigate how distinct spatial properties—retinal size, real-world depth, and real-world size—are encoded across systems, and to uncover the representational mechanisms and temporal dynamics through which real-world size emerges as a potentially higher-level, semantically grounded feature.”

      While the introduction further promises to "also investigate possible mechanisms of object real-world size representations.", I was left wishing for more in this department. The authors report correlations between neural activity and object attributes, as well as between neural activity and ANNs. It would be nice to link the results to theories of object processing (e.g., a feedforward sweep, such as DiCarlo and colleagues have suggested, versus a reverse hierarchy, such as suggested by Hochstein, among others). What is semantic about real-world size, and where might this information come from? (Although you may have to expand beyond the posterior electrodes to do this analysis).

      We thank the reviewer for this insightful comment. We agree that understanding the mechanisms underlying real-world size representations is a critical question. While our current study does not directly test specific theoretical frameworks such as the feedforward sweep model or the reverse hierarchy theory, our results do offer several relevant insights. The temporal dynamics revealed by EEG, in which real-world size emerges later than retinal size and depth, suggest that such representations likely arise beyond early visual feedforward stages, potentially involving higher-level semantic processing. This interpretation is further supported by the fact that real-world size is strongly captured by late layers of ANNs and by a purely semantic model (Word2Vec), suggesting its dependence on learned conceptual knowledge.

      While we acknowledge that our analyses were limited to posterior electrodes and thus cannot directly localize the cortical sources of these effects, we view this work as a first step toward bridging low-level perceptual features and higher-level semantic representations. We hope future work combining broader spatial sampling (e.g., anterior EEG sensors or source localization) and multimodal recordings (e.g., MEG, fMRI) can build on these findings to directly test competing models of object processing and representation hierarchy.

      We also added these to the Discussion section:

      (line 619 to 638) “Although our study does not directly test specific models of visual object processing, the observed temporal dynamics provide important constraints for theoretical interpretations. In particular, we find that real-world size representations emerge significantly later than low-level visual features such as retinal size and depth. This temporal profile is difficult to reconcile with a purely feedforward account of visual processing (e.g., DiCarlo et al., 2012), which posits that object properties are rapidly computed in a sequential hierarchy of increasingly complex visual features. Instead, our results are more consistent with frameworks that emphasize recurrent or top-down processing, such as the reverse hierarchy theory (Hochstein & Ahissar, 2002), which suggests that high-level conceptual information may emerge later and involve feedback to earlier visual areas. This interpretation is further supported by representational similarities with late-stage artificial neural network layers and with a semantic word embedding model (Word2Vec), both of which reflect learned, abstract knowledge rather than low-level visual features. Taken together, these findings suggest that real-world size is not merely a perceptual attribute, but one that draws on conceptual or semantic-level representations acquired through experience. While our EEG analyses focused on posterior electrodes and thus cannot definitively localize cortical sources, we see this study as a step toward linking low-level visual input with higher-level semantic knowledge. Future work incorporating broader spatial coverage (e.g., anterior sensors), source localization, or complementary modalities such as MEG and fMRI will be critical to adjudicate between alternative models of object representation and to more precisely trace the origin and flow of real-world size information in the brain.”

      Finally, several places in the manuscript tout the "novel computational approach". This seems odd because the computational framework and pipeline have been the most common approach in cognitive computational neuroscience in the past 5-10 years.

      We have revised relevant statements throughout the manuscript to avoid overstating novelty and to better reflect the contribution of our study.

      (2) Suggestion: modernize the approach

      I was surprised that the computational models used in this manuscript were all 8-10 years old. Specifically, because there are now deep nets that more explicitly model the human brain (e.g., Cornet) as well as more sophisticated models of semantics (e.g., LLMs), I was left hoping that the authors had used more state-of-the-art models in the work. Moreover, the use of a single dCNN, a single multi-modal model, and a single word embedding model makes it difficult to generalize about visual, multimodal, and semantic features in general.

      Thanks for your suggestion. Indeed, our choice of ResNet and CLIP was motivated by their widespread use in cognitive and computational neuroscience. These models have served as standard benchmarks in many studies exploring correspondence between ANNs and human brain activity. To address your concern, we have now added results from a more biologically inspired model, CORnet, in the supplementary materials (Figure S10). The results for CORnet show similar patterns to those observed for ResNet and CLIP, providing converging evidence across models.

      Regarding semantic modeling, we intentionally chose Word2Vec rather than large language models (LLMs), because our goal was to examine concept-level, context-free semantic representations. Word2Vec remains the most widely adopted approach for obtaining non-contextualized embeddings that reflect core conceptual similarity, as opposed to the context-dependent embeddings produced by LLMs, which are less directly suited for capturing stable concept-level structure across stimuli.

      (3) Methodological considerations

      (a) Validity of the real-world size measurement

      I was concerned about a few aspects of the real-world size rankings. First, I am trying to understand why the scale goes from 100-519. This seems very arbitrary; please clarify. Second, are we to assume that this scale is linear? Is this appropriate when real-world object size is best expressed on a log scale? Third, the authors provide "sand" as an example of the smallest real-world object. This is tricky because sand is more "stuff" than "thing", so I imagine it leaves observers wondering whether the experimenter intends a grain of sand or a sandy scene region. What is the variability in real-world size ratings? Might the variability also provide additional insights in this experiment?

      We now clarify the origin, scaling, and interpretation of the real-world size values obtained from the THINGS+ dataset.

      In their experiment, participants first rated the size of a single object concept (word shown on the screen) by clicking on a continuous slider of 520 units, which was anchored by nine familiar real-world reference objects (e.g., “grain of sand,” “microwave oven,” “aircraft carrier”) that spanned the full expected size range on a logarithmic scale. Importantly, participants were not shown any numerical values on the scale—they were guided purely by the semantic meaning and relative size of the anchor objects. After the initial response, the scale zoomed in around the selected region (covering 160 units of the 520-point scale) and presented finer anchor points between the previous reference objects. Participants then refined their rating by dragging from the lower to upper end of the typical size range for that object. If the object was standardized in size (e.g., “soccer ball”), a single click sufficed. These size judgments were collected across at least 50 participants per object, and final scores were derived from the central tendency of these responses. Although the final size values numerically range from 0 to 519 (after scaling), this range is not known to participants and is only applied post hoc to construct the size RDMs.

      Regarding the term “sand”: the THINGS+ dataset distinguished between object meanings when ambiguity was present. For “sand,” participants were instructed to treat it as “a grain of sand”— consistent with the intended meaning of a discrete, minimal-size reference object. 

      Finally, we acknowledge that real-world size ratings may carry some degree of variability across individuals. However, the dataset includes ratings from 2010 participants across 1854 object concepts, with each object receiving at least 50 independent ratings. Given this large and diverse sample, the mean size estimates are expected to be stable and robust across subjects. While we did not include variability metrics in our main analysis, we believe the aggregated ratings provide a reliable estimate of perceived real-world size.

      We added these details in the Materials and Method section:

      (line 219 to 230) “In the THINGS+ dataset, 2010 participants (different from the subjects in THINGS EEG2) did an online size rating task and completed a total of 13024 trials corresponding to 1854 object concepts using a two-step procedure. In their experiment, first, each object was rated on a 520-unit continuous slider anchored by familiar reference objects (e.g., “grain of sand,” “microwave oven,” “aircraft carrier”) representing a logarithmic size range. Participants were not shown numerical values but used semantic anchors as guides. In the second step, the scale zoomed in around the selected region to allow for finer-grained refinement of the size judgment. Final size values were derived from aggregated behavioral data and rescaled to a range of 0–519 for consistency across objects, with the actual mean ratings across subjects ranging from 100.03 (‘grain of sand’) to 423.09 (‘subway’).”

      (b) This work has no noise ceiling to establish how strong the model fits are, relative to the intrinsic noise of the data. I strongly suggest that these are included.

      We have now computed noise ceiling estimates for the EEG RDMs across time. The noise ceiling was calculated by correlating each participant’s EEG RDM with the average EEG RDM across the remaining participants (leave-one-subject-out), at each time point. This provides an upper-bound estimate of the explainable variance, reflecting the maximum similarity that any model—no matter how complex—could potentially achieve, given the intrinsic variability in the EEG data.

      Importantly, the observed EEG–model similarity values are substantially below this upper bound. This outcome is fully expected: Each of our model RDMs (e.g., real-world size, ANN layers) captures only a specific aspect of the neural representational structure, rather than attempting to account for the totality of the EEG signal. Our goal is not to optimize model performance or maximize fit, but to probe which components of object information are reflected in the spatiotemporal dynamics of the brain’s responses.

      For clarity and accessibility of the main findings, we present the noise ceiling time courses separately in the supplementary materials (Figure S7). Including them directly in the EEG × HYP or EEG × ANN plots would conflate distinct interpretive goals: the model RDMs are hypothesis-driven probes of specific representational content, whereas the noise ceiling offers a normative upper bound for total explainable variance. Keeping these separate ensures each visualization remains focused and interpretable. 
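
The leave-one-subject-out noise ceiling described in this response can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' code: the function names, the array layout (subjects x time points x conditions x conditions), and the use of Pearson correlation on the vectorized upper triangle are all choices made for the example, since the response does not specify them.

```python
import numpy as np

def upper_triangle(rdm):
    """Vectorize the off-diagonal upper triangle of a square RDM."""
    i, j = np.triu_indices(rdm.shape[0], k=1)
    return rdm[i, j]

def leave_one_out_noise_ceiling(rdms):
    """Correlate each subject's RDM with the mean RDM of the other subjects.

    rdms: array of shape (n_subjects, n_times, n_cond, n_cond).
    Returns an array of shape (n_subjects, n_times).
    """
    n_sub, n_times = rdms.shape[:2]
    total = rdms.sum(axis=0)
    ceiling = np.zeros((n_sub, n_times))
    for s in range(n_sub):
        # Mean RDM over all subjects except s.
        others_mean = (total - rdms[s]) / (n_sub - 1)
        for t in range(n_times):
            a = upper_triangle(rdms[s, t])
            b = upper_triangle(others_mean[t])
            # Pearson correlation is an assumption of this sketch.
            ceiling[s, t] = np.corrcoef(a, b)[0, 1]
    return ceiling

# Toy data: a shared representational structure plus subject-specific noise.
rng = np.random.default_rng(0)
shared = rng.random((5, 10, 10))                     # 5 time points, 10 conditions
rdms = shared[None] + 0.1 * rng.standard_normal((4, 5, 10, 10))
print(leave_one_out_noise_ceiling(rdms).mean(axis=0))  # group-average ceiling per time point
```

Averaging the per-subject values at each time point gives one ceiling time course, which can then be compared with the much smaller EEG–model correlations.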

      Reviewer #1 (Recommendations For The Authors):

      Some analyses are incomplete, which would be improved if the authors showed analyses with other layers of the networks and various additional partial correlation analyses.

      Clarity

      (1) Partial correlations methods incomplete - it is not clear what is being partialled out in each analysis. It is possible to guess sometimes, but it is not entirely clear for each analysis. This is important as it is difficult to assess if the partial correlations are sensible/correct in each case. Also, the Figure 1 caption is short and unclear.

      For example, ANN-EEG partial correlations - "Finally, we directly compared the timepoint-by-timepoint EEG neural RDMs and the ANN RDMs (Figure 3F). The early layer representations of both ResNet and CLIP were significantly correlated with early representations in the human brain" What is being partialled out? Figure 3F says partial correlation.

      We apologize for the confusion. We made several key clarifications and corrections in the revised version.

      First, we identified and corrected a labeling error in both Figure 1 and Figure 3F. Specifically, our EEG × ANN analysis used Spearman correlation, not partial correlation as mistakenly indicated in the original figure label and text. We conducted partial correlations for EEG × HYP and ANN × HYP. For EEG × ANN, however, we directly calculated the correlation between the EEG RDMs and the ANN RDM of each layer. We corrected these errors: (1) In Figure 1, we removed the erroneous “partial” label from the EEG × ANN path and updated the caption to clearly outline which comparisons used partial correlation. (2) In Figure 3F, we corrected the Y-axis label to “(correlation)”.

      Second, to improve clarity, we have now revised the Materials and Methods section to explicitly describe what is partialled out in each partial correlation analysis:

      (line 284 to 286) “In EEG × HYP partial correlation (Figure 3D), we correlated EEG RDMs with one hypothesis-based RDM (e.g., real-world size), while controlling for the other two (retinal size and real-world depth).”

      (line 303 to 305) “In ANN (or W2V) × HYP partial correlation (Figure 3E and Figure 5A), we correlated ANN (or W2V) RDMs with one hypothesis-based RDM (e.g., real-world size), while partialling out the other two.”

      Finally, the caption of Figure 1 has been expanded to clarify the full analysis pipeline and explicitly specify the partial correlation or correlation in each comparison.

      (line 327 to 332) “Figure 1 Overview of our analysis pipeline including constructing three types of RDMs and conducting comparisons between them. We computed RDMs from three sources: neural data (EEG), hypothesized object features (real-world size, retinal size, and real-world depth), and artificial models (ResNet, CLIP, and Word2Vec). Then we conducted cross-modal representational similarity analyses between: EEG × HYP (partial correlation, controlling for other two HYP features), ANN (or W2V) × HYP (partial correlation, controlling for other two HYP features), and EEG × ANN (correlation).”

      We believe these revisions now make all analytic comparisons and correlation types fully clear and interpretable.
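
To make concrete what "controlling for the other two" means, the EEG × HYP partial correlation can be sketched as below. This is a hedged illustration rather than the authors' implementation: it assumes a Spearman-type partial correlation (rank-transform every vectorized RDM, regress the covariate ranks out of both variables, then correlate the residuals), and the function names and toy data are hypothetical.

```python
import numpy as np

def average_ranks(x):
    """Rank-transform x; tied values share their mean rank (as in Spearman)."""
    x = np.asarray(x, dtype=float).ravel()
    _, inverse, counts = np.unique(x, return_inverse=True, return_counts=True)
    last_rank = np.cumsum(counts)                 # highest rank in each tie group
    mean_rank = last_rank - (counts - 1) / 2.0    # shift to the group mean
    return mean_rank[inverse]

def residualize(v, covariates):
    """Remove the least-squares fit of the covariates (plus intercept) from v."""
    design = np.column_stack([np.ones(v.size)] + list(covariates))
    beta, *_ = np.linalg.lstsq(design, v, rcond=None)
    return v - design @ beta

def partial_spearman(x, y, covariates):
    """Spearman partial correlation of x and y, controlling for the covariates."""
    rx, ry = average_ranks(x), average_ranks(y)
    rz = [average_ranks(c) for c in covariates]
    return float(np.corrcoef(residualize(rx, rz), residualize(ry, rz))[0, 1])

# Toy vectors standing in for vectorized RDMs (200 images -> 19,900 pairs).
rng = np.random.default_rng(0)
n_pairs = 19900
retinal_size = rng.random(n_pairs)
depth = rng.random(n_pairs)
real_size = 0.5 * depth + rng.random(n_pairs)       # correlated with depth
eeg = real_size + retinal_size + rng.random(n_pairs)

# EEG x real-world size, partialling out retinal size and real-world depth.
print(partial_spearman(eeg, real_size, [retinal_size, depth]))
```

The same function covers the ANN (or W2V) × HYP case by passing a model RDM vector in place of `eeg`; for the EEG × ANN comparison, which uses plain correlation, the covariate list would simply be empty.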

      Issues / open questions

      (2) Semantic representations vs hypothesized (hyp) RDMs (real-world size, etc) - are the representations explained by variables in hyp RDMs or are there semantic representations over and above these? E.g., For ANN correlation with the brain, you could partial out hyp RDMs - and assess whether there is still semantic information left over, or is the variance explained by the hyp RDMs?

      Thank you for this suggestion. As you suggested, we conducted the partial correlation analysis between EEG RDMs and ANN RDMs, controlling for the three hypothesis-based RDMs. The results (Figure S6) revealed that the EEG×ANN representational similarity remained largely unchanged, indicating that ANN representations capture substantial additional representational structure not accounted for by the current hypothesized features. This is also consistent with the observation that EEG×HYP partial correlations were themselves small, whereas EEG×ANN correlations were much greater.

      We also added this statement to the main text:

      (line 446 to 451) “To contextualize how much of the shared variance between EEG and ANN representations is driven by the specific visual object features we tested above, we conducted a partial correlation analysis between EEG RDMs and ANN RDMs controlling for the three hypothesis-based RDMs (Figure S6). The EEG×ANN similarity results remained largely unchanged, suggesting that ANN representations capture rich additional representational structure beyond these features.”

      (3) Why only early and late layers? I can see how it's clearer to present the EEG results. However, the many layers in these networks are an opportunity - we can see how simple/complex linear/non-linear the transformation is over layers in these models. It would be very interesting and informative to see if the correlations do in fact linearly increase from early to later layers, or if the story is a bit more complex. If not in the main text, then at least in the supplement.

      Thank you for the thoughtful suggestion. To address this point, we have computed the EEG correlations with multiple layers in both ResNet and CLIP models (ResNet: ResNet.maxpool, ResNet.layer1, ResNet.layer2, ResNet.layer3, ResNet.layer4, ResNet.avgpool; CLIP: CLIP.visual.avgpool, CLIP.visual.layer1, CLIP.visual.layer2, CLIP.visual.layer3, CLIP.visual.layer4, CLIP.visual.attnpool). The results, now included in Figures S4 and S5, show a consistent trend: early layers exhibit higher similarity to early EEG time points, and deeper layers show increased similarity to later EEG stages. We chose to highlight early and late layers in the main text to simplify interpretation, but now provide the full layerwise profile for completeness.

      (4) Peak latency analysis - Estimating peaks per ppt is presumably noisy, so it seems important to show how reliable this is. One option is to find the bootstrapped mean latencies per subject.

      Thanks for your suggestion. To estimate the robustness of peak latency values, we implemented a bootstrap procedure by resampling the pairwise entries of the EEG RDM with replacement. For each bootstrap sample, we computed a new EEG RDM and recalculated the partial correlation time course with the hypothesis RDMs. We then extracted the peak latency within the predefined significant time window. Repeating this process 1000 times gave us a bootstrapped mean latency per subject, which we used as a more stable peak latency estimate. Notably, the bootstrapped results showed minimal deviation from the original latency estimates, confirming the robustness of our findings. Accordingly, we updated Figure 3D and added the following to the Materials and Methods section:

      (line 289 to 298) “To assess the stability of peak latency estimates for each subject, we performed a bootstrap procedure across stimulus pairs. At each time point, the EEG RDM was vectorized by extracting the lower triangle (excluding the diagonal), resulting in 19,900 unique pairwise values. For each bootstrap sample, we resampled these 19,900 pairwise entries with replacement to generate a new pseudo-RDM of the same size. We then computed the partial correlation between the EEG pseudo-RDM and a given hypothesis RDM (e.g., real-world size), controlling for other feature RDMs, and obtained a time course of partial correlations. Repeating this procedure 1000 times and extracting the peak latency within the significant time window yielded a distribution of bootstrapped latencies, from which we got the bootstrapped mean latencies per subject.”
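
The bootstrap over stimulus pairs can be sketched as below. This is an illustrative reconstruction under assumptions, not the authors' code: it applies the same resampled pair indices to the EEG, target, and covariate RDM vectors so that entries stay paired, and it uses a Pearson-type partial correlation on the raw values (rank-transform the inputs first for a Spearman variant). All function and variable names are hypothetical.

```python
import numpy as np

def partial_corr_timecourse(eeg, target, covs):
    """Partial correlation of each EEG time point with target, given covs.

    eeg: (n_times, n_pairs); target: (n_pairs,); covs: (n_cov, n_pairs).
    """
    design = np.column_stack([np.ones(target.size), covs.T])
    stacked = np.column_stack([target, eeg.T])        # (n_pairs, 1 + n_times)
    beta, *_ = np.linalg.lstsq(design, stacked, rcond=None)
    resid = stacked - design @ beta                   # zero-mean residuals
    resid /= np.linalg.norm(resid, axis=0)
    return resid[:, 1:].T @ resid[:, 0]               # (n_times,)

def bootstrap_peak_latency(eeg, target, covs, times, window, n_boot=1000, seed=0):
    """Return the bootstrapped mean peak latency and all bootstrap peaks."""
    rng = np.random.default_rng(seed)
    in_window = (times >= window[0]) & (times <= window[1])
    window_times = times[in_window]
    n_pairs = target.size
    peaks = np.empty(n_boot)
    for b in range(n_boot):
        # Resample stimulus pairs with replacement (same indices everywhere).
        idx = rng.integers(0, n_pairs, size=n_pairs)
        r = partial_corr_timecourse(eeg[:, idx], target[idx], covs[:, idx])
        peaks[b] = window_times[np.argmax(r[in_window])]
    return peaks.mean(), peaks

# Toy data: the target's contribution to the "EEG" RDM peaks at 0.2 s.
rng = np.random.default_rng(1)
times = np.arange(50) * 0.01
target = rng.standard_normal(300)
covs = rng.standard_normal((2, 300))
gain = np.exp(-((times - 0.2) / 0.05) ** 2)
eeg = gain[:, None] * target[None, :] + 0.5 * rng.standard_normal((50, 300))
mean_peak, peaks = bootstrap_peak_latency(eeg, target, covs, times, (0.1, 0.4), n_boot=200)
print(mean_peak)
```

With real data, `eeg` would hold one subject's 19,900-entry RDM vector per time point, and the spread of `peaks` gives a direct view of how stable that subject's latency estimate is.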

      (5) "Due to our calculations being at the object level, if there were more than one of the same objects in an image, we cropped the most complete one to get a more accurate retinal size. " Did EEG experimenters make sure everyone sat the same distance from the screen? and remain the same distance? This would also affect real-world depth measures.

      Yes, the EEG dataset we used (THINGS EEG2; Gifford et al., 2022) was collected under carefully controlled experimental conditions. We have confirmed that all participants were seated at a fixed distance of 0.6 meters from the screen throughout the experiment. We also added this information to the Methods (line 156 to 157).

      Minor issues/questions - note that these are not raised in the Public Review

      (6) Title - less about rigor/quality of the work but I feel like the title could be improved/extended. The work tells us not only about real object size, but also retinal size and depth. In fact, isn't the most novel part of this the real-world depth aspect? Furthermore, it feels like the current title restricts its relevance and impact... Also doesn't touch on the temporal aspect, or processing stages, which is also very interesting. There may be something better, but simply adding something like "...disentangled features of real-world size, depth, and retinal size over time OR processing stages".

      Thanks for your suggestion! We changed our title to “Human EEG and artificial neural networks reveal disentangled representations and processing timelines of object real-world size and depth in natural images”.

      (7) "Each subject viewed 16740 images of objects on a natural background for 1854 object concepts from the THINGS dataset (Hebart et al., 2019). For the current study, we used the 'test' dataset portion, which includes 16000 trials per subject corresponding to 200 images." Why test images? Worth explaining.

      We chose to use the “test set” of the THINGS EEG2 dataset for the following two reasons:

      (1) Higher trial count per condition: In the test set, each of the 200 object images was presented 80 times per subject, whereas in the training set, each image was shown only 4 times. This much higher trial count per condition in the test set allows for a substantially higher signal-to-noise ratio in the EEG data.

      (2) Improved decoding reliability: Our analysis relies on constructing EEG RDMs based on pairwise decoding accuracy using linear SVM classifiers. Reliable decoding estimates require a sufficient number of trials per condition. The test set design is thus better suited to support high-fidelity decoding and robust representational similarity analysis.

      We also added these explanations to our revised manuscript (line 161 to 164).

      (8) "For Real-World Size RDM, we obtained human behavioral real-world size ratings of each object concept from the THINGS+ dataset (Stoinski et al., 2022).... The range of possible size ratings was from 0 to 519 in their online size rating task..." How were the ratings made? What is this scale - do people know the numbers? Was it on a continuous slider?

      We should clarify how the real-world size values were obtained from the THINGS+ dataset.

      In their experiment, participants first rated the size of a single object concept (word shown on the screen) by clicking on a continuous slider of 520 units, which was anchored by nine familiar real-world reference objects (e.g., “grain of sand,” “microwave oven,” “aircraft carrier”) that spanned the full expected size range on a logarithmic scale. Importantly, participants were not shown any numerical values on the scale—they were guided purely by the semantic meaning and relative size of the anchor objects. After the initial response, the scale zoomed in around the selected region (covering 160 units of the 520-point scale) and presented finer anchor points between the previous reference objects. Participants then refined their rating by dragging from the lower to upper end of the typical size range for that object. If the object was standardized in size (e.g., “soccer ball”), a single click sufficed. These size judgments were collected across at least 50 participants per object, and final scores were derived from the central tendency of these responses. Although the final size values numerically range from 0 to 519 (after scaling), this range is not known to participants and is only applied post hoc to construct the size RDMs.

      We added these details in the Materials and Method section:

      (line 219 to 230) “In the THINGS+ dataset, 2010 participants (different from the subjects in THINGS EEG2) did an online size rating task and completed a total of 13024 trials corresponding to 1854 object concepts using a two-step procedure. In their experiment, first, each object was rated on a 520-unit continuous slider anchored by familiar reference objects (e.g., “grain of sand,” “microwave oven,” “aircraft carrier”) representing a logarithmic size range. Participants were not shown numerical values but used semantic anchors as guides. In the second step, the scale zoomed in around the selected region to allow for finer-grained refinement of the size judgment. Final size values were derived from aggregated behavioral data and rescaled to a range of 0–519 for consistency across objects, with the actual mean ratings across subjects ranging from 100.03 (‘grain of sand’) to 423.09 (‘subway’).”

      (9) "For Retinal Size RDM, we applied Adobe Photoshop (Adobe Inc., 2019) to crop objects corresponding to object labels from images manually... " Was this by one person? Worth noting, and worth sharing these values per image if not already for other researchers as it could be a valuable resource (and increase citations).

      Yes, all object cropping was performed consistently by one of the authors to ensure uniformity across images. We agree that this dataset could be a useful resource to the community. We have now made the cropped object images publicly available at https://github.com/ZitongLu1996/RWsize.

      We also updated the manuscript accordingly to note this (line 236 to 239).

      (10) "Neural RDMs. From the EEG signal, we constructed timepoint-by-timepoint neural RDMs for each subject with decoding accuracy as the dissimilarity index " Decoding accuracy is presumably a similarity index. Maybe 1-accuracy (proportion correct) for dissimilarity?

      Decoding accuracy is a dissimilarity index instead of a similarity index, as higher decoding accuracy between two conditions indicates that they are more distinguishable – i.e., less similar – in the neural response space. This approach aligns with prior work using classification-based representational dissimilarity measures (Grootswagers et al., 2017; Xie et al., 2020), where better decoding implies greater dissimilarity between conditions. Therefore, there is no need to invert the decoding accuracy values (e.g., using 1 - accuracy).

      Grootswagers, T., Wardle, S. G., & Carlson, T. A. (2017). Decoding dynamic brain patterns from evoked responses: A tutorial on multivariate pattern analysis applied to time series neuroimaging data. Journal of Cognitive Neuroscience, 29(4), 677-697.

      Xie, S., Kaiser, D., & Cichy, R. M. (2020). Visual imagery and perception share neural representations in the alpha frequency band. Current Biology, 30(13), 2621-2627.
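
As a toy illustration of why pairwise decoding accuracy behaves as a dissimilarity measure, the sketch below fills an RDM with cross-validated accuracies at a single time point. Note the substitution: the manuscript uses linear SVM classifiers, whereas this dependency-free example uses a nearest-class-mean classifier as a stand-in. The function names, the fold scheme, and the synthetic data are assumptions made for the example.

```python
import numpy as np

def pairwise_accuracy(a, b, n_folds=4):
    """Cross-validated accuracy for two conditions; a, b: (n_trials, n_features)."""
    folds_a = np.array_split(np.arange(len(a)), n_folds)
    folds_b = np.array_split(np.arange(len(b)), n_folds)
    correct = total = 0
    for fa, fb in zip(folds_a, folds_b):
        # Class means estimated from the training trials only.
        mean_a = np.delete(a, fa, axis=0).mean(axis=0)
        mean_b = np.delete(b, fb, axis=0).mean(axis=0)
        for test, is_a in ((a[fa], True), (b[fb], False)):
            closer_to_a = (np.linalg.norm(test - mean_a, axis=1)
                           < np.linalg.norm(test - mean_b, axis=1))
            correct += np.sum(closer_to_a == is_a)
            total += len(test)
    return correct / total

def decoding_rdm(data):
    """data: (n_cond, n_trials, n_features) -> symmetric (n_cond, n_cond) RDM."""
    n_cond = data.shape[0]
    rdm = np.zeros((n_cond, n_cond))
    for i in range(n_cond):
        for j in range(i + 1, n_cond):
            rdm[i, j] = rdm[j, i] = pairwise_accuracy(data[i], data[j])
    return rdm

# Conditions 0 and 1 share a response pattern; condition 2 is very different.
rng = np.random.default_rng(0)
means = np.array([0.0, 0.0, 3.0])
data = means[:, None, None] + rng.standard_normal((3, 40, 10))
print(decoding_rdm(data))
```

In this toy case the indistinguishable pair decodes near chance (about 0.5) while the distinct pair decodes near ceiling, so larger entries mark more dissimilar conditions and no `1 - accuracy` inversion is needed.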

      (11) Figure 1 caption is very short - Could do with a more complete caption. Unclear what the partial correlations are (what is being partialled out in each case), what are the comparisons "between them" - both in the figure and the caption. Details should at least be in the main text.

      Related to your comment (1). We revised the caption and the corresponding text.

      Reviewer #2 (Recommendations For The Authors):

      (1) Intro:

      Quek et al., (2023) is referred to as a behavioral study, but it has EEG analyses.

      We corrected this – “…, one recent study (Quek et al., 2023) …”

      The phrase 'high temporal resolution EEG' is a bit strange - isn't all EEG high temporal resolution? Especially when down-sampling to 100 Hz (40 time points/epoch) this does not qualify as particularly high-res.

      We removed this phrasing in our manuscript.

      (2) Methods:

      It would be good to provide more details on the EEG preprocessing. Were the data low-pass filtered, for example?

      We added more details to the manuscript:

      (line 167 to 174) “The EEG data were originally sampled at 1000 Hz and online-filtered between 0.1 Hz and 100 Hz during acquisition, with recordings referenced to the Fz electrode. For preprocessing, no additional filtering was applied. Baseline correction was performed by subtracting the mean signal during the 100 ms pre-stimulus interval from each trial and channel separately. We used already preprocessed data from 17 channels with labels beginning with “O” or “P” (O1, Oz, O2, PO7, PO3, POz, PO4, PO8, P7, P5, P3, P1, Pz, P2), ensuring full coverage of posterior regions typically involved in visual object processing. The epoched data were then down-sampled to 100 Hz.”
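
The baseline correction and down-sampling steps quoted above can be illustrated with a short sketch. It is a simplified approximation, not the original preprocessing pipeline: the epoch window (-200 to 800 ms), the function names, and the use of plain decimation (keeping every tenth sample, with no additional anti-aliasing filter) are assumptions of this example.

```python
import numpy as np

def baseline_correct(epochs, times_ms, baseline_ms=(-100, 0)):
    """Subtract each trial's and channel's mean over the pre-stimulus baseline.

    epochs: (n_trials, n_channels, n_samples); times_ms: (n_samples,) in ms.
    """
    mask = (times_ms >= baseline_ms[0]) & (times_ms < baseline_ms[1])
    return epochs - epochs[..., mask].mean(axis=-1, keepdims=True)

def decimate(epochs, times_ms, factor=10):
    """Keep every `factor`-th sample (1000 Hz -> 100 Hz when factor is 10)."""
    return epochs[..., ::factor], times_ms[::factor]

# Toy data: 5 trials, 17 channels, 1 s epochs sampled at 1000 Hz.
rng = np.random.default_rng(0)
times_ms = np.arange(-200, 800)
epochs = rng.standard_normal((5, 17, times_ms.size)) + 3.0   # constant offset
corrected = baseline_correct(epochs, times_ms)
downsampled, times_ds = decimate(corrected, times_ms)
print(downsampled.shape, times_ds[:3])
```

Because the baseline mean is computed separately for every trial and channel, any constant offset in a given recording is removed before the epochs are compared across conditions.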

      It is important to provide more motivation about the specific ANN layers chosen. Were these layers cherry-picked, or did they truly represent a gradual shift over the course of layers?

      We appreciate the reviewer’s concern and fully agree that it is important to ensure transparency in how ANN layers were selected. The early and late layers reported in the main text were not cherry-picked to maximize effects, but rather intended to serve as illustrative examples representing the lower and higher ends of the network hierarchy. To address this point directly, we have computed the EEG correlations with multiple layers in both ResNet and CLIP models (ResNet: ResNet.maxpool, ResNet.layer1, ResNet.layer2, ResNet.layer3, ResNet.layer4, ResNet.avgpool; CLIP: CLIP.visual.avgpool, CLIP.visual.layer1, CLIP.visual.layer2, CLIP.visual.layer3, CLIP.visual.layer4, CLIP.visual.attnpool). The results, now included in Figure S4, show a consistent trend: early layers exhibit higher similarity to early EEG time points, and deeper layers show increased similarity to later EEG stages.

      It is important to provide more specific information about the specific ANN layers chosen. 'Second convolutional layer': is this block 2, the ReLu layer, the maxpool layer? What is the 'last visual layer'?

      We apologize for the confusion. We added more details about the layers chosen:

      (line 255 to 257) “The early layer in ResNet refers to ResNet.maxpool layer, and the late layer in ResNet refers to ResNet.avgpool layer. The early layer in CLIP refers to CLIP.visual.avgpool layer, and the late layer in CLIP refers to CLIP.visual.attnpool layer.”

      Again the claim 'novel' is a bit overblown here since the real-world size ratings were also already collected as part of THINGS+, so all data used here is available.

      We removed this phrasing in our manuscript.

      Real-world size ratings ranged 'from 0 - 519'; it seems unlikely this was the actual scale presented to subjects, I assume it was some sort of slider?

      You are correct. We should clarify how the real-world size values were obtained from the THINGS+ dataset.

      In their experiment, participants first rated the size of a single object concept (word shown on the screen) by clicking on a continuous slider of 520 units, which was anchored by nine familiar real-world reference objects (e.g., “grain of sand,” “microwave oven,” “aircraft carrier”) that spanned the full expected size range on a logarithmic scale. Importantly, participants were not shown any numerical values on the scale—they were guided purely by the semantic meaning and relative size of the anchor objects. After the initial response, the scale zoomed in around the selected region (covering 160 units of the 520-point scale) and presented finer anchor points between the previous reference objects. Participants then refined their rating by dragging from the lower to upper end of the typical size range for that object. If the object was standardized in size (e.g., “soccer ball”), a single click sufficed. These size judgments were collected across at least 50 participants per object, and final scores were derived from the central tendency of these responses. Although the final size values numerically range from 0 to 519 (after scaling), this range is not known to participants and is only applied post hoc to construct the size RDMs.

      We added these details in the Materials and Method section:

      (line 219 to 230) “In the THINGS+ dataset, 2010 participants (different from the subjects in THINGS EEG2) did an online size rating task and completed a total of 13024 trials corresponding to 1854 object concepts using a two-step procedure. In their experiment, first, each object was rated on a 520-unit continuous slider anchored by familiar reference objects (e.g., “grain of sand,” “microwave oven,” “aircraft carrier”) representing a logarithmic size range. Participants were not shown numerical values but used semantic anchors as guides. In the second step, the scale zoomed in around the selected region to allow for finer-grained refinement of the size judgment. Final size values were derived from aggregated behavioral data and rescaled to a range of 0–519 for consistency across objects, with the actual mean ratings across subjects ranging from 100.03 (‘grain of sand’) to 423.09 (‘subway’).”

      Why is conducting a one-tailed (p<0.05) test valid for EEG-ANN comparisons? Shouldn't this be two-tailed?

      Our use of one-tailed tests was based on the directional hypothesis that representational similarity between EEG and ANN RDMs would be positive, as supported by prior literature showing correspondence between hierarchical neural networks and human brain representations (e.g., Cichy et al., 2016; Kuzovkin et al., 2018). This is consistent with a large number of RSA studies that conduct one-tailed tests (i.e., testing the hypothesis that coefficients are greater than zero: e.g., Kuzovkin et al., 2018; Nili et al., 2014; Hebart et al., 2018; Kaiser et al., 2019; Kaiser et al., 2020; Kaiser et al., 2022). Thus, we specifically tested whether the similarity was significantly greater than zero.

      Cichy, R. M., Khosla, A., Pantazis, D., Torralba, A., & Oliva, A. (2016). Comparison of deep neural networks to spatio-temporal cortical dynamics of human visual object recognition reveals hierarchical correspondence. Scientific reports, 6(1), 27755.

      Kuzovkin, I., Vicente, R., Petton, M., Lachaux, J. P., Baciu, M., Kahane, P., ... & Aru, J. (2018). Activations of deep convolutional neural networks are aligned with gamma band activity of human visual cortex. Communications biology, 1(1), 107.

      Nili, H., Wingfield, C., Walther, A., Su, L., Marslen-Wilson, W., & Kriegeskorte, N. (2014). A toolbox for representational similarity analysis. PLoS computational biology, 10(4), e1003553.

      Hebart, M. N., Bankson, B. B., Harel, A., Baker, C. I., & Cichy, R. M. (2018). The representational dynamics of task and object processing in humans. Elife, 7, e32816.

      Kaiser, D., Turini, J., & Cichy, R. M. (2019). A neural mechanism for contextualizing fragmented inputs during naturalistic vision. elife, 8, e48182.

      Kaiser, D., Inciuraite, G., & Cichy, R. M. (2020). Rapid contextualization of fragmented scene information in the human visual system. Neuroimage, 219, 117045.

      Kaiser, D., Jacobs, A. M., & Cichy, R. M. (2022). Modelling brain representations of abstract concepts. PLoS Computational Biology, 18(2), e1009837.

      Importantly, we note that using a two-tailed test instead would not change the significance of our results. However, we believe the one-tailed test remains more appropriate given our theoretical prediction of positive similarity between ANN and brain representations.

      The sentence on the partial correlation description (page 11 'we calculated partial correlations with one-tailed test against the alternative hypothesis that the partial correlation was positive (greater than zero)') didn't make sense to me; are you referring to the null hypothesis here?

      We revised this sentence to clarify that we tested against the null hypothesis that the partial correlation was less than or equal to zero, using a one-tailed test to assess whether the correlation was significantly greater than zero.

      (line 281 to 284) “…, we calculated partial correlations and used a one-tailed test against the null hypothesis that the partial correlation was less than or equal to zero, testing whether the partial correlation was significantly greater than zero.”
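To make the directional test concrete, the following is a minimal sketch of a Spearman partial correlation followed by a one-tailed test against zero. It is illustrative only: the simulated RDM vectors, the helper name `partial_spearman`, and the choice of a one-sample t-test across 10 subjects are assumptions made for this example and are not taken from the manuscript's analysis code.

```python
import numpy as np
from scipy import stats

def partial_spearman(x, y, covariates):
    """Spearman partial correlation of x and y, controlling for the covariates."""
    # Rank-transform all variables, then work with Pearson correlation on the ranks.
    rx, ry = stats.rankdata(x), stats.rankdata(y)
    rz = np.column_stack([stats.rankdata(c) for c in covariates])
    design = np.column_stack([np.ones(len(rx)), rz])
    # Residuals after regressing the covariate ranks out of x and out of y.
    res_x = rx - design @ np.linalg.lstsq(design, rx, rcond=None)[0]
    res_y = ry - design @ np.linalg.lstsq(design, ry, rcond=None)[0]
    return np.corrcoef(res_x, res_y)[0, 1]

rng = np.random.default_rng(0)
n_pairs = 200  # hypothetical number of RDM entries (upper triangle)
real_size = rng.normal(size=n_pairs)
retinal_size = 0.5 * real_size + rng.normal(size=n_pairs)  # correlated confound
depth = 0.5 * real_size + rng.normal(size=n_pairs)         # correlated confound

# One partial correlation per simulated subject (simulated EEG RDM = size + noise).
subject_r = np.array([
    partial_spearman(real_size + rng.normal(size=n_pairs), real_size,
                     [retinal_size, depth])
    for _ in range(10)
])

# One-tailed test: H0 is "mean partial correlation <= 0", H1 is "> 0".
t_stat, p_value = stats.ttest_1samp(subject_r, popmean=0.0, alternative="greater")
print(f"mean r = {subject_r.mean():.3f}, t(9) = {t_stat:.2f}, one-tailed p = {p_value:.2g}")
```

Setting `alternative="two-sided"` in the same call gives the two-tailed variant discussed above.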

      (3) Results:

      I would prevent the use of the word 'pure', your measurement is one specific operationalization of this concept of real-world size that is not guaranteed to result in unconfounded representations. This is in fact impossible whenever one is using a finite set of natural stimuli and calculating metrics on those - there can always be a factor or metric that was not considered that could explain some of the variance in your measurement. It is overconfident to claim to have achieved some form of Platonic ideal here and to have taken into account all confounds.

      Your point is well taken. Our original use of the term “pure” was intended to reflect statistical control for known confounding factors, but we recognize that this wording may imply a stronger claim than warranted. In response, we revised all relevant language in the manuscript to instead describe the statistically isolated or relatively unconfounded representation of real-world size, clarifying that our findings pertain to the unique contribution of real-world size after accounting for retinal size and real-world depth.

      Figure 2C: It's not clear why peak latencies are computed on the 'full' correlations rather than the partial ones.

      No. The peak latency results in Figure 2C were computed on the partial correlation results – we mentioned this in the figure caption – “Temporal latencies for peak similarity (partial Spearman correlations) between EEG and the 3 types of object information.”

      SEM = SEM across the 10 subjects?

      Yes. We added this in the figure caption.

      Figure 3F y-axis says it's partial correlations but not clear what is partialled out here.

      We identified and corrected a labeling error in both Figure 1 and Figure 3F. Specifically, our EEG × ANN analysis used Spearman correlation, not partial correlation as mistakenly indicated in the original figure label and text. We conducted partial correlations for EEG × HYP and ANN × HYP. For EEG × ANN, however, we directly calculated the correlation between the EEG RDMs and the ANN RDM of each layer. We corrected these errors: (1) In Figure 1, we removed the erroneous “partial” label from the EEG × ANN path and updated the caption to clearly outline which comparisons used partial correlation. (2) In Figure 3F, we corrected the Y-axis label to “(correlation)”.

      Reviewer #3 (Recommendations For The Authors):

      (1) Several methodologies should be clarified:

      (a) It's stated that EEG was sampled at 100 Hz. I assume this was downsampled? From what original frequency?

      Yes. We added more details about the EEG data:

      (line 167 to 174) “The EEG data were originally sampled at 1000Hz and online-filtered between 0.1 Hz and 100 Hz during acquisition, with recordings referenced to the Fz electrode. For preprocessing, no additional filtering was applied. Baseline correction was performed by subtracting the mean signal during the 100 ms pre-stimulus interval from each trial and channel separately. We used already preprocessed data from 17 channels with labels beginning with “O” or “P” (O1, Oz, O2, PO7, PO3, POz, PO4, PO8, P7, P5, P3, P1, Pz, P2) ensuring full coverage of posterior regions typically involved in visual object processing. The epoched data were then down-sampled to 100 Hz.”
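The two preprocessing steps named in this passage (baseline correction and down-sampling) can be sketched as follows. This is a hedged illustration rather than the dataset's actual pipeline: the epoch window, the trial count, and the use of plain decimation (keeping every 10th sample, without an anti-aliasing filter) are assumptions chosen to keep the example short.

```python
import numpy as np

fs_orig, fs_new = 1000, 100          # Hz, as stated in the response
n_trials, n_channels = 4, 17         # trial count is illustrative
t = np.arange(-100, 300) / 1000.0    # hypothetical epoch: -100 ms to 299 ms

rng = np.random.default_rng(0)
# Simulated epochs with a constant offset that baseline correction should remove.
epochs = rng.normal(loc=5.0, size=(n_trials, n_channels, t.size))

# Baseline correction: subtract the mean of the 100 ms pre-stimulus interval,
# separately for each trial and channel.
baseline = epochs[:, :, t < 0].mean(axis=2, keepdims=True)
corrected = epochs - baseline

# Naive down-sampling from 1000 Hz to 100 Hz by keeping every 10th sample.
step = fs_orig // fs_new
downsampled = corrected[:, :, ::step]
print(downsampled.shape)
```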

      (b) Why was decoding accuracy used as the human RDM method rather than the EEG data themselves?

      Thanks for your question! We would like to address why we used decoding accuracy for EEG RDMs rather than correlation. While fMRI RDMs are typically calculated using 1 minus the correlation coefficient, decoding accuracy is more commonly used for EEG RDMs (Grootswagers et al., 2017; Xie et al., 2020). The primary reason is that EEG signals are more susceptible to noise than fMRI data. Correlation-based methods are particularly sensitive to noise and may not reliably capture the functional differences between EEG patterns for different conditions. Decoding accuracy, by training classifiers to focus on task-relevant features, can effectively mitigate the impact of noisy signals and capture the representational difference between two conditions.

      Grootswagers, T., Wardle, S. G., & Carlson, T. A. (2017). Decoding dynamic brain patterns from evoked responses: A tutorial on multivariate pattern analysis applied to time series neuroimaging data. Journal of Cognitive Neuroscience, 29(4), 677-697.

      Xie, S., Kaiser, D., & Cichy, R. M. (2020). Visual imagery and perception share neural representations in the alpha frequency band. Current Biology, 30(13), 2621-2627.

      We added this explanation to the manuscript:

      (line 204 to 209) “Since EEG has a low SNR and includes rapid transient artifacts, Pearson correlations computed over very short time windows yield unstable dissimilarity estimates (Kappenman & Luck, 2010; Luck, 2014) and may thus fail to reliably detect differences between images. In contrast, decoding accuracy - by training classifiers to focus on task-relevant features - better mitigates noise and highlights representational differences.”
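As an illustration of how pairwise decoding accuracy can serve as a dissimilarity measure, the sketch below builds a small RDM from simulated single-trial patterns. The classifier used here (nearest centroid with leave-one-trial-out cross-validation) is a deliberately simple stand-in and is not claimed to be the classifier used in the manuscript; the function names and the simulated data are likewise hypothetical.

```python
import numpy as np

def pairwise_decoding_accuracy(a, b):
    """Leave-one-trial-out accuracy of a nearest-centroid classifier.

    a, b: arrays of shape (n_trials, n_features) for the two conditions.
    """
    data = np.vstack([a, b])
    labels = np.array([0] * len(a) + [1] * len(b))
    correct = 0
    for i in range(len(data)):
        train = np.ones(len(data), dtype=bool)
        train[i] = False  # hold out trial i
        centroids = [data[train & (labels == k)].mean(axis=0) for k in (0, 1)]
        distances = [np.linalg.norm(data[i] - c) for c in centroids]
        correct += int(np.argmin(distances) == labels[i])
    return correct / len(data)

def decoding_rdm(patterns):
    """patterns: (n_conditions, n_trials, n_features) -> symmetric RDM of accuracies."""
    n = patterns.shape[0]
    rdm = np.zeros((n, n))
    for i in range(n):
        for j in range(i + 1, n):
            rdm[i, j] = rdm[j, i] = pairwise_decoding_accuracy(patterns[i], patterns[j])
    return rdm

rng = np.random.default_rng(0)
n_conditions, n_trials, n_features = 3, 20, 17
# Conditions 0 and 1 share the same mean pattern; condition 2 is clearly different.
means = np.array([0.0, 0.0, 3.0])[:, None, None]
patterns = rng.normal(size=(n_conditions, n_trials, n_features)) + means
rdm = decoding_rdm(patterns)
print(np.round(rdm, 2))
```

Higher accuracy is read as greater dissimilarity: the two conditions with identical means decode near chance, whereas the distinct condition decodes almost perfectly.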

      (c) How were the specific posterior electrodes selected?

      The 17 posterior electrodes used in our analyses were pre-selected and provided in the THINGS EEG2 dataset, and correspond to standard occipital and parietal sites based on the 10-10 EEG system. Specifically, we included all 17 electrodes with labels beginning with “O” or “P”, ensuring full coverage of posterior regions typically involved in visual object processing (Page 7).

      (d) The specific layers should be named rather than the vague ("last visual")

      We apologize for the confusion. We added more details about the layer information:

      (line 255 to 257) “The early layer in ResNet refers to ResNet.maxpool layer, and the late layer in ResNet refers to ResNet.avgpool layer. The early layer in CLIP refers to CLIP.visual.avgpool layer, and the late layer in CLIP refers to CLIP.visual.attnpool layer.”

      (line 420 to 434) “As shown in Figure 3F, the early layer representations of both ResNet and CLIP (ResNet.maxpool layer and CLIP.visual.avgpool) showed significant correlations with early EEG time windows (early layer of ResNet: 40-280ms, early layer of CLIP: 50-130ms and 160-260ms), while the late layers (ResNet.avgpool layer and CLIP.visual.attnpool layer) showed correlations extending into later time windows (late layer of ResNet: 80-300ms, late layer of CLIP: 70-300ms). Although there is substantial temporal overlap between early and late model layers, the overall pattern suggests a rough correspondence between model hierarchy and neural processing stages.

      We further extended this analysis across intermediate layers of both ResNet and CLIP models (from early to late, ResNet: ResNet.maxpool, ResNet.layer1, ResNet.layer2, ResNet.layer3, ResNet.layer4, ResNet.avgpool; from early to late, CLIP: CLIP.visual.avgpool, CLIP.visual.layer1, CLIP.visual.layer2, CLIP.visual.layer3, CLIP.visual.layer4, CLIP.visual.attnpool).”

      (e) p19: please change the reporting of t-statistics to standard APA format.

      Thanks for the suggestion. We changed the reporting format accordingly:

      (line 392 to 394) “The representation of real-world size had a significantly later peak latency than that of both retinal size, t(9)=4.30, p=.002, and real-world depth, t(9)=18.58, p<.001. And retinal size representation had a significantly later peak latency than real-world depth, t(9)=3.72, p=.005.”

      (2) "early layer of CLIP: 50-130ms and 160-260ms), while the late layer representations of two ANNs were significantly correlated with later representations in the human brain (late layer of ResNet: 80-300ms, late layer of CLIP: 70-300ms)."

      This seems a little strong, given the large amount of overlap between these models.

      We agree that our original wording may have overstated the distinction between early and late layers, given the substantial temporal overlap in their EEG correlations. We revised this sentence to soften the language to reflect the graded nature of the correspondence, and now describe the pattern as a general trend rather than a strict dissociation:

      (line 420 to 427) “As shown in Figure 3F, the early layer representations of both ResNet and CLIP (ResNet.maxpool layer and CLIP.visual.avgpool) showed significant correlations with early EEG time windows (early layer of ResNet: 40-280ms, early layer of CLIP: 50-130ms and 160-260ms), while the late layers (ResNet.avgpool layer and CLIP.visual.attnpool layer) showed correlations extending into later time windows (late layer of ResNet: 80-300ms, late layer of CLIP: 70-300ms). Although there is substantial temporal overlap between early and late model layers, the overall pattern suggests a rough correspondence between model hierarchy and neural processing stages.”

      (3) "Also, human brain representations showed a higher similarity to the early layer representation of the visual model (ResNet) than to the visual-semantic model (CLIP) at an early stage. "

      This has been previously reported by Greene & Hansen, 2020 J Neuro.

      Thanks! We added this reference.

      (4) "ANN (and Word2Vec) model RDMs"

      Why not just "model RDMs"? Might provide more clarity.

      We chose to use the phrasing “ANN (and Word2Vec) model RDMs” to maintain clarity and avoid ambiguity. In the literature, the term “model RDMs” is sometimes used more broadly to include hypothesis-based feature spaces or conceptual models, and we wanted to clearly distinguish our use of RDMs derived from artificial neural networks and language models. Additionally, explicitly referring to ANN or Word2Vec RDMs improves clarity by specifying the model source of each RDM. We hope this clarification justifies our choice to retain the original phrasing for clarity.

    1. Author response:

      The following is the authors’ response to the original reviews.

      Reviewer #1 (Public review): 

      This study presents cryoEM-derived structures of the Trypanosome aquaporin AQP2, in complex with its natural ligand, glycerol, as well as two trypanocidal drugs, pentamidine and melarsoprol, which use AQP2 as an uptake route. The structures are high quality, and the density for the drug molecules is convincing, showing a binding site in the centre of the AQP2 pore. 

      The authors then continue to study this system using molecular dynamics simulations. Their simulations indicate that the drugs can pass through the pore and identify a weak binding site in the centre of the pore, which corresponds with that identified through cryoEM analysis. They also simulate the effect of drug resistance mutations, which suggests that the mutations reduce the affinity for drugs and therefore might reduce the likelihood that the drugs enter into the centre of the pore, reducing the likelihood that they progress through into the cell. 

      While the cryoEM and MD studies are well conducted, it is a shame that the drug transport hypothesis was not tested experimentally. For example, did they do cryoEM with AQP2 with drug resistance mutations and see if they could see the drugs in these maps? They might not bind, but another possibility is that the binding site shifts, as seen in Chen et al. 

      TbAQP2 from the drug-resistant mutants does not transport either melarsoprol or pentamidine and there was thus no evidence to suggest that the mutant TbAQP2 channels could bind either drug. Moreover, there is not a single mutation that is characteristic for drug resistance in TbAQP2: references 12–15 show a plethora of chimeric AQP2/3 constructs in addition to various point mutations in laboratory strains and field isolates. In reference 17 we describe a substantial number of SNPs that reduced pentamidine and melarsoprol efficacy to levels that would constitute clinical resistance to acceptable dosage regimens. It thus appears that there are many and diverse mutations that are able to modify the protein sufficiently to induce resistance, and likely in multiple different ways, including the narrowing of the pore, changes to interacting amino acids, access to the pore etc. We therefore did not attempt to determine the structures of the mutant channels because we did not think that in most cases we would see any density for the drugs in the channel, and we would be unable to define ‘the’ resistance mechanism if we did in the case of one individual mutant TbAQP2. Our MD data suggest that pentamidine binding affinity is in the range of 50-300 µM for the mutant TbAQP2s selected for that test (I110W and L258Y/L264R), i.e. >1000-fold higher than TbAQP2WT. Thus these structures will be exceedingly challenging to determine with pentamidine in the pore but, of course, until the experiment has been tried we will not know for sure.

      Do they have an assay for measuring drug binding? 

      We tried many years ago to develop a <sup>3</sup>H-pentamidine binding assay to purified wild type TbAQP2 but we never got satisfactory results even though the binding should be in the double-digit nanomolar range. This may be for any number of technical reasons and could also be partly because flexible di-benzamidines bind non-specifically to proteins at µM concentrations giving rise to high background. Measuring binding to the mutants was not tested given that they would be binding pentamidine in the µM range. If we were to pursue this further, then isothermal titration calorimetry (ITC) may be one way forward as this can measure µM affinity binding using unlabelled compounds, although it uses a lot of protein and background binding would need to be carefully assessed; see for example our work on measuring tetracycline binding to the tetracycline antiporter TetAB (https://doi.org/10.1016/j.bbamem.2015.06.026). Membrane proteins are also particularly tricky for this technique as the chemical activity of the protein solution must be identical to the chemical activity of the substrate solution which titrates in the molecule binding to the protein; this can be exceedingly problematic if any free detergent remains in the purified membrane protein. Another possibility may be fluorescence polarisation spectroscopy, although this would require fluorescently labelling the drugs which would very likely affect their affinity for TbAQP2 and how they interact with the wild type and mutant proteins – see the detailed SAR analysis in Alghamdi et al. 2020 (ref. 17). As you will appreciate, it would take considerable time and effort to set up an assay for measuring drug binding to mutants, and this is beyond the scope of the current work.

      I think that some experimental validation of the drug binding hypothesis would strengthen this paper. Without this, I would recommend the authors to soften the statement of their hypothesis (i.e, lines 65-68) as this has not been experimentally validated.

      We agree with the referee that direct binding of drugs to the mutants would be very nice to have, but we have neither the time nor resources to do this. We have therefore softened the statement on lines 65-68 to read ‘Drug-resistant TbAQP2 mutants are still predicted to bind pentamidine, but the much weaker binding in the centre of the channel observed in the MD simulations would be insufficient to compensate for the high energy processes of ingress and egress, hence impairing transport at pharmacologically relevant concentrations.’ 

      Reviewer #2 (Public review): 

      Summary: 

      The authors present 3.2-3.7 Å cryo-EM structures of Trypanosoma brucei aquaglyceroporin-2 (TbAQP2) bound to glycerol, pentamidine, or melarsoprol and combine them with extensive all-atom MD simulations to explain drug recognition and resistance mutations. The work provides a persuasive structural rationale for (i) why positively selected pore substitutions enable diamidine uptake, and (ii) how clinical resistance mutations weaken the high-affinity energy minimum that drives permeation. These insights are valuable for chemotherapeutic re-engineering of diamidines and aquaglyceroporin-mediated drug delivery. 

      My comments are on the MD part. 

      Strengths: 

      The study 

      (1) Integrates complementary cryo-EM, equilibrium, applied voltage MD simulations, and umbrella-sampling PMFs, yielding a coherent molecular-level picture of drug permeation. 

      (2) Offers direct structural rationalisation of long-standing resistance mutations in trypanosomes, addressing an important medical problem. 

      Weaknesses: 

      Unphysiological membrane potential. A field of 0.1 V nm<sup>-1</sup> (~1 V across the bilayer) was applied to accelerate translocation. From the traces (Figure 1c), it can be seen that the translocation occurred really quickly through the channel, suggesting that the field might have introduced some large changes in the protein. The authors state that they checked visually for this, but some additional analysis, especially of the residues next to the drug, would be welcome. 

      This is a good point from the referee, and we thank them for raising it. It is common to use membrane potentials in simulations that are higher than the physiological value, although these are typically lower than used here. The reason we used the higher value was to speed sampling and it still took 1,400 ns for transport in the physiologically correct direction, and even then, only in 1/3 repeats. Hence this choice of voltage was probably necessary to see the effect. The exceedingly slow rate of pentamidine permeation seen in the MD simulation was consistent with the experimental observations, as discussed in Alghamdi et al (2020) [ref. 17] where we estimated that TbAQP2-mediated pentamidine uptake in T. brucei bloodstream forms proceeds at just 9.5×10<sup>5</sup> molecules/cell/h; the number of functional TbAQP2 units in the plasma membrane is not known but their location is limited to the small flagellar pocket (Quintana et al. PLoS Negl Trop Dis 14, e0008458 (2020)). 

      The referee is correct that it is important to make sure that the applied voltage is not causing issues for the protein, especially for residues in contact with the drug. We have carried out RMSF analysis to better test this. The data show that comparing our simulations with the voltage applied to the monomeric MD simulations + PNTM with no voltage reveals little difference in the dynamics of the drug-contacting residues. 

      We have added these new data as Supplementary Fig. 12b with a new legend (lines 1134-1138) 

      ‘b, RMSF calculations were run on monomeric TbAQP2 with either no membrane voltage or a 0.1V nm<sup>-1</sup> voltage applied (in the physiological direction). Shown are residues in contact with the pentamidine molecule, coloured by RMSF value. RMSF values are shown for residues Leu122, Phe226, Ile241, and Leu264. The data suggest the voltage has little impact on the flexibility or stability of the pore lining residues.’

      We have also added the following text to the manuscript (lines 524-530):

      ‘Membrane potential simulations were run using the computational electrophysiology protocol. An electric field of 0.1 V/nm was applied in the z-axis dimension only, to create a membrane potential of about 1 V (see Fig. S10a). Note that this is higher than the physiological value of 87.1 ± 2.1 mV at pH 7.3 in bloodstream T. brucei, and was chosen to improve the sampling efficiency of the simulations. The protein and lipid molecules were visually confirmed to be unaffected by this voltage, which we quantify using RMSF analysis on pentamidine-contacting residues (Fig. S12b).’ 
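For readers less familiar with the RMSF comparison described above, a minimal sketch of the calculation is given here. It assumes trajectories that have already been aligned to a common reference and uses synthetic coordinates for four atoms (standing in for the four pentamidine-contacting residues); the array shapes, the noise level, and the function name `rmsf` are assumptions made for illustration and do not reproduce the actual analysis.

```python
import numpy as np

def rmsf(coords):
    """Per-atom root-mean-square fluctuation.

    coords: array of shape (n_frames, n_atoms, 3), already aligned, in nm.
    """
    mean_pos = coords.mean(axis=0)                   # time-averaged position per atom
    sq_dev = ((coords - mean_pos) ** 2).sum(axis=2)  # squared displacement per frame
    return np.sqrt(sq_dev.mean(axis=0))

rng = np.random.default_rng(0)
n_frames, n_atoms = 500, 4  # e.g. one atom for each of four contacting residues
reference = rng.normal(size=(n_atoms, 3))

# Two synthetic trajectories with the same fluctuation amplitude (0.05 nm per axis),
# standing in for the "no voltage" and "voltage applied" simulations.
no_field = reference + rng.normal(scale=0.05, size=(n_frames, n_atoms, 3))
with_field = reference + rng.normal(scale=0.05, size=(n_frames, n_atoms, 3))

rmsf_no_field = rmsf(no_field)
rmsf_with_field = rmsf(with_field)
print(np.round(rmsf_no_field, 3), np.round(rmsf_with_field, 3))
```

Similar per-residue values in the two conditions are what would be expected if the applied field does not noticeably change the flexibility of the pore-lining residues.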

      Based on applied voltage simulations, the authors argue that the membrane potential would help get the drug into the cell, and that a high value of the potential was applied merely to speed up the simulation. At the same time, the barrier for translocation from PMF calculations is ~40 kJ/mol for WT. Is the physiological membrane voltage enough to overcome this barrier in a realistic time? In this context, I do not see how much value the applied voltage simulations have, as one can estimate the work needed to translocate the substrate on PMF profiles alone. The authors might want to tone down their conclusions about the role of membrane voltage in the drug translocation.

      We agree that the PMF barriers are considerable; however, we highlight that other studies have seen similar landscapes, e.g. PMID 38734677, which saw a barrier of ca. 10-15 kcal/mol (ca. 40-60 kJ/mol) for PNTM traversing the channel. This was reduced by ca. 4 kcal/mol when a 0.4 V nm<sup>-1</sup> membrane potential was applied, so we expect a similar effect to be seen here.

      We have updated the Results to more clearly highlight this point and added the following text (lines 274-275):

      ‘We note that previous studies using these approaches saw energy barriers of a similar size, and that these are reduced in the presence of a membrane voltage[17,31].’ 
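The energy scales discussed in this exchange can be put side by side with some simple arithmetic. The sketch below converts the quoted kcal/mol values to kJ/mol and estimates the electrostatic work zFV for a +2 charge crossing the full potential drop. This is a rough upper bound offered as an assumption-laden illustration: it ignores where along the pore the potential actually drops and any coupling to the protein, and the function names are hypothetical.

```python
KCAL_TO_KJ = 4.184             # kJ per kcal (thermochemical calorie)
FARADAY_KJ_PER_MOL_V = 96.485  # Faraday constant in kJ mol^-1 V^-1

def kcal_to_kj(value_kcal):
    """Convert an energy from kcal/mol to kJ/mol."""
    return value_kcal * KCAL_TO_KJ

def max_electrostatic_work(charge, voltage_v):
    """Upper bound (kJ/mol) on the work done on a charge crossing the full potential."""
    return charge * FARADAY_KJ_PER_MOL_V * voltage_v

barrier_range_kj = (kcal_to_kj(10), kcal_to_kj(15))     # quoted 10-15 kcal/mol barrier
reduction_kj = kcal_to_kj(4)                            # quoted ca. 4 kcal/mol reduction
work_physiological = max_electrostatic_work(2, 0.0871)  # +2 charge, 87.1 mV
work_simulated = max_electrostatic_work(2, 1.0)         # +2 charge, ~1 V

print(f"barrier: {barrier_range_kj[0]:.1f}-{barrier_range_kj[1]:.1f} kJ/mol")
print(f"reduction: {reduction_kj:.1f} kJ/mol")
print(f"zFV at 87.1 mV: {work_physiological:.1f} kJ/mol")
print(f"zFV at 1 V: {work_simulated:.1f} kJ/mol")
```

Under these assumptions, 10-15 kcal/mol corresponds to roughly 42-63 kJ/mol, and the physiological potential could contribute at most about 17 kJ/mol for a doubly charged ligand.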

      Pentamidine charge state and protonation. The ligand was modeled as +2, yet pKa values might change with the micro-environment. Some justification of this choice would be welcome. 

      Pentamidine contains two amidine groups, each of which is expected to have a pKa above 10 in solution (PMID: 20368397), suggesting that the molecule will carry a +2 charge. Using the +2 charge is also in line with previous MD studies (PMID: 32762841). We have added the following text to the Methods (lines 506-509):

      ‘The pentamidine molecule used existing parameters available in the CHARMM36 database under the name PNTM with a charge state of +2 to reflect the predicted pKas of >10 for these groups [73] and in line with previous MD studies[17].’

      We note that accounting for the impact of the microenvironment is an excellent point – future studies might employ constant pH calculations to address this.
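The reasoning behind the +2 charge state can be illustrated with the Henderson-Hasselbalch relation. The sketch below is a simplified estimate in bulk solution: it uses pKa = 10 as a lower bound for each group, takes pH 7.3 from the text above, and treats the two groups as independent, all of which are assumptions that would not capture micro-environment effects inside the pore.

```python
def protonated_fraction(pka, ph):
    """Fraction of a basic group that is protonated at a given pH (Henderson-Hasselbalch)."""
    return 1.0 / (1.0 + 10 ** (ph - pka))

pka_lower_bound = 10.0  # each group is expected to have a pKa above 10
ph = 7.3                # pH quoted for the membrane potential measurement

frac_one_group = protonated_fraction(pka_lower_bound, ph)
# Assuming the two groups titrate independently, the +2 species fraction is the product.
frac_doubly_charged = frac_one_group ** 2

print(f"one group protonated: {frac_one_group:.4f}")
print(f"both groups protonated (+2): {frac_doubly_charged:.4f}")
```

With these inputs more than 99% of the molecules would carry the full +2 charge in solution; a higher pKa pushes the fraction closer to 1.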

      The authors state that this RMSD is small for the substrate and show plots in Figure S7a, with the bottom plot being presumably done for the substrate (the legends are misleading, though), levelling off at ~0.15 nm RMSD. However, in Figure S7a, we see one trace (light blue) deviating from the initial position by more than 0.2 nm - that would surely result in an RMSD larger than 0.15, but this is somewhat not reflected in the RMSD plots. 

      The bottom plot of Fig. S9a (previously Fig. S7a) is indeed the RMSD of the drug (in relation to the protein). We have clarified the legend with the following text (lines 1037-1038): ‘… or for the pentamidine molecule itself, i.e. in relation to the Cα of the channel (bottom).’ 

      With regard to the second comment, we assume the referee is referring to the light blue trace from Fig S9c. These data are actually for the monomeric channel rather than the tetramer. We apologise for not making this clearer in the legend. We have added the word ‘monomeric’ (line 1041).

      Reviewer #3 (Public review): 

      Summary: 

      Recent studies have established that trypanocidal drugs, including pentamidine and melarsoprol, enter the trypanosomes via the glyceroaquaporin AQP2 (TbAQP2). Interestingly, drug resistance in trypanosomes is, at least in part, caused by recombination with the neighbouring gene, AQP3, which is unable to permeate pentamidine or melarsoprol. The effect of the drugs on cells expressing chimeric proteins is significantly reduced. In addition, controversy exists regarding whether TbAQP2 permeates drugs like an ion channel, or whether it serves as a receptor that triggers downstream processes upon drug binding. In this study the authors set out to achieve three objectives: 

      (1) to determine if TbAQP2 acts as a channel or a receptor,

      We should clarify here that this was not an objective of the current manuscript as the transport activity has already been extensively characterised in the literature, as described in the introduction.

      (2) to understand the molecular interactions between TbAQP2 and glycerol, pentamidine, and melarsoprol, and 

      (3) to determine the mechanism by which mutations that arise from recombination with TbAQP3 result in reduced drug permeation. 

      Indeed, all three objectives are achieved in this paper. Using MD simulations and cryo-EM, the authors determine that TbAQP2 likely permeates drugs like an ion channel. The cryo-EM structures provide details of glycerol and drug binding, and show that glycerol and the drugs occupy the same space within the pore. Finally, MD simulations and lysis assays are employed to determine how mutations in TbAQP2 result in reduced permeation of drugs by making entry and exit of the drug relatively more energy-expensive. Overall, the strength of evidence used to support the authors' claims is solid. 

      Strengths: 

      The cryo-EM portion of the study is strong, and while the overall resolution of the structures is in the 3.5Å range, the local resolution within the core of the protein and the drug binding sites is considerably higher (~2.5Å). 

      I also appreciated the MD simulations on the TbAQP2 mutants and the mechanistic insights that resulted from this data. 

      Weaknesses: 

      (1) The authors do not provide any empirical validation of the drug binding sites in TbAQP2. While the discussion mentions that the binding site should not be thought of as a classical fixed site, the MD simulations show that there's an energetically preferred slot (i.e., high occupancy interactions) within the pore for the drugs. For example, mutagenesis and a lysis assay could provide us with some idea of the contribution/importance of the various residues identified in the structures to drug permeation. This data would also likely be very valuable in learning about selectivity for drugs in different AQP proteins.

      On a philosophical level, we disagree with the requirement for ‘validation’ of a structure by mutagenesis. It is unclear what such mutagenesis would tell us beyond what was already shown experimentally through <sup>3</sup>H-pentamidine transport, drug sensitivity and lysis assays i.e. a given mutation will impact permeation to a certain extent. But on the structural level, what does mutagenesis tell us? If a bulky aromatic residue that makes many van der Waals interactions with the substrate is changed to an alanine residue and transport is reduced, what does this mean? It would confirm that the phenylalanine residue is very likely indeed making van der Waals contacts to the substrate, but we knew that already from the WT structure. And if it doesn’t have any effect? Well, it could mean that the van der Waals interactions with that particular residue are not that important or it could be that the substrate has changed its positions slightly in the channel and the new pose has similar energy of interactions to that observed in the wild type channel. Regardless of the result, any data from mutagenesis would be open to interpretation and therefore would not impact on the conclusions drawn in this manuscript. We might not learn anything new unless all residues interacting with the substrate are mutated, the structure of each mutant was determined and MD simulations were performed for all, which is beyond the scope of this work. Even then, the value for understanding clinical drug resistance would be limited, as this phenomenon has been linked to various chimeric rearrangements with adjacent TbAQP3 (references 12–15), each with a structure distinct from TbAQP2 with a single SNP. We also note that the recent paper by Chen et al. did not include any mutagenesis of the drug binding sites in TbAQP2 in their analysis of TbAQP2, presumably for similar reasons as discussed above.

      (2) Given the importance of AQP3 in the shaping of AQP2-mediated drug resistance, I think a figure showing a comparison between the two protein structures/AlphaFold structures would be beneficial and appropriate

      We agree that the comparison is of considerable interest and would contribute further to our understanding of the unique permeation capacities of TbAQP2. As such, we followed the reviewer’s suggestion and made an AlphaFold model of TbAQP3 and compared it to our structures of TbAQP2. The RMSD is 0.6 Å to the pentamidine-bound TbAQP2, suggesting that the fold of TbAQP3 has been predicted well, although the side chain rotamers cannot be assessed for their accuracy. Previous work has defined the selectivity filter of TbAQP3 to be formed by W102, R256, Y250. The superposition of the TbAQP3 model and the TbAQP2 pentamidine-bound structure shows that one of the amine groups is level with R256 and that there is a clash with Y250 and the backbone carbonyl of Y250, which deviates in position from the backbone of TbAQP2 in this region. There is also a clash with Ile252. 

      Although these observations are indeed interesting, on their own they are highly preliminary and extensive further work would be necessary to draw any convincing conclusions regarding these residues in preventing uptake of pentamidine and melarsoprol. The TbAQP3 AlphaFold model would need to be verified by MD simulations and then we would want to look at how pentamidine would interact with the channel under different experimental conditions like we have done with TbAQP2. We would then want to mutate to Ala each of the residues singly and in combination and assess them in uptake assays to verify data from the MD simulations. This is a whole new study and, given the uncertainties surrounding the observations of just superimposing TbAQP2 structure and the TbAQP3 model, we feel that, regrettably, this is just too speculative to add to our manuscript. 
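
For readers less familiar with the RMSD value quoted in this response, the following is a minimal sketch of how such a figure is computed once two sets of matched atoms have already been superposed. The coordinates are hypothetical and are not taken from the TbAQP2 structure or the TbAQP3 model; the superposition step itself is assumed to have been done by a structure-alignment program and is not shown.

```python
import math


def rmsd(coords_a, coords_b):
    """Root-mean-square deviation between two matched lists of (x, y, z) points.

    Assumes the two coordinate sets are already superposed and listed in the
    same atom order. The result is in the same units as the input.
    """
    if len(coords_a) != len(coords_b) or not coords_a:
        raise ValueError("coordinate lists must be non-empty and of equal length")
    total = 0.0
    for (xa, ya, za), (xb, yb, zb) in zip(coords_a, coords_b):
        total += (xa - xb) ** 2 + (ya - yb) ** 2 + (za - zb) ** 2
    return math.sqrt(total / len(coords_a))


# Hypothetical, already-superposed C-alpha positions in Å (illustration only).
model = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (7.6, 0.0, 0.0)]
reference = [(0.0, 0.6, 0.0), (3.8, 0.0, 0.6), (7.6, -0.6, 0.0)]

print(round(rmsd(model, reference), 2))
```

A low backbone RMSD of this kind reports on the overall fold only, which is why side-chain rotamers need to be assessed separately.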

      (3) A few additional figures showing cryo-EM density, from both full maps and half maps, would help validate the data. 

      Two new Supplementary Figures have been made, one showing the densities for each of the secondary structure elements (the new Figure S5) and one for the half maps showing the ligands (the new Figure S6). All the remaining supplementary figures have been renamed accordingly.

      (4) Finally, this paper might benefit from including more comparisons with and analysis of data published in Chen et al (doi.org/10.1038/s41467-024-48445-4), which focus on similar objectives. Looking at all the data in aggregate might reveal insights that are not obvious from either paper on their own. For example, melarsoprol binds differently in structures reported in the two respective papers, and this may tell us something about the energy of drug-protein interactions within the pore. 

      We already made the comparisons that we felt were most pertinent and included a figure (Fig. 5) to show the difference in orientation of melarsoprol in the two structures. We do not feel that any additional comparison is sufficiently interesting to be included. As we point out, the structures are virtually identical (RMSD 0.6 Å) and therefore there are no further mechanistic insights we would like to make beyond the thorough discussion in the Chen et al paper.

      Reviewer #1 (Recommendations for the authors): 

      (1) Line 65 - I don't think that the authors have tested binding experimentally, and so rather than 'still bind', I think that 'are still predicted to bind' is more appropriate. 

      Changed as suggested

      (2) Line 69 - remove 'and' 

      Changed as suggested

      (3) Line 111 - clarify that it is the protein chain which is 'identical'. Ligands not. 

      Changed to read ‘The cryo-EM structures of TbAQP2 (excluding the drugs/substrates) were virtually identical…’

      (4) Line 186 - make the heading of this section more descriptive of the conclusion than the technique? 

      We have changed the heading to read: ‘Molecular dynamics simulations show impaired pentamidine transport in mutants’

      Reviewer #2 (Recommendations for the authors): 

      (1) Methods - a rate of 1 nm per ns is mentioned for pulling simulations, is that right? 

      Yes, to generate the initial frames for the umbrella sampling, a pull rate of 1 nm/ns was used in either the upward or the downward z-direction.

      (2) Figure S9 and S10 have their captions swapped. 

      The captions have been swapped to their proper positions.

      (3) Methods state "40 ns per window" yet also that "the first 50 ns of each window was discarded as equilibration". 

      Well spotted - this line should have read “the first 5 ns of each window was discarded as equilibration”. This has been corrected (line 541).
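
The arithmetic behind this correction can be sketched as follows. The window length (40 ns), the corrected equilibration time (5 ns) and the pull rate (1 nm/ns) come from the responses above; the function names and the 4 nm pull distance are assumptions made for illustration only.

```python
def production_time_ns(window_ns, equilibration_ns):
    """Simulation time left for analysis after discarding equilibration."""
    if equilibration_ns >= window_ns:
        raise ValueError("equilibration cannot be as long as or longer than the window")
    return window_ns - equilibration_ns


def pull_duration_ns(distance_nm, rate_nm_per_ns):
    """Time needed to pull a ligand over a given distance at a constant rate."""
    return distance_nm / rate_nm_per_ns


# Corrected values: 40 ns per window, first 5 ns discarded.
print(production_time_ns(40, 5))

# Hypothetical 4 nm pull distance at the stated 1 nm/ns rate.
print(pull_duration_ns(4.0, 1.0))

# The originally printed value (50 ns) is inconsistent with a 40 ns window.
try:
    production_time_ns(40, 50)
except ValueError as err:
    print("inconsistent:", err)
```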

      Reviewer #3 (Recommendations for the authors): 

      (1) Abstract, line 68-70: incomplete sentence.

      The sentence has been re-written: ‘The structures of drug-bound TbAQP2 represent a novel paradigm for drug-transporter interactions and are a new mechanism for targeting drugs in pathogens and human cells.’

      (2) Line 312-313: The paper you mention here came out in May 2024 - a year ago. I appreciate that they reported similar structural data, but for the benefit of the readers and the field, I would recommend a more thorough account of the points by which the two pieces of work differ. Is there some knowledge that can be gleaned by looking at all the data in the two papers together? For example, you report a glycerol-bound structure while the other group provides an apo one. Are there any mechanistic insights that can be gained from a comparison?

      We already made the comparisons that we felt were most pertinent and included a figure (Fig. 5) to show the difference in orientation of melarsoprol in the two structures. We do not feel that any additional comparison is sufficiently interesting to be included. As we point out, the structures are virtually identical (RMSD 0.6 Å) and therefore there are no further mechanistic insights we would like to make beyond the thorough discussion in the Chen et al paper.

      (3) Similarly, you can highlight the findings from your MD simulations on the TbAQP2 drug resistance mutants, which are unique to your study. How can this data help with solving the drug resistance problem?

      New drugs will need to be developed that can be transported by the mutant chimera AQP2s and the models from the MD simulations will provide a starting point for molecular docking studies. Further work will then be required in transport assays to optimise transport rather than merely binding. However, the fact that drug resistance can also arise through deletion of the AQP2 gene highlights the need for developing new drugs that target other proteins.

      (4) A glaring question that one has as a reader is why you have not attempted to solve the structures of the drug resistance mutants, either in complex with the two compounds or in their apo/glycerol-bound form? To be clear, I am not requesting this data, but it might be a good idea to bring this up in the discussion.

      TbAQP2 containing the drug-resistance mutations does not transport either melarsoprol or pentamidine (Munday et al., 2014; Alghamdi et al., 2020); there was thus no evidence to suggest that the mutant TbAQP2 channels could bind either drug. We therefore did not attempt to determine the structures of the mutant channels because we did not think that we would see any density for the drugs in the channel. Our MD data suggest that pentamidine binding affinity is in the range of 50-300 µM for the mutant TbAQP2, supporting the view that getting these structures would be highly challenging, but of course until the experiment is tried we will not know for sure.

      We also do not think we would learn anything new from determining the drug-free structures of the transport-negative mutants of TbAQP2. The MD simulations have given novel insights into why the drugs are not transported and we would rather expend effort in this direction and look at other mutants than expend further effort in determining new structures.
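
To put the 50-300 µM affinity range quoted above on an energy scale, a dissociation constant can be converted to a standard binding free energy with ΔG° = RT ln(Kd / 1 M). The sketch below is illustrative only: the temperature of 298.15 K and the 1 M standard state are assumptions, not values taken from the manuscript.

```python
import math

R_KCAL_PER_MOL_K = 1.987204e-3  # gas constant in kcal/(mol*K)


def binding_free_energy(kd_molar, temperature_k=298.15):
    """Standard binding free energy (kcal/mol) from a dissociation constant.

    Uses dG = RT * ln(Kd / 1 M); the temperature default is an assumption.
    """
    if kd_molar <= 0:
        raise ValueError("Kd must be positive")
    return R_KCAL_PER_MOL_K * temperature_k * math.log(kd_molar)


for kd_micromolar in (50, 300):
    dg = binding_free_energy(kd_micromolar * 1e-6)
    print(f"Kd = {kd_micromolar} uM -> dG = {dg:.2f} kcal/mol")
```

Under these assumptions the two ends of the range differ by only about 1 kcal/mol, and both correspond to weak binding.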

      (5) Line 152-156: Is there a molecular explanation for why the TbAQP2 has 2 glycerol molecules captured in the selectivity filter while the PfAQP2 and the human AQP7 and AQP10 have 3?

      The presence of glycerol molecules represents local energy minima for binding, which will depend on the local disposition of appropriate hydrogen bonding atoms and hydrophobic regions, in conjunction with the narrowness of the channel to effectively bind glycerol from all sides. It is noticeable that the extracellular region of the channel is wider in TbAQP2 than in AQP7 and AQP10, so this may be one reason why additional ordered glycerol molecules are absent, and only two are observed. Note also that the other structures were determined by X-ray crystallography, and the environment of the crystal lattice may have significantly decreased the rate of diffusion of glycerol, increasing the likelihood of observing their electron densities.

      (6) I would also think about including the 8JY7 (TbAQP2 apo) structure in your analysis.

      We included 8JY7 in our original analyses, but the results were identical to 8JY6 and 8JY8 in terms of the protein structure, and, in the absence of any modelled substrates in 8JY7 (the interesting part for our manuscript), we therefore have not included the comparison.

      (7) I also think, given the importance of AQP3 in this context, it would be really useful to have a comparison with the AQP3 AlphaFold structure in order to examine why it does not permeate drugs.

      We made an AlphaFold model of TbAQP3 and compared it to our structures of TbAQP2. The RMSD is 0.6 Å to the pentamidine-bound TbAQP2, suggesting that the fold of TbAQP3 has been predicted well, although the side chain rotamers cannot be assessed for their accuracy. Previous work has defined the selectivity filter of TbAQP3 to be formed by W102, R256, Y250. The superposition of the TbAQP3 model and the TbAQP2 pentamidine-bound structure shows that one of the amine groups is level with R256 and that there is a clash with Y250 and the backbone carbonyl of Y250, which deviates in position from the backbone of TbAQP2 in this region. There is also a clash with Ile252. 

      Although these observations are interesting, on their own they are preliminary in the extreme and extensive further work will be necessary to draw any convincing conclusions regarding these residues in preventing uptake of pentamidine and melarsoprol. The TbAQP3 AlphaFold model would need to be verified by MD simulations and then we would want to look at how pentamidine would interact with the channel under different experimental conditions like we have done with TbAQP2. We would then want to mutate to Ala each of the residues singly and in combination and assess them in uptake assays to verify data from the MD simulations. This is a whole new study and, given the uncertainties surrounding the observations of just superimposing TbAQP2 structure and the TbAQP3 model, we feel this is just too speculative to add to our manuscript. 

      (8) To validate the densities representing glycerol and the compounds, you should show halfmap densities for these. 

      A new figure, Fig. S6, has been made to show the half-map densities for the glycerol and drugs.

      (9) I would also like to see the density coverage of the individual helices/structural elements. 

      A new figure, Fig. S5, has been made to show the densities for the structural elements.

      (10) While the LigPlot figure is nice, I think showing the data (including the cryo-EM density) is necessary validation.

      The LigPlot figure is a diagram (an interpretation of data) and does not need the densities as these have already been shown in Fig. 1c (the data).

      (11) I would recommend including a figure that illustrates the points described in lines 123-134.

      All of the points raised in this section are already shown in Fig. 2a, which was referred to twice in this section. We have added another reference to Fig. 2a on lines 134-135 for completeness.

      (12) Line 202: I would suggest using "membrane potential/voltage" to avoid confusion with mitochondrial membrane potential. 

      We have changed this to ‘plasma membrane potential’ to differentiate it from mitochondrial membrane potential.

      (13) Figure 4: Label C.O.M. in the panels so that the figure corresponds to the legend. 

      We have altered the figure and added an explanation in the figure legend (lines 716-717):

      ‘Cyan mesh shows the density of the molecule across the MD simulation, and the asterisk shows the position of the centre of mass (COM).’

      (14) Figure S2: Panels d and e appear too similar, and it is difficult to see the stick representation of the compound. I would recommend either using different colours or showing a close-up of the site.

      We have clarified the figure by including two close-up views of the hot-spot region, one with melarsoprol overlaid and one with pentamidine overlaid.

      (15) Figure S2: Typo in legend: 8YJ7 should be 8JY7.

      Changed as suggested  

      (16) Figure S3 and Figure S4: Please clarify which parts of the process were performed in cryoSPARC and which in Relion. 

      Figure S3 gives an overview of the processing and has been simplified to give the overall picture of the procedures. All of the details were included in the Methods section as other programmes are used, not just cryoSPARC and Relion. Given the complexities of the processing, we have referred the readers to the Methods section rather than giving confusing information in Fig. S3.

      We have updated the figure legend to Fig. S4 as requested.

      (17) Figure S9 and Figure S10: The legends are swapped in these two figures.

      The captions have been swapped to their proper positions.

      (18) For ease of orientation and viewing, I would recommend showing a vertical HOLE plot aligned with an image of the AQP2 pore. 

      The HOLE plot has been re-drawn as suggested (Fig. S2).

    1. Author response:

      We are very pleased to hear the overall positive views and constructive criticisms of eLife Editors and Reviewers on our work. In particular, we appreciate their global assessment that the work is important for understanding how body plan cues shape sensorimotor behavioural patterns, that the strength of evidence is solid, and their views that our experimental toolkit will be useful to others. We also very much appreciate eLife’s assessment that our findings will be of broad interest to researchers studying neural circuits, developmental genetics, and the evolution of behaviour.

      Regarding Reviewer 1, we thank them for their positive comments on the value of our study, highlighting that our paper addresses an important question using an elegant and innovative combination of methods, which leads to clear insights into the sensory biology of self-righting, which they consider shall be useful for others in the field. We are also very pleased to hear that they consider that our study makes a substantial contribution to understanding how animals correct their body position and that the manuscript is very clearly written and couched in interesting biology. In a revised version of the manuscript, we will consider some of the interesting points raised by Rev1, including the possibility of conducting new experiments using neuronal subset-specific Gal4s, to establish whether daIV sensory neurons are also acting in a regionally specific manner along the A-P axis.

      Turning to the comments by Rev2, we are grateful to them for considering that our experimental design is elegant, and that it introduces innovative methods that will likely benefit the fly behavior community, and the results are robustly supported. In connection to other comments, in a revised manuscript we will consider addressing the question of whether normal levels of expression of the Hox gene Antennapedia within the daIV domain are essential for self-righting. We will also seek to add technical replicates to our Hox expression molecular analysis, amend typos and incorporate several of the constructive corrections mentioned.

    1. Author response:

      Reviewer #1:

      Indicated the paper provided a strong analysis of RNAseq databases to provide a biological context and resource for the massive amounts of data in the field on RNA editing. The reviewer noted that future studies will be important to define the functional consequences of the individual edits and why the RNA editing rules we identified exist. We address these comments below.

      (1) The reviewer wondered about the role of noncanonical editing to neuronal protein expression.

      Indeed, the role of noncanonical editing has been poorly studied compared to the more common A-to-I ADAR-dependent editing. Most non-canonical coding edits we found actually caused silent changes at the amino acid level, suggesting evolutionary selection against this mechanism as a pathway for generating protein diversity. As such, we suspect that most of these edits are not altering neuronal function in significant ways. Two potential exceptions to this were non-canonical edits that altered conserved residues in the synaptic proteins Arc1 and Frequenin 1. The C-to-T coding edit in the activity-regulated Arc1 mRNA that encodes a retroviral-like Gag protein involved in synaptic plasticity resulted in a P124L amino acid change (see Author response image 1 panel A below). ~50% of total Arc1 mRNA was edited at this site in both Ib and Is neurons, suggesting a potentially important role if the P124L change alters Arc1 structure or function. Given Arc1 assembles into higher order viral-like capsids, this change could alter capsid formation or structure. Indeed, P124 lies in the hinge region separating the N- and C-terminal capsid assembly regions (panel B) and we hypothesize this change will alter the ability of Arc1 capsids to assemble properly. We plan to experimentally test this by rescuing Arc1 null mutants with edited versus unedited transgenes to see how the previously reported synaptic phenotypes are modified. We also plan to examine the ability of the change to alter Arc1 capsid assembly in a collaboration using cryo-EM.

      Author response image 1.

      A. AlphaFold predictions of Drosophila Arc1 and Frq1 with edit site noted. B. Structure of the Drosophila Arc1 capsid. Monomeric Arc1 conformation within the capsid is shown on the right with the location of the edit site indicated.

      The other non-canonical edit (G-to-A) that stood out was in Frequenin 1 (Frq1), a multi-EF hand containing Ca<sup>2+</sup> binding protein that regulates synaptic transmission, that resulted in a G2E amino acid substitution (location within Frq1 shown in panel A above). This glycine residue is conserved in all Frq homologs and is the site of N-myristoylation, a co-translational lipid modification to the glycine after removal of the initiator methionine by an aminopeptidase. Myristoylation tethers Frq proteins to the plasma membrane, with a Ca<sup>2+</sup>-myristoyl switch allowing some family members to cycle on and off membranes when the lipid domain is sequestered in the absence of Ca<sup>2+</sup>. Although the G2E edit is found at lower levels (20% in Ib MNs and 18% in Is MNs), it could create a pool of soluble Frq1 that alters its signaling. We plan to functionally assay the significance of this non-canonical edit as well. Compared to edits that alter amino acid sequence, determining how non-canonical editing of UTRs might regulate mRNA dynamics is a harder question at this stage and will require more experimental follow-up.

      (2) The reviewer noted the last section of the results might be better split into multiple parts as it reads as a long combination of two thoughts.

      We agree with the reviewer that the last section is important, but it was disconnected a bit from the main story and was difficult for us to know exactly where to put it. All the data to that point in the paper was collected from our own PatchSeq analysis from individual larval motoneurons. We wanted to compare these results to other large RNAseq datasets obtained from pooled neuronal populations and felt it was best to include this at the end of the results section, as it no longer related to the rules of RNA editing within single neurons. We used these datasets to confirm many of our edits, as well as find evidence for some developmental and neuron-specific cell type edits. We also took advantage of RNAseq from neuronal datasets with altered activity to explore how activity might alter the editing machinery. We felt it better to include that data in this final section given it was not collected from our original PatchSeq approach.

      Reviewer #2:

      Noted the study provided a unique opportunity to identify RNA editing sites and rates specific to individual motoneuron subtypes, highlighting the RNAseq data was robustly analyzed and high-confidence hits were identified and compared to other RNAseq datasets. The reviewer provided some suggestions for future experiments and requested a few clarifications.

      (1) The reviewer asked about Figure 1F and the average editing rate per site described later in the paper.

      Indeed, Figure 1F shows the average editing rate for each individual gene for all the Ib and Is cells, so we primarily use that to highlight the variability we find in overall editing rate from around 20% for some sites to 100% for others. The actual editing rate for each site for individual neurons is shown in Figure 4D that plots the rate for every edit site and the overall sum rate for that neuron in particular.
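
The distinction between a per-neuron editing rate and an average rate across neurons can be sketched as follows. This is a generic illustration rather than the pipeline used in the study: the read counts are hypothetical, and the choice to skip cells with no coverage at a site is an assumption.

```python
def editing_rate(edited_reads, total_reads):
    """Fraction of reads carrying the edit at one site in one cell.

    Returns None when the site has no coverage in that cell.
    """
    if total_reads == 0:
        return None
    return edited_reads / total_reads


def mean_site_rate(per_cell_counts):
    """Average editing rate for one site across cells with coverage."""
    rates = [
        rate
        for rate in (editing_rate(edited, total) for edited, total in per_cell_counts)
        if rate is not None
    ]
    return sum(rates) / len(rates) if rates else None


# Hypothetical (edited, total) read counts for one site in four neurons.
counts = [(10, 20), (5, 20), (0, 0), (15, 20)]

print([editing_rate(edited, total) for edited, total in counts])
print(mean_site_rate(counts))
```

The per-cell values correspond to the kind of quantity plotted per neuron, while the mean summarises one site across the sampled population.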

      (2) The reviewer also noted that it was unclear where in the VNC the individual motoneurons were located and how that might affect editing.

      The precise segment of the larvae for every individual neuron that was sampled by Patch-seq was recorded and that data is accessible in the original Jetti et al 2023 paper if the reader wants to explore any potential anterior to posterior differences in RNA editing. Due to the technical difficulty of the Patch-seq approach, we pooled all the Ib and Is neurons from each segment together to get more statistical power to identify edit sites. We don’t believe segmental identity would be a major regulator of RNA editing, but cannot rule it out.

      (3) The reviewer also wondered if including RNAs located both in the nucleus and cytoplasm would influence editing rate.

      Given our Patch-seq approach requires us to extract both the cytoplasm and nucleus, we would be sampling both nuclear and cytoplasmic mRNAs. However, as shown in Figure 8 – figure supplement 3 D-F, the vast majority of our edits are found in both polyA mRNA samples and nascent nuclear mRNA samples from other datasets, indicating the editing is occurring co-transcriptionally and within the nucleus. As such, we don't think the inclusion of cytoplasmic mRNA is altering our measured editing rates for most sites. This may not be true for all non-canonical edits, as we did see some differences there, indicating some non-canonical editing may be happening in the cytoplasm as well.

      Reviewer #3:

      Indicated the work provided a valuable resource to access RNA editing in single neurons. The reviewer suggested the value of future experiments to demonstrate the effects of editing events on neuronal function. This will be a major effort for us going forwards, as we indeed have already begun to test the role of editing in mRNAs encoding several presynaptic proteins that regulate synaptic transmission. The reviewer also had several other comments as discussed below.

      (1) The reviewer noted that silent mutations could alter codon usage that would result in translational stalling and altered protein production.

      This is an excellent point, as silent mutations in the coding region could have a more significant impact if they generate non-preferred rare codons. This is not something we have analyzed, but it certainly is worth considering in future experiments. Our initial efforts are on testing the edits that cause predictive changes in presynaptic proteins based on the amino acid change and their locale in important functional domains, but it is worth considering the silent edits as well as we think about the larger picture of how RNA editing is likely to impact not only protein function but also protein levels.

      (2) The reviewer noted future studies could be done using tools like Alphafold to test if the amino acid changes are predicted to alter the structure of proteins with coding edits.

      This is an interesting approach, though we don’t have much expertise in protein modeling at that level. We could consider adding this to future studies in collaboration with other modeling labs.

      (3) The reviewer wondered if the negative correlation between edits and transcript abundance could indicate edits might be destabilizing the transcripts.

      This is an interesting idea, but would need to be experimentally tested. For the few edits we have generated already to begin functionally testing, including our published work with editing in the C-terminus of Complexin, we haven’t seen a change in mRNA levels caused by these edits. However, it would not be surprising to see some edits reducing transcript levels. A set of 5’UTR edits we have generated in Syx1A seem to be reducing protein production and may be acting in such a manner.

      (4) The reviewer wondered if the proportion of edits we report in many of the figures is normalized to the length of the transcript, as longer transcripts might have more edits by chance.

      The figures referenced by the reviewer (1, 2 and 7) show the number of high-confidence editing sites that fall into the 5’ UTR, 3’ UTR, or CDS categories. Our intention here was to highlight that the majority of the high confidence edits that made it through the stringent filtering process were in the coding region. This would still be true if we normalized to the length of the given gene region. However, it would be interesting to know if these proportions match the expected proportions of edits in these gene regions given a random editing rate per gene region length across the Drosophila genome, although we did not do this analysis.    
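
The length normalisation the reviewer raised amounts to converting raw edit counts into a density per unit length of each region. The sketch below uses made-up counts and region lengths purely to show the calculation; it does not reproduce any number from the study.

```python
def edits_per_kb(edit_counts, region_lengths_bp):
    """Edit-site density (sites per kilobase) for each gene region."""
    return {
        region: 1000 * edit_counts[region] / region_lengths_bp[region]
        for region in edit_counts
    }


# Hypothetical counts of edit sites and total region lengths in base pairs.
counts = {"5'UTR": 10, "CDS": 150, "3'UTR": 40}
lengths_bp = {"5'UTR": 2000, "CDS": 15000, "3'UTR": 8000}

for region, density in edits_per_kb(counts, lengths_bp).items():
    print(f"{region}: {density:.1f} sites/kb")
```

Comparing these densities with the value expected from a uniform editing rate per base would be one way to run the analysis that the response notes was not performed.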

      (5) The reviewer noted that future studies could expand on the work to examine miRNA or other known RBP binding sites that might be altered by the edits.

      This is another avenue we could pursue in the future. We did do this analysis for a few of the important genes encoding presynaptic proteins (these are the most interesting to us given the lab’s interest in the synaptic vesicle fusion machinery), but did not find anything obvious for this smaller subset of targets.

      (6) The reviewer suggested sequence context for Adar could also be investigated for the hits we identified.

      We haven’t pursued this avenue yet, but it would be of interest to do in the future. In a similar vein, it would be informative to identify intron-exon base pairing that could generate the dsRNA template on which ADAR acts.

      (7) The reviewer noted the disconnect between Adar mRNA levels and overall editing levels reported in Figure 4A/B.

      Indeed, the lack of correlation between overall editing levels and Adar mRNA abundance has been noted previously in many studies. For the type of single cell Patch-seq approach we took to generate our RNAseq libraries, the absolute amount of less abundant transcripts obtained from a single neuron can be very noisy. As such, the few neurons with no detectable Adar mRNA are likely to represent that single neuron noise in the sampling. Per the reviewer’s question, these figure panels only show A-to-I edits, so they are specific to ADAR.

      (8) The reviewer notes the scale in Figure 5D can make it hard to visualize the actual impact of the changes.

      The intention of Figure 5D was to address the question of whether sites with high Ib/Is editing differences were simply due to higher Ib or Is mRNA expression levels. If this was the case, then we would expect to see highly edited sites have large Ib/Is TPM differences. Instead, as the figure shows, the vast majority of highly-edited sites were in mRNAs that were NOT significantly different between Ib and Is (red dots in graph) and are therefore clustered together near “0 Difference in TPMs”. TPMs and editing levels for all edit sites can be found in Table 1, and a visualization of these data for selected sites is shown in Figure 5E.

    1. Author response:

      The following is the authors’ response to the current reviews.

      Reviewer #1 (Public review):

      In this manuscript, Hoon Cho et al. present a novel investigation into the role of PexRAP, an intermediary in ether lipid biosynthesis, in B cell function, particularly during the Germinal Center (GC) reaction. The authors profile lipid composition in activated B cells both in vitro and in vivo, revealing the significance of PexRAP. Using a combination of animal models and imaging mass spectrometry, they demonstrate that PexRAP is specifically required in B cells. They further establish that its activity is critical upon antigen encounter, shaping B cell survival during the GC reaction. Mechanistically, they show that ether lipid synthesis is necessary to modulate reactive oxygen species (ROS) levels and prevent membrane peroxidation.

      Highlights of the Manuscript:

      The authors perform exhaustive imaging mass spectrometry (IMS) analyses of B cells, including GC B cells, to explore ether lipid metabolism during the humoral response. This approach is particularly noteworthy given the challenge of limited cell availability in GC reactions, which often hampers metabolomic studies. IMS proves to be a valuable tool in overcoming this limitation, allowing detailed exploration of GC metabolism.

      The data presented is highly relevant, especially in light of recent studies suggesting a pivotal role for lipid metabolism in GC B cells. While these studies primarily focus on mitochondrial function, this manuscript uniquely investigates peroxisomes, which are linked to mitochondria and contribute to fatty acid oxidation (FAO). By extending the study of lipid metabolism beyond mitochondria to include peroxisomes, the authors add a critical dimension to our understanding of B cell biology.

      Additionally, the metabolic plasticity of B cells poses challenges for studying metabolism, as genetic deletions from the beginning of B cell development often result in compensatory adaptations. To address this, the authors employ an acute loss-of-function approach using two conditional, cell-type-specific gene inactivation mouse models: one targeting B cells after the establishment of a pre-immune B cell population (Dhrs7b^f/f, huCD20-CreERT2) and the other during the GC reaction (Dhrs7b^f/f; S1pr2-CreERT2). This strategy is elegant and well-suited to studying the role of metabolism in B cell activation.

      Overall, this manuscript is a significant contribution to the field, providing robust evidence for the fundamental role of lipid metabolism during the GC reaction and unveiling a novel function for peroxisomes in B cells. 

      Comments on revisions:

      There are still some discrepancies in gating strategies. In Fig. 7B legend (lines 1082-1083), they show representative flow plots of GL7+ CD95+ GC B cells among viable B cells, so it is not clear if they are IgDneg, as the rest of the GC B cells aforementioned in the text.

      We apologize for missing this item in need of correction in the revision and sincerely thank the reviewer for the stamina and care in picking this up. The data shown in Fig. 7B represented cells (events) in the IgD<sup>neg</sup> Dump<sup>neg</sup> viable lymphoid gate. We will correct this omission/blemish in the final revision that becomes the version of record.

      Western blot confirmation: We understand the limitations the authors enumerate. Perhaps an RT-qPCR analysis of the Dhrs7b gene in sorted GC B cells from the S1PR2-CreERT2 model could be feasible, as it requires a smaller number of cells. In any case, we agree with the authors that the results obtained using the huCD20-CreERT2 model are consistent with those from the S1PR2-CreERT2 model, which adds credibility to the findings and supports the conclusion that GC B cells in the S1PR2-CreERT2 model are indeed deficient in PexRAP.

      We will make efforts to go back through the manuscript and highlight this limitation to readers, i.e., that we were unable to get genetic evidence to assess what degree of "counter-selection" applied to GC B cells in our experiments.

      We agree with the referee that, to support the Imaging Mass Spectrometry (IMS) data showing perturbations of various ether lipids within GC after depletion of PexRAP, it would have been best to have an RT-qPCR assay that allowed quantitation of the Dhrs7b-encoded mRNA in flow-purified GC B cells, or a measure of the extent to which the genomic DNA of these cells was in the deleted rather than the 'floxed' configuration.

      While the short half-life of ether lipid species leads us to infer that the enzymatic function remains reduced/absent, it definitely is unsatisfying that the money for experiments ran out in June and the lab members had to move to new jobs.

      Lines 222-226: We believe the correct figure is 4B, whereas the text refers to 4C.

      As for the 1st item, we apologize and will correct this error.

      Supplementary Figure 1 (line 1147): The figure title suggests that the data on T-cell numbers are from mice in a steady state. However, the legend indicates that the mice were immunized, which means the data are not from steady-state conditions. 

      We will change the wording both on line 1147 and 1152.

      Reviewer #2 (Public review):

      Summary:

      In this study, Cho et al. investigate the role of ether lipid biosynthesis in B cell biology, particularly focusing on GC B cell, by inducible deletion of PexRAP, an enzyme responsible for the synthesis of ether lipids.

      Strengths:

      Overall, the data are well-presented, the paper is well-written and provides valuable mechanistic insights into the importance of PexRAP enzyme in GC B cell proliferation.

      Weaknesses:

      More detailed mechanisms of the impaired GC B cell proliferation by PexRAP deficiency remain to be further investigated. In minor part, there are issues with the interpretation of the data which might cause confusion for readers.

      Comments on revisions:

      The authors improved the manuscript appropriately according to my comments.

      To re-summarize, we very much appreciate the diligence of the referees and Editors in re-reviewing this work at each cycle and helping via constructive peer review, along with their favorable comments and overall assessments. The final points will be addressed with minor edits since there no longer is any money for further work and the lab people have moved on.


      The following is the authors’ response to the original reviews.

      Reviewer #1 (Public review):

      In this manuscript, Sung Hoon Cho et al. present a novel investigation into the role of PexRAP, an intermediary in ether lipid biosynthesis, in B cell function, particularly during the Germinal Center (GC) reaction. The authors profile lipid composition in activated B cells both in vitro and in vivo, revealing the significance of PexRAP. Using a combination of animal models and imaging mass spectrometry, they demonstrate that PexRAP is specifically required in B cells. They further establish that its activity is critical upon antigen encounter, shaping B cell survival during the GC reaction.

      Mechanistically, they show that ether lipid synthesis is necessary to modulate reactive oxygen species (ROS) levels and prevent membrane peroxidation.

      Highlights of the Manuscript:

      The authors perform exhaustive imaging mass spectrometry (IMS) analyses of B cells, including GC B cells, to explore ether lipid metabolism during the humoral response. This approach is particularly noteworthy given the challenge of limited cell availability in GC reactions, which often hampers metabolomic studies. IMS proves to be a valuable tool in overcoming this limitation, allowing detailed exploration of GC metabolism.

      The data presented is highly relevant, especially in light of recent studies suggesting a pivotal role for lipid metabolism in GC B cells. While these studies primarily focus on mitochondrial function, this manuscript uniquely investigates peroxisomes, which are linked to mitochondria and contribute to fatty acid oxidation (FAO). By extending the study of lipid metabolism beyond mitochondria to include peroxisomes, the authors add a critical dimension to our understanding of B cell biology.

      Additionally, the metabolic plasticity of B cells poses challenges for studying metabolism, as genetic deletions from the beginning of B cell development often result in compensatory adaptations. To address this, the authors employ an acute loss-of-function approach using two conditional, cell-type-specific gene inactivation mouse models: one targeting B cells after the establishment of a pre-immune B cell population (Dhrs7b^f/f, huCD20-CreERT2) and the other during the GC reaction (Dhrs7b^f/f; S1pr2-CreERT2). This strategy is elegant and well-suited to studying the role of metabolism in B cell activation.

      Overall, this manuscript is a significant contribution to the field, providing robust evidence for the fundamental role of lipid metabolism during the GC reaction and unveiling a novel function for peroxisomes in B cells.

      We appreciate these positive reactions and response, and agree with the overview and summary of the paper's approaches and strengths.

      However, several major points need to be addressed:

      Major Comments:

      Figures 1 and 2

      The authors conclude, based on the results from these two figures, that PexRAP promotes the homeostatic maintenance and proliferation of B cells. In this section, the authors first use a tamoxifen-inducible full Dhrs7b knockout (KO) and afterwards Dhrs7bΔ/Δ-B model to specifically characterize the role of this molecule in B cells. They characterize the B and T cell compartments using flow cytometry (FACS) and examine the establishment of the GC reaction using FACS and immunofluorescence. They conclude that B cell numbers are reduced, and the GC reaction is defective upon stimulation, showing a reduction in the total percentage of GC cells, particularly in the light zone (LZ).

      The analysis of the steady-state B cell compartment should also be improved. This includes a more detailed characterization of MZ and B1 populations, given the role of lipid metabolism and lipid peroxidation in these subtypes.

      Suggestions for Improvement:

      B Cell compartment characterization: A deeper characterization of the B cell compartment in non-immunized mice is needed, including analysis of Marginal Zone (MZ) maturation and a more detailed examination of the B1 compartment. This is especially important given the role of specific lipid metabolism in these cell types. The phenotyping of the B cell compartment should also include an analysis of immunoglobulin levels on the membrane, considering the impact of lipids on membrane composition.

      Although the manuscript is focused on post-ontogenic B cell regulation in Ab responses, we believe we will be able to polish a revised manuscript through addition of results of analyses suggested by this point in the review: measurement of surface IgM on and phenotyping of various B cell subsets, including MZB and B1 B cells, to extend the data in Supplemental Fig 1H and I. Depending on the level of support, new immunization experiments to score Tfh and analyze a few of their functional molecules as part of a B cell paper may be feasible.   

      Addendum / update of Sept 2025: We added new data with more on MZB and B1 B cells, surface IgM, and on Tfh populations. 

      GC Response Analysis Upon Immunization: The GC response characterization should include additional data on the T cell compartment, specifically the presence and function of Tfh cells. In Fig. 1H, the distribution of the LZ appears strikingly different. However, the authors have not addressed this in the text. A more thorough characterization of centroblasts and centrocytes using CXCR4 and CD86 markers is needed.

      The gating strategy used to characterize GC cells (GL7+CD95+ in IgD− cells) is suboptimal. A more robust analysis of GC cells should be performed in total B220+CD138− cells.

      We first want to apologize for the mislabeling of LZ and DZ in Fig 1H. The greenish-yellow colored region (GL7<sup>+</sup> CD35<sup>+</sup>) indicates the DZ and the cyan-colored region (GL7<sup>+</sup> CD35<sup>+</sup>) indicates the LZ.

      Addendum / update of Sept 2025: We corrected the mistake, and added new experimental data using the CD138 marker to exclude preplasmablasts.

      As a technical note, we experienced high background noise with GL7 staining uniquely with PexRAP-deficient (Dhrs7b<sup>f/f</sup>; Rosa26-CreER<sup>T2</sup>) mice (i.e., not WT control mice). The high background noise of GL7 staining was not observed in the B cell-specific KO of PexRAP (Dhrs7b<sup>f/f</sup>; huCD20-CreER<sup>T2</sup>). Two formal possibilities to account for this staining issue would be if either the expression of the GL7 epitope were repressed by PexRAP or the proper positioning of GL7<sup>+</sup> cells in the germinal center region were defective in PexRAP-deficient mice (e.g., due to an effect on positioning cues from cell types other than B cells). In a revised manuscript, we will fix the labeling error and further discuss the GL7 issue, while taking care not to be thought to conclude that there is a positioning problem or derepression of GL7 (an activation antigen on T cells as well as B cells).

      While the gating strategy for an overall population of GC B cells is fairly standard even in the current literature, the question about using CD138 staining to exclude early plasmablasts (i.e., analyze B220<sup>+</sup> CD138<sup>neg</sup> vs B220<sup>+</sup> CD138<sup>+</sup>) is interesting. In addition, some papers like to use GL7<sup>+</sup> CD38<sup>neg</sup> for GC B cells instead of GL7<sup>+</sup> Fas (CD95)<sup>+</sup>, and we thank the reviewer for suggesting the analysis of centroblasts and centrocytes. For the revision, we will try to secure resources to revisit the immunizations and analyze them for these other facets of GC B cells (including CXCR4/CD86) and for their GL7<sup>+</sup> CD38<sup>neg</sup>, B220<sup>+</sup> CD138<sup>-</sup> and B220<sup>+</sup> CD138<sup>+</sup> cell populations.

      We agree that comparison of the Rosa26-CreERT2 results to those with B cell-specific loss-of-function raises a tantalizing possibility that Tfh cells also are influenced by PexRAP. Although the manuscript is focused on post-ontogenic B cell regulation in Ab responses, we hope to add new immunization experiments that score Tfh and analyze a few of their functional molecules to this B cell paper, depending on the ability to wheedle enough support / fiscal resources.

      Addendum / update of Sept 2025: Within the tight time until lab closure, and limited $$, we were able to do experiments that further reinforced the GC B cell data - including stains for DZ vs LZ sub-subsetting - and analyzed Tfh cells. We were not able to explore changes in functional antigenic markers on the GC B or Tfh cells. 

      The authors claim that Dhrs7b supports the homeostatic maintenance of quiescent B cells in vivo and promotes effective proliferation. This conclusion is primarily based on experiments where CTV-labeled PexRAP-deficient B cells were adoptively transferred into μMT mice (Fig. 2D-F). However, we recommend reviewing the flow plots of CTV in Fig. 2E, as they appear out of scale. More importantly, the low recovery of PexRAP-deficient B cells post-adoptive transfer weakens the robustness of the results and is insufficient to conclusively support the role of PexRAP in B cell proliferation in vivo.

      In the revision, we will edit the text and try to adjust the digitized cytometry data to allow more dynamic range to the right side of the upper panels in Fig. 2E, and otherwise to improve the presentation of the in vivo CTV result. However, we feel impelled to push back respectfully on some of the concern raised here. First, it seems to gloss over the presentation of multiple facets of evidence. The conclusion about maintenance derives primarily from Fig. 2C, which shows a rapid, statistically significant decrease in B cell numbers (extending the finding of Fig. 1D, a more substantial decrease after a bit longer a period). As noted in the text, the rate of de novo B cell production does not suffice to explain the magnitude of the decrease. 

      In terms of proliferation, we will improve presentation of the Methods but the bottom line is that the recovery efficiency is not bad (comparing to prior published work) inasmuch as transferred B cells do not uniformly home to spleen. In a setting where BAFF is in ample supply in vivo, we transferred equal numbers of cells that were equally labeled with CTV and counted B cells. Although the CTV result might be affected by the lower recovery of B cells with PexRAP deficiency, the frequencies of the CTV<sup>low</sup> divided population generally are not changed very much. However, it is precisely because of the pitfalls of in vivo analyses that we included complementary data with survival and proliferation in vitro. The proliferation was attenuated in PexRAP-deficient B cells in vitro; this evidence supports the conclusion that proliferation of PexRAP knockout B cells is reduced. It is likely that PexRAP-deficient B cells also have a defect in viability in vivo, as we observed a reduced B cell number in PexRAP-deficient mice. As the reviewer noticed, the presence of a defect in cycling does, in the transfer experiments, limit the ability to interpret a lower yield of B cell population after adoptive transfer into µMT recipient mice as evidence pertaining to death rates. We will edit the text of the revision with these points in mind.

      In vitro stimulation experiments: These experiments need improvement. The authors have used anti-CD40 and BAFF for B cell stimulation; however, it would be beneficial to also include anti-IgM in the stimulation cocktail. In Fig. 2G, CTV plots do not show clear defects in proliferation, yet the authors quantify the percentage of cells with more than three divisions. These plots should clearly display the gating strategy. Additionally, details about histogram normalization and potential defects in cell numbers are missing. A more in-depth analysis of apoptosis is also required to determine whether the observed defects are due to impaired proliferation or reduced survival.

      As suggested by reviewer, testing additional forms of B cell activation can help explore the generality (or lack thereof) of findings. We plan to test anti-IgM stimulation together with anti-CD40 + BAFF as well as anti-IgM + TLR7/8, and add the data to a revised and final manuscript. 

      Addendum / update of Sept 2025: The revision includes results of new experiments in which anti-IgM was included in the stimulation cocktail, as well as further data on apoptosis and distinguishing impaired cycling / divisions from reduced survival.

      With regards to Fig. 2G (and 2H), in the revised manuscript we will refine the presentation (add a demonstration of the gating, and explicate histogram normalization of FlowJo). 

      It is an interesting issue in bioscience, but in our presentation 'representative data' really are pretty representative, so a senior author is reminded of a comment Tak Mak made about a reduction (of proliferation, if memory serves) to 0.7 x control. [His point in a comment to referees at a symposium related that to a salary reduction by 30% :) A mathematical alternative is to point out that across four rounds of division for WT cells, a reduction to  0.7x efficiency at each cycle means about 1/4 as many progeny.] 
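
      The arithmetic in the bracketed aside can be checked with a minimal sketch. It assumes one reading of "0.7x efficiency at each cycle": control cells double in every round, while the affected cells yield 0.7 times that doubling in every round. The function name `relative_progeny` is hypothetical and chosen only for illustration; the 0.7 and four-round figures are the ones quoted in the response, not fitted values.

```python
def relative_progeny(efficiency: float, rounds: int) -> float:
    """Progeny of a population with reduced per-round yield, relative to control.

    Assumption: control cells double each round (factor 2), and the affected
    cells yield efficiency * 2 per round, so the ratio after n rounds
    reduces to efficiency ** n.
    """
    control = 2.0 ** rounds                  # e.g. 16-fold after four rounds
    affected = (2.0 * efficiency) ** rounds  # e.g. 1.4 ** 4 after four rounds
    return affected / control


if __name__ == "__main__":
    ratio = relative_progeny(0.7, 4)
    # 0.7 ** 4 is roughly 0.24, i.e. about one quarter as many progeny.
    print(f"relative progeny after 4 rounds: {ratio:.2f}")
```

      Under this assumption a seemingly modest per-cycle reduction compounds to roughly a four-fold difference in cell yield, which is the scale of difference discussed for the culture data.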

      We will try to edit the revision (Methods, Legends, Results, Discussion) to address better the points of the last two sentences of the comment, and improve the details that could assist in replication or comparisons (e.g., if someone develops a PexRAP inhibitor as potential therapeutic).

      For the present, please note that the cell numbers at the end of the cultures are currently shown in Fig 2, panel I. Analogous culture results are shown in Fig 8, panels I, J, albeit with harvesting at day 5 instead of day 4. So, a difference of ≥ 3x needs to be explained. As noted above, a division efficiency reduced to 0.7x normal might account for such a decrease, but in practice the data of Fig. 2I show that the number of PexRAP-deficient B cells at day 4 is similar to the number plated before activation, and yet there has been a reasonable number of divisions. So cell numbers in the culture of mutant B cells are constant because cycling is active but decreased and insufficient to allow increased numbers ("proliferation" in the true sense) as programmed death is increased. In line with this evidence, Fig 8G-H document higher death rates [i.e., frequencies of cleaved caspase3<sup>+</sup> cells and Annexin V<sup>+</sup> cells] of PexRAP-deficient B cells compared to controls. Thus, the in vitro data lead to the conclusion that both decreased division rates and increased death operate after this form of stimulation.

      An inference is that this is the case in vivo as well - note that recoveries differed by ~3x (Fig. 2D), and the decrease in divisions (presentation of which will be improved) was meaningful but of lesser magnitude (Fig. 2E, F). 

      Reviewer #2 (Public review):

      Summary:

      In this study, Cho et al. investigate the role of ether lipid biosynthesis in B cell biology, particularly focusing on GC B cell, by inducible deletion of PexRAP, an enzyme responsible for the synthesis of ether lipids.

      Strengths:

      Overall, the data are well-presented, the paper is well-written and provides valuable mechanistic insights into the importance of PexRAP enzyme in GC B cell proliferation.

      We appreciate this positive response and agree with the overview and summary of the paper's approaches and strengths. 

      Weaknesses:

      More detailed mechanisms of the impaired GC B cell proliferation by PexRAP deficiency remain to be further investigated. In the minor part, there are issues with the interpretation of the data which might cause confusion for the readers.

      Issues about contributions of cell cycling and divisions on the one hand, and susceptibility to death on the other, were discussed above, amplifying on the current manuscript text. The aggregate data support a model in which both processes are impacted for mature B cells in general, and mechanistically the evidence and work focus on the increased ROS and modes of death. Although the data in Fig. 7 do provide evidence that GC B cells themselves are affected, we agree that resource limitations had militated against developing further evidence about cycling specifically for GC B cells. We hope to be able to obtain sufficient data from some specific analysis of proliferation in vivo (e.g., Ki67 or BrdU) as well as ROS and death ex vivo when harvesting new samples from mice immunized to analyze GC B cells for CXCR4/CD86, CD38, CD138 as indicated by Reviewer 1. As suggested by Reviewer 2, we will further discuss the possible mechanism(s) by which proliferation of PexRAP-deficient B cells is impaired. We also will edit the text of a revision where needed to enhance clarity of data interpretation - at a minimum, to be very clear that caution is warranted in assuming that GC B cells will exhibit the same mechanisms as cultured, in vitro-stimulated B cells.

      Addendum / update of Sept 2025: We were able to obtain results of intravital BrdU incorporation into GC B cells to measure cell cycling rates. The revised manuscript includes these results as well as other new data on apoptosis / survival, while deleting the data about CD138 populations whose interpretation was reasonably questioned by the referees.  

      Reviewer #1 (Recommendations for the authors):

      We believe the evidence presented to support the role of PexRAP in protecting B cells from cell death and promoting B cell proliferation is not sufficiently robust and requires further validation in vivo. While the study demonstrates an increase in ether lipid content within the GC compartment, it also highlights a reduction in mature B cells in PexRAP-deficient mice under steady-state conditions. However, the IMS results (Fig. 3A) indicate that there are no significant differences in ether lipid content in the naïve B cell population. This discrepancy raises an intriguing point for discussion: why is PexRAP critical for B cell survival under steady-state conditions?

      We thank the referee for all their care and input, and we agree that further intravital analyses could strengthen the work by providing more direct evidence of impairment of GC B cells in vivo. To revise and improve this manuscript before creation of a contribution of record, we performed new experiments to the limit of available funds and have both (i) added these new data and (ii) sharpened the presentation to correct what we believe to be one inaccurate point raised in the review. 

      (A) Specifically, we immunized mice with a B cell-specific depletion of PexRAP (Dhrs7b<sup>D/D-B</sup> mice) and measured a variety of readouts of the GC B cells' physiology in vivo: proliferation by intravital incorporation of BrdU, ROS in the viable GC B cell gate, and their cell death by annexin V staining directly ex vivo. Consistent with the data with in vitro activated B cells, these analyses showed increased ROS (new - Fig. 7D) and higher frequencies of Annexin V<sup>+</sup> 7AAD<sup>+</sup> in GC B cells (GL7<sup>+</sup> CD38<sup>-</sup> B cell-gate) of immunized Dhrs7b<sup>D/D-B</sup> mice compared with WT controls (huCD20-CreERT2<sup>+/-</sup>, Dhrs7b<sup>+/+</sup>)  (new - Fig. 7E). Collectively, these results indicate that PexRAP aids (directly or indirectly) in controlling ROS in GC B cells and reduces B cell death, likely contributing to the substantially decreased overall GC B cell population. These new data are added to the revised manuscript in Figure 7.  

      Moreover, in each of two independent experiments (each comprising 3 vs 3 immunized mice), BrdU<sup>+</sup> events among GL7<sup>+</sup> CD38<sup>-</sup> (GC B cell)-gated cells were reduced in the B cell-specific PexRAP knockouts compared with WT controls (new, Fig. 7F and Supplemental Fig 6E). This result on cell cycle rates in vivo is presented with caution in the revised manuscript text because the absolute labeling fractions were somewhat different in Expt 1 vs Expt 2. This situation affords a useful opportunity to comment on the culture of "P values" and statistical methods. It is intriguing to consider how many successful drugs are based on research published back when the standard was to interpret a result of this sort more definitively despite a merged "P value" that was not a full 2 SD different from the mean. In the optimistic spirit of the eLife model, it can be for the attentive reader to decide from the data (new, Fig. 7F and Supplemental Fig 6E) whether to interpret the BrdU results more strongly than what we state in the revised text.

      (B) On the issue of whether or not the loss of PexRAP led to perturbations of the lipidome of B cells prior to activation, we have edited the manuscript to do a better job making this point more clear.  

      We point out to readers that in the resting, pre-activation state abnormalities were detected in naive B cells, not just in activated and GC B cells. In brief, the IMS analysis and LC-MS-MS analysis detected statistically significant differences in some, but not all, the ether phospholipids species in PexRAP deficient cells (some of which was in Supplemental Figure 2 of the original version). 

      With this appropriate and helpful concern having been raised, we realize that this important point merited inclusion in the main figures. We point specifically to a set of phosphatidyl choline ions shown in Fig. 3 (revised - panels A, B, D) of the revised manuscript (PC O-36:5; PC O-38:5; PC O-40:6 and -40:7). 

      For this ancillary record (because it is a discourse on the limitations of each analysis), we will note issues such as the presence of many non-B cells in each pixel of the IMS analyses (so that some or many "true positives" will fail to achieve a "significant difference") and, for the naive B cells, differential rates of synthesis, turnover, and conversion (e.g., addition of another 2-carbon unit or saturation / desaturation of one side-chain). To the extent the concern reflects some surprise and perhaps skepticism about what seem relatively limited differences (many species appear unaffected, etc.), we share in the sentiment. But the basic observation is that there are differences, and there is a reasonable connection between the altered lipid profile and evidence of effects on survival or proliferation (i.e., integration of survival and cell cycling / division).

      Additionally, it would be valuable to evaluate the humoral response in a T-independent setting. This would clarify whether the role of PexRAP is restricted to GC B cells or extends to activated B cells in general. 

      We agree that this additional set of experiments would be nice and would extend work incrementally by testing the generality of the findings about Ab responses. The practical problem is that money and time ran out while testing important items that strengthen the evidence about GC B cells. 

      Finally, the manuscript would benefit from a thorough revision to improve its readability and clarity. Including more detailed descriptions of technical aspects, such as the specific stimuli and time points used in analyses, would greatly enhance the flow and comprehension of the study. Furthermore, the authors should review figure labeling to ensure consistency throughout the manuscript, and carefully cite the relevant references. For instance, the S1PR2-CreERT2 mouse was established by Okada and Kurosaki (Shinnakasu et al., Nat. Immunol., 2016).

      We appreciate this feedback and comment, inasmuch as both the clarity and scholarship matter greatly to us for a final item of record. For the revision, we have given our best shot to editing the text in the hopes of improved clarity, reduction of discrepancies (helpfully noted in the Minor Comments), and further detail-rich descriptions of procedures. We also edited the figure labeling for better consistency. While we note that the appropriate citation of Shinnakasu et al (2016) was ref. #69 of the original and remains as a citation, we have rechecked other referencing and tried to use citations with the most relevant references.

      Minor Comments: The labeling of plots in Fig. 2 should be standardized. For example, in Fig. 2C, D, and G, the same mouse strain is used, yet the Cre+ mouse is labeled differently in each plot. 

      We agree and have tried to tighten up these features in the panels noted as well as more generally (e.g., Fig. 4, 5, 6, 7, 9; consistency of huCD20-CreERT2 / hCD20CreERT2).

      According to the text, the results shown in Fig. 1G and H correspond to a full KO  (Dhrs7b^f/f; Rosa26-CreERT2 mice). However, Fig. 1H indicates that the bottom image corresponds to Dhrs7b^f/f, huCD20-CreERT2 mice (Dhrs7bΔ/Δ -B). 

      We have corrected Fig. 1H to be labeled as Dhrs7b<sup>Δ/Δ</sup> (with the data on Dhrs7b<sup>Δ/Δ-B</sup> presented in Supplemental Figure 4A, which is correctly labeled). Thank you for picking up this error that crept in while using copy/paste in preparation of figure panels and failing to edit out the "-B"!  

      Similarly, the gating strategy for GC cells in the text mentions IgD− cells, while the figure legend refers to total viable B cells. These discrepancies need clarification.

      We believe we located and have corrected this issue in the revised manuscript.   

      Figures 3 and 4. The authors claim that B cell expression of PexRAP is required to  achieve normal concentrations of ether phospholipids. 

      Suggestions for Improvement: 

      Lipid Metabolism Analysis: The analysis in Fig. 3 is generally convincing but could be strengthened by including an additional stimulation condition such as anti-IgM plus anti-CD40. In Fig. 4C, the authors display results from the full KO model. It would be helpful to include quantitative graphs summarizing the parameters displayed in the images.

      We have performed new experiments (anti-IgM + anti-CD40) and added the data to the revised manuscript (new - Supplemental Fig. 2H and Supplemental Fig 6, D & F). Conclusions based on the effects are not changed from the original. 

      As a semantic comment and point of scientific process, any interpretation ("claim") can - by definition - only be taken to apply to the conditions of the experiment. Nonetheless, it is inescapable that at least for some ether P-lipids of naive, resting B cells, and for substantially more in B cells activated under the conditions that we outline, B cell expression of PexRAP is required. 

      With regards to the constructive suggestion about a new series of lipidomic analyses, we agree that for activated B cells it would be nice and increase insight into the spectrum of conditions under which the PexRAP-deficient B cells had altered content of ether phospholipids. However, in light of the costs of metabolomic analyses and the lack of funds to support further experiments, and the accuracy of the point as stated, we prioritized the experiments that could fit within the severely limited budget. 

      [One can add that our results provide a premise for later work to analyze a time course after activation, and to perform isotopomer (SIRM) analyses with <sup>13</sup>C-labeled acetate or glucose, so as to understand activation-induced increases in the overall

      To revise the manuscript, we did however extrapolate from the point about adding BCR cross-linking to anti-CD40 as a variant form of activating the B cells for measurements of ROS, population growth, and rates of division (CTV partitioning). The results of these analyses, which align with and thereby strengthen the conclusions about these functional features from experiments with anti-CD40 but no anti-IgM, are added to Supplemental Fig 2H and Supplemental Fig 6D, F.

      Figures 5, 6, and 7

      The authors claim that Dhrs7b in B cells shapes antibody affinity and quantity. They use two mouse models for this analysis: huCD20-CreERT2 and Dhrs7b f/f; S1pr2-CreERT2 mice. 

      Suggestions for Improvement:

      Adaptive immune response characterization: A more comprehensive characterization of the adaptive immune response is needed, ideally using the Dhrs7b f/f; S1pr2-CreERT2 model. This should include: Analysis of the GC response in B220+CD138− cells. Class switch recombination analysis. A detailed characterization of centroblasts, centrocytes, and Tfh populations. Characterization of effector cells (plasma cells and memory cells).

      Within the limits of time and money, we have performed new experiments prompted by this constructive set of suggestions. 

      Specifically, we analyzed the suggested read-outs in the huCD20-CreERT2, Dhrs7b<sup>f/f</sup> model after immunization, recognizing that it trades greater signal-noise for the fact that effects are due to a mix of the impact on B cells during clonal expansion before GC recruitment and activities within the GC. In brief, the results showed that 

      (a) the GC B cell population - defined as CD138<sup>neg</sup> GL7<sup>+</sup> CD38<sup>lo/neg</sup> IgD<sup>neg</sup> B cells - was about half as large for PexRAP-deficient B cells net of any early- or preplasmablasts (CD138<sup>+</sup> events) (new - Fig 5G); 

      (b) the frequencies of pre- / early plasmablasts (CD138<sup>+</sup> GL7<sup>+</sup> CD38<sup>neg</sup>) events (see new - Fig. 6H, I; also, new Supplemental Fig 5D) were so low as to make it unlikely that our data with the S1pr2-CreERT2 model (in Fig 7B, C) would be affected meaningfully by analysis of the CD138 levels;

      (c) there was a modest decrease in centrocytes (LZ) but not centroblasts (DZ) (new - Fig 5H, I), consistent with the immunohistochemical data of Supplemental Fig. 5A-C.

      Because of time limitations (the "shelf life" of funds and the lab) and insufficient stock of the S1pr2-CreERT2, Dhrs7b<sup>f/f</sup> mice as well as those that would be needed as adoptive transfer recipients because of S1PR2 expression in (GC-)Tfh, the experiments were performed instead with the huCD20-CreERT2, Dhrs7b<sup>f/f</sup> model. We would also note that using this Cre transgene better harmonizes the centrocyte/centroblast and Tfh data with the existing data on these points in Supplemental Fig. 4. 

      (d) Of note, the analyses of Tfh and GC-Tfh phenotype cells using the huCD20-CreERT2 B cell type-specific inducible Cre system to inactivate Dhrs7b (new - Supplemental Fig 1G-I, along with new - Supplemental Fig 5E) provide evidence of an abnormality that must stem from a function or functions of PexRAP in B cells, most likely GC B cells. Specifically, it is known that the GC-Tfh population proliferates and is supported by the GC B cells, and the results of B cell-specific deletion show substantial reductions in Tfh cells (both the GC-Tfh gating and the wider gate for plots of CXCR5/PD-1/ fluorescence of CD4 T cells). 

      Timepoint Consistency: The NP response (Fig. 5) is analyzed four weeks post-immunization, whereas SRBC (Supp. Fig. 4) and Fig. 7 are analyzed one week or nine days post-immunization. The NP system analysis should be repeated at shorter timepoints to match the peak GC reaction.

      This comment may stem from a misunderstanding. As diagrammed in Fig. 5A, the experiments involving the NP system were in fact measured at 7 d after a secondary (booster) immunization. That timing is approximately the peak period and harmonizes with the 7 d used for harvesting SRBC-immunized mice. So in fact the data with each system were obtained at a similar time point. Of course the NP experiments involved a second immunization so that many plasma cell and Ab responses derived from memory B cells generated by the primary immunization. However, the field at present is dominated by the view that the vast majority of the GC B cells after this second immunization (which historically we perform with alum adjuvant) are recruited from the naive rather than the memory B cell pool. For the revised manuscript, we have taken care that the Methods, Legend, and Figure provide the information to readers, and expanded the statement of a rationale. 

      It may seem a technicality but under NIH regulations we are legally obligated to try to minimize mouse usage. It also behooves researchers to use funds wisely. In line with those imperatives, we used systems that would simultaneously allow analyses of GC B cells, identification of affinity maturation (which is minimal in our hands at a 7 d time point after primary NP-carrier immunization), and a switched repertoire (also minimal), and where with each immunogen the GC were scored at 7-9 d after immunization (9 d refers to the S1pr2-CreERT2 experiments). Apart from the end of funding, we feel that what little might be learned from performing a series of experiments that involve harvests 7 d after a primary immunization with NP-ovalbumin cannot well be justified. 

      In vitro plasma cell differentiation: Quantification is missing for plasma cell differentiation in vitro (Supp. Fig. 4). The stimulus used should also be specified in the figure legend. Given the use of anti-CD40, differentiation towards IgG1 plasma cells could provide additional insights.

      As suggested by the reviewer, we have added the results of quantifying the in vitro plasma cell differentiation in Supplemental Fig 6B. We also edited the Methods and Supplemental Figure Legend to give detailed information on the in vitro stimulation. 

      Proliferation and apoptosis analysis: The observed defects in the humoral response should be correlated with proliferation and apoptosis analyses, including Ki67 and Caspase markers.

      As suggested by the reviewer, we performed new experiments in which we analyzed the frequencies of cell death by annexin V staining, and elected to use intravital uptake of BrdU as a more direct measurement of the S phase / cell cycling component of net proliferation. The new results are now displayed in Figure 5 and Supplemental Fig. 5. 

      Western blot confirmation: While the authors have demonstrated the absence of PexRAP protein in the huCD20-CreERT2 model, this has not been shown in GC B cells from the Dhrs7b f/f; S1pr2-CreERT2 model. This confirmation is necessary to validate the efficiency of Dhrs7b deletion.

      We were unable to do this for technical reasons expanded on below. For the revision, we have added text to alert readers more explicitly to the potential impact of counter-selection on interpretation of the findings with GC B cells. Before entering the GC, B cells have undergone many divisions, so if there were major pre-GC counter-selection, in all likelihood the GC B cells would be PexRAP-sufficient. To recap from the original manuscript and the new data we have added, IMS shows altered lipid profiles in the GC B cells and the literature indicates that the lipids are short-lived, requiring de novo resynthesis. The BrdU, ROS, and annexin V data show that GC B cells are abnormal. Accordingly, abnormal GC B cells represent the parsimonious or straightforward interpretation of the new results with GC-Tfh cell prevalence. 

      While we take these findings together to suggest that counterselection (i.e., a Western result showing normal levels of PexRAP in the GC B cells) seems unlikely, it is formally possible and would mean that the in situ defects of GC B cells arose due to environmental influences of the PexRAP-deficient B cells during the developmental history of the WT B cells observed in the GC. 

      Having noted all that, we understand that concerns about counter-selection are an issue if a reader accepts the data showing that mutant (PexRAP-deficient) B cells tend to proliferate less and die more readily. Indeed, one can speculate that were we also to perform competition experiments in which the Ighb, Cd45.2 B cells (WT or Dhrs7b D/D) are mixed with equal numbers of Igha, Cd45.1 competitors, the differences would become much greater. With this in mind, Western blotting of flow-purified GC B cells might give a sense of how much counter-selection has occurred. 

      That said, the Westerns need at least 2.5 x 10<sup>6</sup> B cells (those in the manuscript used five million, 5  x 10<sup>6</sup>) and would need replication. Taken together with the observation that ~200,000 GC B cells (on average) were measured in each B cell-specific knockout mouse after immunization (Fig. 1, Fig 5) and taking into account yields from sorting, each Western would require some 20-25 tamoxifen-injected ___-CreERT2, Dhrs7b f/f mice, and about half again that number as controls. The expiry of funds prohibited the time and costs of generating that many mice (>70) and flow-purified GC B cells. 
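
      The arithmetic behind this estimate can be sketched as follows. The cell requirement and the per-mouse GC B cell count are the figures quoted above; the sort recovery fractions (50% and 60%) are assumed values chosen only to illustrate how a range of 20-25 mice could arise, reading "half again that number" as roughly half as many controls is an interpretation, and the function name is hypothetical.

```python
import math

def mice_per_western(cells_needed, gc_b_per_mouse, sort_recovery):
    """Mice needed to pool enough flow-purified GC B cells for one blot."""
    usable_per_mouse = gc_b_per_mouse * sort_recovery
    return math.ceil(cells_needed / usable_per_mouse)

CELLS_NEEDED = 2.5e6     # minimum B cells per Western (from the text)
GC_B_PER_MOUSE = 2.0e5   # ~200,000 GC B cells per mouse (from the text)

for recovery in (0.5, 0.6):  # assumed sort recoveries, not measured values
    knockout = mice_per_western(CELLS_NEEDED, GC_B_PER_MOUSE, recovery)
    controls = math.ceil(knockout / 2)  # assumed reading of "half again"
    print(f"recovery={recovery}: {knockout} knockout + {controls} control mice")
```

      Under these assumptions, a single blot plus one replicate would call for somewhere in the range of 60 to 80 mice, which is of the same order as the figure quoted above.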

      Figure 8

      The authors claim that Dhrs7b contributes to the modulation of ROS, impacting B cell proliferation.

      Suggestions for Improvement:

      GC ROS Analysis: The in vitro ROS analysis should be complemented by characterizing ROS and lipid peroxidation in the GC response using the Dhrs7b f/f; S1pr2-CreERT2 model. Flow cytometry staining with H2DCFDA, MitoSOX, Caspase-3, and Annexin V would allow assessment of ROS levels and cell death in GC B cells. 

      While subject to some of the same practical limits noted above, we have performed new experiments in line with this helpful input of the reviewer, and added the new data to the revised manuscript. Specifically, in addition to the BrdU and phenotyping analyses after immunization of huCD20-CreER<sup>T2</sup>, Dhrs7b<sup>f/f</sup> mice, DCFDA (ROS), MitoSox, and annexin V signals were measured for GC B cells. Although the MitoSox signals did not significantly differ for PexRAP-deficient GC B cells, the ROS and annexin V signals were substantially increased. We added the new data to Figure 5 and Supplemental Figure 5. Together with the decreased in vivo BrdU incorporation in GC B cells from Dhrs7b<sup>D/D-B</sup> mice, these results are consistent with and support our hypothesis that PexRAP regulates B cell population growth and GC physiology in part by regulating ROS detoxification, survival and proliferation of B cells.  

      Quantification is missing in Fig. 8E, and Fig. 8F should use clearer symbols for better readability. 

      We added quantification for Fig 8E in Supplemental Fig 6E, and edited the symbols in Fig 8F for better readability.

      Figure 9

      The authors claim that Dhrs7b in B cells affects oxidative metabolism and ER mass. The  results in this section are well-performed and convincing.

      Suggestion for Improvement:

      Based on the results, the discussion should elaborate on the potential role of lipids in antigen presentation, considering their impact on mitochondria and ER function.

      We very much appreciate the praise of the tantalizing findings about oxidative metabolism and ER mass, and will accept the encouragement that we add (prudently) to the Discussion section to make note of the points mentioned by the Reviewer, particularly now that (with their encouragement) we have the evidence that B cell-specific loss of PexRAP (with the huCD20-CreERT2 deletion prior to immunization) resulted in decreased (GC-)Tfh and somewhat lower GC B cell proliferation.  

      Reviewer #2 (Recommendations for the authors):

      The authors should investigate whether PexRAP-deficient GC B cells exhibit increased mitochondrial ROS and cell death ex vivo, as observed in in vitro cultured B cells.

      We very much appreciate the work of the referee and their input. We addressed this helpful recommendation, in essence aligned with points from Reviewer 1, via new experiments (until the money ran out) and addition of data to the manuscript. To recap briefly, we found increased ROS in GC B cells along with higher fractions of annexin V positive cells; intriguingly, increased mtROS (MitoSox signal) was not detected, which contrasts in a small way with the results in activated B cells in vitro. Without straying too far beyond what the data support, we note that this finding may align with papers that provide evidence of differences between pre-GC and GC B cells (for instance with lack of Tfam or LDHA in B cells).    

      It remains unclear whether the impaired proliferation of PexRAP-deficient B cells is primarily due to increased cell death. Although NAC treatment partially rescued the phenotype of reduced PexRAP-deficient B cell number, it did not restore them to control levels. Analysis of the proliferation capacity of PexRAP-deficient B cells following NAC treatment could provide more insight into the cause of impaired proliferation.

      To add to the data permitting an assessment of this issue, we performed new experiments in which B cells were activated (BCR and CD40 cross-linking), cultured, and both the change in population and the CTV partitioning were measured in the presence or absence of NAC. The results, added to the revision as Supplemental Fig 6F-H, show that although NAC improved cell numbers for PexRAP-deficient cells relative to controls, this compound did not increase divisions at all. We infer that the more powerful effect of this lipid synthesis enzyme is to promote survival rather than division capacity. 

      Primary antibody responses were assessed at only one time point (day 20). It would be valuable to examine the kinetics of antibody response at multiple time points (0, 1w, 2w, 3w, for example) to better understand the temporal impact of PexRAP on antibody production.

      We thank the reviewer for this suggestion. While kinetic measurement of Ag-specific antibody levels across multiple time points might provide an additional mechanistic clue into the impact of PexRAP on antibody production, the end of sponsored funding and imminent lab closure precluded performing such experiments.   

      CD138+ cell population includes both GC-experienced and GC-independent plasma cells (Fig. 7). Enumeration of plasmablasts, which likely consists of both PexRAP-deleted and undeleted cells (Fig. 7D and E), may mislead the readers such that PexRAP is dispensable for plasmablast generation. I would suggest removing these data and instead examining the number of plasmablasts in the experimental setting of Fig. 4A (huCD20-CreERT2-mediated deletion) to address whether PexRAP-deficiency affects plasmablast generation. 

      We have eliminated the figure panels in question, since it is accurate that in the absence of a time-stamping or marking approach we have a limited ability to distinguish plasma cells that arose prior to inactivation of the Dhrs7b gene in B cells. In addition, we performed new experiments that were used to analyze the "early plasmablast" phenotype and added those data to the revision (Supplemental Fig 5D).

    1. Author response:

      The following is the authors’ response to the original reviews.

      Reviewer #1 (Public review): 

      Summary:

      The authors use the theory of planned behavior to understand whether or not intentions to use sex as a biological variable (SABV), as well as attitude (value), subjective norm (social pressure), and behavioral control (ability to conduct behavior), across scientists at a pharmacological conference. They also used an intervention (workshop) to determine the value of this workshop in changing perceptions and misconceptions. Attempts to understand the knowledge gaps were made.

      Strengths:

      The use of SABV is limited in terms of researchers using sex in the analysis as a variable of interest in the models (and not a variable to control). To understand how we can improve on the number of researchers examining the data with sex in the analyses, it is vital we understand the pressure points that researchers consider in their work. The authors identify likely culprits in their analyses. The authors also test an intervention (workshop) to address the main bias or impediments for researchers' use of sex in their analyses. 

      Weaknesses:

      There are a number of assumptions the authors make that could be revisited: 

      (1) that all studies should contain across sex analyses or investigations. It is important to acknowledge that part of the impetus for SABV is to gain more scientific knowledge on females. This will require within sex analyses and dedicated research to uncover how unique characteristics for females can influence physiology and health outcomes. This will only be achieved with the use of female-only studies. The overemphasis on investigations of sex influences limits the work done for women's health, for example, as within-sex analyses are equally important.

      The Sex and Gender Equity in Research (SAGER) guidelines (1) provide guidance that “Where the subjects of research comprise organisms capable of differentiation by sex, the research should be designed and conducted in a way that can reveal sex-related differences in the results, even if these were not initially expected.” This is a default position of inclusion wherever the sex can be determined, with analysis that assesses sex-related variability in the response. This position underpins many of the funding bodies' new policies on inclusion.   

      However, we need to place this in the context of the driver of inclusion. The most common reason for including male and female samples is for those studies that are exploring the effect of a treatment and then the goal of inclusion is to assess the generalisability of the treatment effect (exploratory sex inclusion)(2). The second scenario is where sex is included because sex is one of the variables of interest and this situation will arise because there is a hypothesized sex difference of interest (confirmatory sex inclusion).  

      We would argue that the SABV concept was introduced to address the systematic bias of only studying one sex when assessing treatment effect, in order to improve the generalisability of the research.  Its aim is therefore not directly to gain more scientific knowledge on females.  However, this strategy will highlight when the effect is very different between male and female subjects, which will potentially generate sex-specific hypotheses.  

      Where research has a hypothesis that is specific to a sex (e.g. it is related to oestrogen levels) it would be appropriate to study only the sex of interest, in this case females. The recently published Sex Inclusive Research Framework gives some guidance here and allows an exemption for such a scenario classifying such proposals “Single sex study justified” (3).

      We have added an additional paragraph to the introduction to clarify the objectives behind inclusion and how this assists the research process. 

      (2) It should be acknowledged that although the variability within each sex is not different on a number of characteristics (as indicated by meta-analyses in rats and mice), this was not done on all variables, and behavioral variables were not included. In addition, across-sex variability may very well be different, which, in turn, would result in statistical sex significance. In addition, on some measures, there are sex differences in variability, as human males have more variability in grey matter volume than females. PMID: 33044802. 

      The manuscript was highlighting the common argument used to exclude females, which treats it as an absolute truth that females are inherently more variable. We agree there might be situations where the variance is higher in one sex or another depending on the biology.  We have extended the discussion here to reflect this, and we also linked to the Sex Inclusive Research Framework (3), which highlights that in these situations researchers can utilise this argument provided it is supported with data for the biology of interest. 

      (3) The authors need to acknowledge that it can be important that the sample size is increased when examining more than one sex. If the sample size is too low for biological research, it will not be possible to determine whether or not a difference exists. Using statistical modelling, researchers have found that depending on the effect size, the sample size does need to increase. It is important to bear this in mind, as exploratory analyses with small sample sizes will be extremely limiting and may also discourage further study in this area (or indeed, as seen in the literature, an exploratory first study using males and females with limited sample size shows no "significance", and this is used as a reason to use only males for the further studies in the work). 

      The reviewer raises a common problem, in which researchers have frequently argued that if they find no sex differences in a pilot then they can proceed to study only one sex. The SAGER guidelines (1), and now funder guidelines (4, 5), challenge that position. Instead, the expectation is for inclusion as the default in all experiments (exploratory inclusion strategy) to allow generalisable results to be obtained. When the results are very different between the male and female samples, this can then be detected. This perspective shift (2) requires a change in mindset and an understanding that the driver behind inclusion is generalisability, not the exploration of sex differences. This has been added to the introduction as an additional paragraph exploring the drivers behind inclusion.  

      We agree with the reviewer that if the researcher is interested in sex differences in an effect (confirmatory inclusion strategy, aka sex as a primary variable) then the N will need to be higher.  However, in this situation, one, of course, must have male and female samples in the same experiment to allow the simultaneous exploration to assess the dependency on sex. 
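
      Why the N rises under a confirmatory strategy can be illustrated with a minimal power sketch. It assumes a balanced 2x2 (treatment x sex) design, a known common standard deviation, a normal approximation, and an interaction of the same absolute size as the pooled treatment effect; the function name and all numeric inputs are hypothetical and are not taken from the study.

```python
from math import ceil
from statistics import NormalDist

def n_per_cell(delta, sigma, contrast_var_factor, alpha=0.05, power=0.8):
    """Animals per cell of a balanced 2x2 design (normal approximation).

    contrast_var_factor is c in Var(contrast) = c * sigma**2 / n:
    c = 1 for the treatment effect pooled across both sexes,
    c = 4 for the treatment-by-sex interaction (difference of differences).
    """
    z = NormalDist()
    z_total = z.inv_cdf(1 - alpha / 2) + z.inv_cdf(power)
    return ceil(contrast_var_factor * (z_total * sigma / delta) ** 2)

# Hypothetical effect of one standard deviation in both cases.
n_main = n_per_cell(delta=1.0, sigma=1.0, contrast_var_factor=1)
n_inter = n_per_cell(delta=1.0, sigma=1.0, contrast_var_factor=4)
print(f"pooled treatment effect: {n_main} per cell")
print(f"treatment-by-sex interaction: {n_inter} per cell")
```

      Under these assumptions the interaction contrast has four times the variance of the pooled treatment contrast, so detecting an interaction of the same size needs about four times as many animals per cell, whereas splitting a fixed N across both sexes leaves the precision of the pooled treatment estimate unchanged.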

      Reviewer #2 (Public review): 

      Summary:

      The investigators tested a workshop intervention to improve knowledge and decrease misconceptions about sex inclusive research. There were important findings that demonstrate the difficulty in changing opinions and knowledge about the importance of studying both males and females. While interventions can improve knowledge and decrease perceived barriers, the impact was small. 

      Strengths:

      The investigators included control groups and replicated the study in a second population of scientists. The results appear to be well substantiated. These are valuable findings that have practical implications for fields where sex is included as a biological variable to improve rigor and reproducibility. 

      Thank you for your assessment and for highlighting these strengths.  We appreciate your recognition of the value and practical implications of this work. 

      Weaknesses:

      I found the figures difficult to understand and would have appreciated more explanation of what is depicted, as well as greater space between the bars representing different categories. 

      We have improved the figures and figure legends to improve clarity. 

      Reviewer #3 (Public review):

      Summary:

      This manuscript aims to determine cultural biases and misconceptions in inclusive sex research and evaluate the efficacy of interventions to improve knowledge and shift perceptions to decrease perceived barriers for including both sexes in basic research. 

      Overall, this study demonstrates that despite the intention to include both sexes and a general belief in the importance of doing so, relatively few people routinely include both sexes. Further, the perceptions of barriers to doing so are high, including misconceptions surrounding sample size, disaggregation, and variability of females. There was also a substantial number of individuals without the statistical knowledge to appropriately analyze data in studies inclusive of sex. Interventions increased knowledge and decreased perception of barriers. 

      Strengths:

      (1) This manuscript provides evidence for the efficacy of interventions for changing attitudes and perceptions of research.

      (2) This manuscript also provides a training manual for expanding this intervention to broader groups of researchers.

      Thank you for highlighting these strengths. We appreciate your recognition that the intervention was effective in changing attitudes and perceptions. We deliberately chose to share the material to provide the resources needed for wider engagement.  

      Weaknesses:

      The major weakness here is that the post-workshop assessment is a single time point, soon after the intervention. As this paper shows, intention for these individuals is already high, so does decreasing perception of barriers and increasing knowledge change behavior, and increase the number of studies that include both sexes? Similarly, does the intervention start to shift cultural factors? Do these contribute to a change in behavior? 

      Measuring change in behaviour following an intervention is challenging and hence we had implemented an intention score as a proxy for behaviour. We appreciate the benefit of a long-term analysis, but it was beyond the scope of this study and would need a larger dataset size to allow for attrition. We agree that the strategy implemented has weaknesses. We have extended the limitation section in the discussion to include these. 

      Reviewer #1 (Recommendations for the authors):  

      I would ask them to think about alternative explanations and ask for free-form responses, and to revise with the caveats written above - sample size does need to be increased depending on effect size, and that within sex studies are also important. Not all studies should focus on sex influences.  

      The inclusion of the additional paragraph in the introduction to clarify the objective of inclusion and the resulting impact on experimental design should address these recommendations.   

      We have also added the free-form responses as an additional supplementary file.  

      Reviewer #2 (Recommendations for the authors):  

      This is an important set of studies. My only recommendation is to improve the data presentation so that it is clear what is depicted and how the analyses were conducted. I know it is in the methods, but reminding the reader would be helpful.  

      We have revisited the figures and included more information in the legends to explain the analysis and improve clarity.   

      Reviewer #3 (Recommendations for the authors):  

      There are parts in the introduction which read as contradictory and as such are confusing - for example, in the 3rd paragraph it states that little progress on sex inclusive research has been made, and in the following sentences it states that the proportion of published studies across sex has improved. The references in these two statements are from the same time range, so has this improved? Or not?  

      The introduction does include a summation statement on the position: “Whilst a positive step forward, this proportion still represents a minority of studies, and notably this inclusion was not associated with an increase in the proportion of studies that included data analysed by sex.” We have reworded the text to ensure it is internally consistent with this summary statement and this should increase clarity.

      In discussing the results, it is sometimes confusing what the percentages mean. For example, "the researchers reported only conducting sex inclusive research in <=55% of their studies over the past 5 years (55% in study 1 general population and 35% study 2 pre-assessment)." Does that mean 55% of people are conducting sex inclusive research, or does this mean only half of their studies? These two options have very different implications.

      We agree that the sentence is confusing and it has been reworded.  

      Addressing long-term assessments in attitude and action (ie, performing sex inclusive research) is a crucial addition, with data if possible, but at least substantive discussion.  

      We have added this to the limitations section in the discussion.

      One minor but confusing point is the analogy comparing sex inclusive studies with attending the gym. The point is well taken - knowledge is not enough for behavior change. However, the argument here is that to increase sex inclusive research requires cultural change. To go to the gym requires motivation. This seems like an oranges-to-lemons comparison (same family, different outcome when you bite into it).

      At the core, both scenarios involve the challenge of changing established habits and cultural norms in action based on knowledge (the right thing to do). The exercise scenario is a primary example provided by the original authors to describe how aspects of the theory of planned behaviour (perceived behavioural control, attitude, and social norms) may influence behavioural change. Understanding which of these aspects may drive or influence change is why we used this framework to understand our study population.  We disagree that this is an oranges-to-lemons comparison.

      References

      (1) Heidari S, Babor TF, De Castro P, Tort S, Curno M. Sex and Gender Equity in Research: rationale for the SAGER guidelines and recommended use. Res Integr Peer Rev. 2016;1:2.

      (2) Karp NA. Navigating the paradigm shift of sex inclusive preclinical research and lessons learnt. Commun Biol. 2025;8(1):681.

      (3) Karp NA, Berdoy M, Gray K, Hunt L, Jennings M, Kerton A, et al. The Sex Inclusive Research Framework to address sex bias in preclinical research proposals. Nat Commun. 2025;16(1):3763.

      (4) MRC. Sex in experimental design - Guidance on new requirements. UK Research and Innovation; 2022. https://www.ukri.org/councils/mrc/guidance-for-applicants/policies-and-guidance-for-researchers/sex-in-experimental-design/

      (5) Clayton JA, Collins FS. Policy: NIH to balance sex in cell and animal studies. Nature. 2014;509(7500):282-3.

    1. Author response:

      The following is the authors’ response to the previous reviews

      Reviewer #1 (Public review):

      Summary:

      Asthenospermia, characterized by reduced sperm motility, is one of the major causes of male infertility. The "9 + 2" arranged MTs and over 200 associated proteins constitute the axoneme, the molecular machine for flagellar and ciliary motility. Understanding the physiological functions of axonemal proteins, particularly their links to male infertility, could help uncover the genetic causes of asthenospermia and improve its clinical diagnosis and management. In this study, the authors generated Ankrd5 null mice and found that ANKRD5-/- males exhibited reduced sperm motility and infertility. Using FLAG-tagged ANKRD5 mice, mass spectrometry, and immunoprecipitation (IP) analyses, they confirmed that ANKRD5 is localized within the N-DRC, a critical protein complex for normal flagellar motility. However, transmission electron microscopy (TEM) and cryo-electron tomography (cryo-ET) of sperm from Ankrd5 null mice did not reveal significant structural abnormalities.

      Strengths:

      The phenotypes observed in ANKRD5-/- mice, including reduced sperm motility and male infertility, are convincing. The authors demonstrated that ANKRD5 is an N-DRC protein that interacts with TCTE1 and DRC4. Most of the experiments are well designed and executed.

      Weaknesses:

      The last section of cryo-ET analysis is not convincing. "ANKRD5 depletion may impair buffering effect between adjacent DMTs in the axoneme".

      "In WT sperm, DMTs typically appeared circular, whereas ANKRD5-KO DMTs seemed to be extruded as polygonal. (Fig. S9B,D). ANKRD5-KO DMTs seemed partially open at the junction between the A- and B-tubes (Fig. S9B,D)." In the TEM images of 4E, ANKRD5-KO DMTs look the same as WT. The distortion could result from suboptimal sample preparation, imaging or data processing. Thus, the subsequent analyses and conclusions are not reliable.

      Thank you for your valuable advice. To validate the results of cryo-ET, we carefully analyzed the TEM results (previously we only focused on the global "9+2" structure of the axial filament) and found that deletion of ANKRD5 resulted in both normal and deformed DMT morphologies, which was consistent with the results observed by cryo-ET. At the same time, we have added the corresponding text and picture descriptions in the article:

      The text description we added is: “Upon re-examining the TEM data in light of the Cryo-ET findings, similar abnormalities were observed in the TEM images (Fig. 4E, Fig. S10B). Notably, both intact and deformed DMT structures were consistently observed in both TEM and STA analyses, with the deformation of the B-tube being more obvious (Fig. 4E, Fig. S10).”

      This paper still requires significant improvements in writing and language refinement. Here is an example: "While N-DRC is critical for sperm motility, but the existence of additional regulators that coordinate its function remains unclear" - ill-formed sentences.

      We appreciate the reviewer’s valuable comment regarding the clarity of our writing. The sentence cited (“While N-DRC is critical for sperm motility, but the existence of additional regulators that coordinate its function remains unclear”) was indeed ill-formed. We have revised it to improve readability and precision. The corrected version now reads: “Although the N-DRC is critical for sperm motility, whether additional regulatory components coordinate its function remains unclear.” We have carefully re-examined the manuscript and refined the language throughout to ensure clarity and conciseness.

      Reviewer #2 (Public review):

      Summary:

      The manuscript investigates the role of ANKRD5 (ANKEF1) as a component of the N-DRC complex in sperm motility and male fertility. Using Ankrd5 knockout mice, the study demonstrates that ANKRD5 is essential for sperm motility and identifies its interaction with N-DRC components through IP-mass spectrometry and cryo-ET. The results provide insights into ANKRD5's function, highlighting its potential involvement in axoneme stability and sperm energy metabolism.

      Strengths:

      The authors employ a wide range of techniques, including gene knockout models, proteomics, cryo-ET, and immunoprecipitation, to explore ANKRD5's role in sperm biology.

      Weaknesses:

      “Limited Citations in Introduction: Key references on the role of N-DRC components (e.g., DRC2, DRC4) in male infertility are missing, which weakens the contextual background.”

      We appreciate the reviewer’s valuable suggestion. To address this concern, we have added the following sentence in the Introduction:

      “Recent mammalian knockout studies further confirmed that loss of DRC2 or DRC4 results in severe sperm flagellar assembly defects, multiple morphological abnormalities of the sperm flagella (MMAF), and complete male infertility, highlighting their indispensable roles in spermatogenesis and reproduction [31].”

      This addition introduces up-to-date evidence on DRC2 and DRC4 functions in male infertility and strengthens the contextual background as recommended.

      Reviewer #1 (Recommendations for the authors):

      "Male infertility impacts 8%-12% of the global male population, with sperm motility defects contributing to 40%-50% of these cases [2,3]. " Is reference 3 proper? I don't see "sperm motility defects contributing to 40%-50%" of male infertility.

      Thank you for identifying this issue. You are correct—reference 3 does not support the statement about sperm motility defects comprising 40–50% of male infertility cases; it actually states:

      “Male factor infertility is when an issue with the man’s biology makes him unable to impregnate a woman. It accounts for between 40 to 50 percent of infertility cases and affects around 7 percent of men.”

      This was a misunderstanding on my part, and I apologize for the oversight.

      To correct this, we have replaced the statement with more accurate references:

      PMID: 33968937 confirms:

      “Asthenozoospermia accounts for over 80% of primary male infertility cases.”

      PMID: 33191078 defines asthenozoospermia (AZS) as reduced or absent sperm motility and notes it as a major cause of male infertility.

      We have updated the manuscript accordingly:

      In the Significance Statement: “Male infertility affects approximately 8%-12% of men globally, with defects in sperm motility accounting for over 80% of these cases.”

      In the Introduction: “Male infertility affects approximately 8% to 12% of the global male population, with defects in sperm motility accounting for over 80% of these cases[2,3].”

      Thank you again for your careful review and for giving us the opportunity to improve the accuracy of our manuscript.

      "Rather than bypassing the issue with ICSI, infertility from poor sperm motility could potentially be treated or even cured through stimulation of specific signaling pathways or gene therapy." Need references.

      We appreciate the reviewer’s insightful comment. In response, we have added three supporting references to the relevant sentence.

      The first reference (PMID: 39932044) demonstrates that cBiMPs and the PDE-10A inhibitor TAK-063 significantly and sustainably improve motility in human sperm with low activity, including cryopreserved samples, without inducing premature acrosome reaction or DNA damage. The second reference (PMID: 29581387) shows that activation of the PKA/PI3K/Ca²⁺ signaling pathways can reverse reduced sperm motility. The third reference (PMID: 33533741) reports that CRISPR-Cas9-mediated correction of a point mutation in Tex11<sup>PM/Y</sup> spermatogonial stem cells (SSCs) restores spermatogenesis in mice and results in the production of fertile offspring.

      These references provide mechanistic support and demonstrate the feasibility of treating poor sperm motility through targeted pathway modulation or gene therapy, thus reinforcing the validity of our statement.

      "Our findings indicate that ANKRD5 (Ankyrin repeat domain 5; also known as ANK5 or ANKEF1) interacts with N-DRC structure". The full name should be provided the first time ANKRD5 appears. Is ANKRD5 a component of N-DRC or does it interact with N-DRC?

      We thank the reviewer for the valuable suggestion. In response, we have moved the full name “Ankyrin repeat domain 5; also known as ANK5 or ANKEF1” to the abstract where ANKRD5 first appears, and have removed the redundant mention from the main text.

      Based on our experimental data, we consider ANKRD5 to be a novel component of the N-DRC (nexin-dynein regulatory complex), rather than merely an interacting partner. Therefore, we have revised the sentence in the main text to read:

      “Here, we demonstrate that ANKRD5 is a novel N-DRC component essential for maintaining sperm motility.”

      Fig 5E, numbers of TEM images should be added.

      We thank the reviewer for the suggestion. We would like to clarify that Fig. 5E does not contain TEM images, and it is likely that the reviewer was referring to Fig. 4E instead.

      In Fig. 4E, we conducted three independent experiments. In each experiment, 60 TEM cross-sectional images of sperm tails were analyzed for both Ankrd5 knockout and control mice.

      The findings were consistent across all replicates.

      We have updated the figure legend accordingly, which now reads:

      “Transmission electron microscopy (TEM) of sperm tails from control and Ankrd5 KO mice. Cross-sections of the midpiece, principal piece, and end piece were examined. Red dashed boxes highlight regions of interest, and the magnified views of these boxed areas are shown in the upper right corner of each image. In three independent experiments, 20 sperm cross-sections per mouse were analyzed for each group, with consistent results observed.”

      There are random "222" in the references. Please check and correct.

      I sincerely apologize for the errors caused by the reference management software, which resulted in the insertion of random "222" and similar numbering issues in the reference list. I have carefully reviewed and corrected the following problems:

      References 9, 11, 13, 26, 34, 63, and 64 had the number "222" mistakenly placed before the title; these have now been removed. References 15 and 18 had "111" incorrectly inserted before the title; this has also been corrected. Reference 36 had an erroneous "2" before the title and was found to be a duplicate of Reference 32; these have now been merged into a single citation. Additionally, References 22 and 26 were identified as duplicates of the same article and have been consolidated accordingly. 

      All these issues have been resolved to ensure the reference list is accurate and properly formatted.

      Reviewer #2 (Recommendations for the authors):

      The authors have already addressed most of the issues I am concerned about.

      In addition, we have also corrected some errors in the revised manuscript:

      (1) In Figure 3G, the y-axis label was previously marked as “Sperm count in the oviduct (10⁶)”, which has now been corrected to “Sperm count in the oviduct”.

      (2) All p-values have been reformatted to italic lowercase letters to comply with the journal style guidelines.

      Figure 6 Legend: A typographical error in the figure legend has been corrected. The text previously read “(A) The differentially expressed proteins of Ankrd5<sup>+/–</sup> and Ankrd5<sup>+/-</sup> were identified...”. This has now been amended to “(A) The differentially expressed proteins of Ankrd5<sup>+/–</sup> and Ankrd5<sup>–/–</sup> were identified...” to correctly represent the comparison between heterozygous and homozygous knockout groups.

      In the original Figure 4E, we added a zoom-in panel to the image to show the deformed DMT.

    1. Author response:

      The following is the authors’ response to the previous reviews

      Reviewer #2 (Public review): 

      Summary: 

      The paper describes the high-resolution structure of KdpFABC, a bacterial pump regulating intracellular potassium concentrations. The pump consists of a subunit with an overall structure similar to that of a canonical potassium channel and a subunit with a structure similar to a canonical ATP-driven ion pump. The ions enter through the channel subunit and then traverse the subunit interface via a long channel that lies parallel to the membrane to enter the pump, followed by their release into the cytoplasm. 

      The work builds on the previous structural and mechanistic studies from the authors' and other labs. While the overall architecture and mechanism have already been established, a detailed understanding was lacking. The study provides a 2.1 Å resolution structure of the E1-P state of the transport cycle, which precedes the transition to the E2 state, assumed to be the rate-limiting step. It clearly shows a single K+ ion in the selectivity filter of the channel and in the canonical ion binding site in the pump, resolving how ions bind to these key regions of the transporter. It also resolves the details of water molecules filling the tunnel that connects the subunits, suggesting that K+ ions move through the tunnel transiently without occupying well-defined binding sites. The authors further propose how the ions are released into the cytoplasm in the E2 state. The authors support the structural findings through mutagenesis and measurements of ATPase activity and ion transport by surface-supported membrane (SSM) electrophysiology.

      Reviewer #3 (Public review): 

      Summary: 

      By expressing protein in a strain that is unable to phosphorylate KdpFABC, the authors achieve structures of the active wildtype protein, capturing a new intermediate state, in which the terminal phosphoryl group of ATP has been transferred to a nearby Asp, and ADP remains covalently bound. The manuscript examines the coupling of potassium transport and ATP hydrolysis by a comprehensive set of mutants. The most interesting proposal revolves around the proposed binding site for K+ as it exits the channel near T75. Nearby mutations to charged residues cause interesting phenotypes, such as constitutive uncoupled ATPase activity, leading to a model in which lysine residues can occupy/compete with K+ for binding sites along the transport pathway. 

      Strengths: 

      The high resolution (2.1 Å) of the current structure is impressive, and allows many new densities in the potassium transport pathway to be resolved. The authors are judicious about assigning these as potassium ions or water molecules, and explain their structural interpretations clearly. In addition to the nice structural work, the mechanistic work is thorough. A series of thoughtful experiments involving ATP hydrolysis/transport coupling under various pH and potassium concentrations bolsters the structural interpretations and lends convincing support to the mechanistic proposal. The SSME experiments are generally rigorous. 

      Weaknesses: 

      The present SSME experiments do not support quantitative comparisons of different mutants, as in Figures 4D and 5E. Only qualitative inferences can be drawn among different mutant constructs. 

      Thank you to both reviewers for your thorough review of our work. We acknowledge the limitations of SSME experiments in quantitative comparison of mutants and have revised the manuscript to address this point. In addition, we have included new ATPase data from reconstituted vesicles which we believe will help to strengthen our contention that both ATPase and transport are equally affected by Val496 mutations.

      Reviewer #2 (Recommendations for the authors): 

      I have a minor editorial comment: 

      Perhaps I am confused. However, in reference to the text in the Results: "Our WT complex displayed high levels of K+-dependent ATPase activity and generated robust transport currents (Fig. 1 - figure suppl. 1).", I do not see either K+-dependency of ATPase activity nor transport currents in Fig. 1 - figure suppl. 1. Perhaps the text needs to be edited for clarity. 

      Thank you for pointing this out. This confusion was caused by our removal of a panel from the revised manuscript, which depicted K+-dependent transport currents. Although this panel is somewhat redundant, given inclusion of raw SSME traces from all the mutants, it has been replaced as Fig. 1 - figure supplement 1F, thus providing a thorough characterization of the preparation used for cryo-EM analysis and supporting the statement quoted by this reviewer.

      Reviewer #3 (Recommendations for the authors): 

      The authors have provided a detailed description of the SSME data collection, and followed rigorous protocols to ensure that the currents measured on a particular sensor remained stable over time. 

      I still have reservations about the direct comparison of transport in the different mutants. Specifically, on page 6, the authors state that "The longer side chain of V496M reduces transport modestly with no effect on ATPase activity. V496R, which introduces positive charge, completely abolishes activity. V496W and V496H reduce both transport and ATPase activity by about half, perhaps due to steric hindrance for the former and partial protonation for the latter." And in figures 4D and 5B, by plotting all of the peak currents on the same graph, the authors are giving the data a quantitative veneer, when these different experiments really aren't directly comparable, especially in the absence of any controls for reconstitution efficiency. 

      In terms of overall conclusions, for the more drastic mutant phenotypes, I think it is completely reasonable to conclude that transport is not observed. But a 2-fold difference could easily result from differences in reconstitution or sensor preparation. My suggestion would be to show example traces rather than a numeric plot in 4D/5E, to convey the qualitative nature of the mutant-to-mutant comparisons, and to re-write the text to acknowledge the shortcomings of mutant-to-mutant comparisons with SSME, and avoid commenting on the more subtle phenotypes, such as modest decreases and reductions by about half. 

      Figure 4, supplement 1. What is S162D? I don't think it is mentioned in the main text. 

      We agree with the reviewer's point that quantitative comparison of different mutants by SSME is compromised by ambiguity in reconstitution. However, we do not think that display of raw SSME currents is an effective way to communicate qualitative effects to the general reader, given the complexity of these data (e.g., distinction between transient binding current seen in V496R and genuine, steady-state transport current seen in WT). So we have taken a compromise approach. To start, we have removed the transport data from the main figure (Fig. 4). Luckily, we had frozen and saved the batch of reconstituted proteoliposomes from Val496 mutants that had been used for transport assays. We therefore measured ATPase activities from these proteoliposomes - after adding a small amount of detergent to prevent buildup of electrochemical gradients (1 mg/ml decylmaltoside which is only slightly more than the critical micelle concentration of 0.87 mg/ml). Differences in ATPase activity from these proteoliposomes were very similar to those measured prior to reconstitution (i.e., data in Fig. 4d) indicating that reconstitution efficiencies were comparable for the various mutants. Furthermore, differences in SSME currents are very similar to these ATPase activities, suggesting that Val496 mutants did not affect energy coupling. These data are shown in the revised Fig. 4 - figure suppl. 1a, along with the SSME raw data and size-exclusion chromatography elution profiles (Fig. 4 - figure suppl. 1b-g). We also altered the text to point out the concern over comparing transport data from different mutants (see below). We hope that this revised presentation adequately supports the conclusion that Val496 mutations - and especially the V496R substitution - influence the passage of K+ through the tunnel without affecting mechanics of the ATP-dependent pump. 

      The paragraph in question now reads as follows (pg. 6-7, with additional changes to legends to Fig. 4 and Fig. 4 - figure suppl. 1):

      "In order to provide experimental evidence for K+ transport through the tunnel, we made a series of substitutions to Val496 in KdpA. This residue resides near the widest part of the tunnel and is fully exposed to its interior (Fig. 4a). We made substitutions to increase its bulk (V496M and V496W) and to introduce charge (V496E, V496R and V496H). We used the AlphaFold-3 artificial intelligence structure prediction program (Jumper et al., 2021) to generate structures of these mutants and to evaluate their potential impact on tunnel dimensions. This analysis predicts that V496W and V496R reduce the radius to well below the 1.4 Å threshold required for passage of K+ or water (Fig. 4c); V496E and V496M also constrict the tunnel, but to a lesser extent. Measurements of ATPase and transport activity (Fig. 4d) show that negative charge (V496E) or a longer side chain (V496M) has no apparent effect on ATPase activity. V496R, which introduces positive charge, almost completely abolishes activity. V496W and V496H reduce both transport and ATPase activity by about half, perhaps due to steric hindrance for the former and partial protonation for the latter. Transport activity of these mutants was also measured, but quantitative comparisons are hampered by potential inconsistency in reconstitution of proteoliposomes and in preparation of sensors for SSME. To account for differences in reconstitution, we compared ATPase activity and transport currents taken from the same batch of vesicles (Fig. 4 - figure suppl. 1a). These data show that differences in ATPase activity of proteoliposomes were consistent with differences measured prior to reconstitution (Fig. 4d). Transport activity, which was derived from multiple sensors, mirrored ATPase activity, indicating that the Val496 mutants did not affect energy coupling, but simply modulated turnover rate of the pump."

      S162D was included as a negative control, together with D307A. However, given the inactive mutants discussed in Fig. 5 (Asp582 and Lys586 substitutions), these seem an unnecessary distraction and have been removed from Fig. 4 - figure suppl. 1.

    1. Author response:

      The following is the authors’ response to the previous reviews

      Reviewer #1 (Public review):

      The study presents significant findings on the role of mitochondrial depletion in axons and its impact on neuronal proteostasis. It effectively demonstrates how the loss of axonal mitochondria and elevated levels of eIF2β contribute to autophagy collapse and neuronal dysfunction. The use of Drosophila as a model organism and comprehensive proteome analysis adds robustness to the findings.

      In this revision, the authors have responded thoughtfully to previous concerns. In particular, they have addressed the need for a quantitative analysis of age-dependent changes in eIF2β and eIF2α. By adding western blot data from multiple time points (7 to 63 days), they show that eIF2β levels gradually increase until middle age, then decline. In milton knockdown flies, this pattern appears shifted, supporting the idea that mitochondrial defects may accelerate aging-related molecular changes. These additions clarify the temporal dynamics of eIF2β and improve the overall interpretation.

      Other updates include appropriate corrections to figures and quantification methods. The authors have also revised some of their earlier mechanistic claims, presenting a more cautious interpretation of their findings.

      Overall, this work provides new insights into how mitochondrial transport defects may influence aging-related proteostasis through eIF2β. The manuscript is now more convincing, and the revisions address the main points raised earlier. I find the updated version much improved.

      Thank you so much for the review, insightful comments and encouragement. We appreciate it.  

      Reviewer #2 (Public review):

      In the manuscript, the authors aimed to elucidate the molecular mechanism that explains neurodegeneration caused by the depletion of axonal mitochondria. In Drosophila, starting with siRNA depletion of milton and Miro, the authors attempted to demonstrate that the depletion of axonal mitochondria induces the defect in autophagy. From proteome analyses, the authors hypothesized that autophagy is impacted by the abundance of eIF2β and the phosphorylation of eIF2α. The authors followed up the proteome analyses by testing the effects of eIF2β overexpression and depletion on autophagy. With the results from those experiments, the authors proposed a novel role of eIF2β in proteostasis that underlies neurodegeneration derived from the depletion of axonal mitochondria, which they suggest accelerates age-dependent changes rather than increasing their magnitude.

      Strong caution is necessary regarding the interpretation of translational regulation resulting from the milton KD. The effect of milton KD on translation appears subtle, if present at all, in the puromycin incorporation experiments in both the initial and revised versions. Additionally, the polysome profiling data in the revised manuscript lack the clear resolution for ribosomal subunits, monosomes, and polysomes that is typically expected in publications.

      Thank you so much for the review and insightful comments. We appreciate it.  

      Reviewer #2 (Recommendations for the authors):

      The revised manuscript demonstrates many improvements. The authors have provided a more comprehensive data set and a more detailed description of their results. Furthermore, their explanation of the Integrated Stress Response (ISR) has been corrected, and this correction is reflected in the data interpretation.

      As in the public review, I maintained my emphasis on the weakness of the claim on suppressed global translation, since the data are the same in the initial and the revised versions.

      Thank you for your review. We understand that further studies will be needed to elucidate the role of mitochondrial distribution in the global translation profile. We will keep working on it.

      A few suggestions for minor corrections.

      (1) The order of figures in the revised version is disorganized.

      Thank you for pointing it out. We corrected the order. 

      (2) In Figure 1A, mitochondria is bound by milton, and kinesin is bound by Miro. Their roles should be opposite.

      Thank you for pointing it out, and we are sorry for the oversight. We corrected it.

    1. Author response:

      The following is the authors’ response to the original reviews.

      Reviewer #1 (Recommendations for the authors): 

      I will address here just some minor changes that would improve understanding, reproducibility, or cohesion with the literature.

      (1) It would be good to mention that the prostatic vesicle of this study is named vesicula granulorum in (Steinböck, 1966) and granule vesicle in (Hooge et al, 2007).

      We have now included this (line 90 of our revised manuscript).  

      (2) A slightly more detailed discussion of the germline genes would be interesting. For example, a potential function of pa1b3-2 and cgnl1-2 based on the similarity to known genes or on the conserved domains.

      Pa1b3-2 appears to encode an acetylhydrolase; cgnl1-2 is likely a cingulin family protein involved in cell junctions. However, given the evolutionary distance between acoels and the model organisms in which these genes have been studied, we believe it is premature to speculate on their function without substantial additional work. We believe this work would be more appropriate in a future publication focused on the molecular genetic underpinnings of Hofstenia’s reproductive systems and their development.

      (3) It is mentioned that the animals can store sperm while lacking a seminal bursa "given that H. miamia can lay eggs for months after a single mating" (line 635) - this could also be self-fertilization, according to the authors' other findings.

      We agree that it is possible this is self-fertilization, and we believe we have represented this uncertainty accurately in the text. However, we do not think this is likely, because self-fertilization manifests as a single burst of egg laying (Fig. 6D). We discuss this in the Results (line 540). 

      (4) A source should be given for the tree in Figure 7B. 

      We have now included this source (line 736), and we apologize for the oversight.  

      (5) Either in the Methods or in the Results section, it would be good to give more details on why actin and FMRFamide and tropomyosin are chosen for the immunohistochemistry studies.

      We have now included more detail in the Methods (line 823). Briefly, these are previously-validated antibodies that we knew would label relevant morphology.

      (6) In the Methods "a standard protocol hematoxylin eosin" is mentioned. Even if this is a fairly common technique, more details or a reference should be provided.

      We have now included more detail, and a reference (lines 766-774).  

      (7) Given the historical placement of Acoela within Platyhelminthes and the fact that the readers might not be very familiar with this group of animals, two passages can be confusing: line 499 and lines 674-678.

      We have edited these sentences to clarify when we mean Platyhelminthes, which addresses this confusion.

      (8) A small addition to Table S1: Amphiscolops langerhansi also presents asexual reproduction through fission ([1], cited in [2]).

      Thanks. We have included this in Table S1.

      [1] Hanson, E. D. 1960. 'Asexual Reproduction in Acoelous Turbellaria'. The Yale Journal of Biology and Medicine 33 (2): 107-11.

      [2] Hendelberg, Jan, and Bertil Åkesson. 1991. 'Studies of the Budding Process in Convolutriloba Retrogemma (Acoela, Platyhelminthes)'. In Turbellarian Biology: Proceedings of the Sixth International Symposium on the Biology of the Turbellaria, Held at Hirosaki, Japan, 7-12 August 1990, 11-17. Springer.

      Reviewer #2 (Recommendations for the authors): 

      I do not have any major comments on the manuscript. By default, I feel descriptive studies are a critical part of the advancement of science, particularly if the data are of great quality - as is the case here. The manuscript addresses various topics and describes these adequately. My minor point would be that in some sections, it feels like one could have gone a bit deeper. I highlighted three examples in the weakness section above (deeper analysis of markers for germline; modes of oogenesis/spermatogenesis; or proposed model for sperm storage). For instance, ultrastructural data might have been informative. But as said, I don't see this as a major problem, more a "would have been nice to see".

      We have responded to these points in detail above.

    1. Author response:

      Reviewer #1 (Public review):

      Major Comments:

      (1) The abstract frames progressive renal dysfunction as a "central, disease-modifying feature" in both Gba1b and Parkin models, with systemic consequences including water retention, ionic hypersensitivity, and worsened neuro phenotypes. While the data demonstrates renal degeneration and associated physiological stress, the causal contribution of renal defects versus broader organismal frailty is not fully disentangled. Please consider adding causal experiments (e.g., temporally restricted renal rescue/knockdown) to directly establish kidney-specific contributions.

      We concur that this would help strengthen our conclusions. However, manipulating Gba1b in a tissue-specific manner remains challenging due to its propensity for secretion via extracellular vesicles (ECVs). Leo Pallanck and Marie Davis have elegantly shown that ectopic Gba1b expression in neurons and muscles (tissues with low predicted endogenous expression) is sufficient to rescue major organismal phenotypes. Consistent with this, we have been unable to generate clear tissue-specific phenotypes using Gba1b RNAi.

      We will pursue more detailed time-course experiments of the progression of renal pathology (water weight, renal stem cell proliferation, redox defects, etc.) with the goal of identifying earlier-onset phenotypes that potentially drive dysfunction.

      (2) The manuscript shows multiple redox abnormalities in Gba1b mutants (reduced whole fly GSH, paradoxical mitochondrial reduction with cytosolic oxidation, decreased DHE, increased lipid peroxidation, and reduced peroxisome density/Sod1 mislocalization). These findings support a state of redox imbalance, but the driving mechanism remains broad in the current form. It is unclear if the dominant driver is impaired glutathione handling or peroxisomal antioxidant/β-oxidation deficits or lipid peroxidation-driven toxicity, or reduced metabolic flux/ETC activity. I suggest adding targeted readouts to narrow the mechanism.

      We agree that we have not yet established a core driver of redox imbalance. Identifying one is likely to be challenging, especially as our RNA-sequencing data from aged Gba1b<sup>⁻/⁻</sup> fly heads (Atilano et al., 2023) indicate that several glutathione S-transferases (GstD2, GstD5, GstD8, and GstD9) are upregulated. We can attempt overexpression of GSTs, which has been elegantly shown by Leo Pallanck to ameliorate pathology in Pink1/Parkin mutant fly brains. However, mechanisms that specifically suppress lipid peroxidation or its associated toxicity, independently of other forms of redox damage, remain poorly understood in Drosophila. Our position is that there is unlikely to be a single dominant driver of redox imbalance. Notably, CytB5 overexpression has been shown to reduce lipid peroxidation (Chen et al., 2017), and GstS1 has been reported to conjugate glutathione to the toxic lipid peroxidation product 4-HNE (Singh et al., 2001). Additionally, work from the Bellen lab demonstrated that overexpression of the lipases bmm or lip4 suppresses lipid peroxidation-mediated neurodegeneration (Liu et al., 2015). We will therefore test the effects of overexpressing CytB5, bmm and lip4 in Gba1b<sup>⁻/⁻</sup> flies to help further define the mechanism.

      (3) The observation that broad antioxidant manipulations (Nrf2 overexpression in tubules, Sod1/Sod2/CatA overexpression, and ascorbic acid supplementation) consistently shorten lifespan or exacerbate phenotypes in Gba1b mutants is striking and supports the idea of redox fragility. However, these interventions are broad. Nrf2 influences proteostasis and metabolism beyond redox regulation, and Sod1/Sod2/CatA may affect multiple cellular compartments. In the absence of dose-response testing or controls for potential off-target effects, the interpretation that these outcomes specifically reflect redox dyshomeostasis feels ahead of the data. I suggest incorporating narrower interpretations (e.g., targeting lipid peroxidation directly) to clarify which redox axis is driving the vulnerability.

      We are in agreement that Drosophila Cnc exhibits functional conservation with both Nrf1 and Nrf2, which have well-established roles in proteostasis and lysosomal biology that may exacerbate pre-existing lysosomal defects in Gba1b mutants. In our manuscript, Nrf2 manipulation forms part of a broader framework of evidence, including dietary antioxidant ascorbic acid and established antioxidant effectors CatA, Sod1, and Sod2. Together, these data indicate that Gba1b mutant flies display a deleterious response to antioxidant treatments or manipulations. To further characterise the redox state, we will quantify lipid peroxidation using Bodipy 581/591 and assess superoxide levels via DHE staining under our redox-altering experimental conditions.

      As noted above, we will attempt to modulate lipid peroxidation directly through CytB5 and GstS1 overexpression, acknowledging the caveat that this approach may not fully dissociate lipid peroxidation from other aspects of redox stress. We have also observed detrimental effects of PGC1α on the lifespan of Gba1b<sup>⁻/⁻</sup> flies and will further investigate its impact on redox status in the renal tubules.

      (4) This manuscript concludes that nephrocyte dysfunction does not exacerbate brain pathology. This inference currently rests on a limited set of readouts: dextran uptake and hemolymph protein as renal markers, lifespan as a systemic measure, and two brain endpoints (LysoTracker staining and FK2 polyubiquitin accumulation). While these data suggest that nephrocyte loss alone does not amplify lysosomal or ubiquitin stress, they may not fully capture neuronal function and vulnerability. To strengthen this conclusion, the authors could consider adding functional or behavioral assays (e.g., locomotor performance)

      We will address this suggestion by performing DAM activity assays and climbing assays in the Klf15; Gba1b<sup>⁻/⁻</sup> double mutants.

      (5) The manuscript does a strong job of contrasting Parkin and Gba1b mutants, showing impaired mitophagy in Malpighian tubules, complete nephrocyte dysfunction by day 28, FRUMS clearance defects, and partial rescue with tubule-specific Parkin re-expression. These findings clearly separate mitochondrial quality control defects from the lysosomal axis of Gba1b. However, the mechanistic contrast remains incomplete. Many of the redox and peroxisomal assays are only presented for Gba1b. Including matched readouts across both models (e.g., lipid peroxidation, peroxisome density/function, Grx1-roGFP2 compartmental redox status) would make the comparison more balanced and strengthen the conclusion that these represent distinct pathogenic routes.

      We agree that Gba1b<sup>⁻/⁻</sup> mutants have been characterised in greater detail than park<sup>⁻/⁻</sup> mutants. The primary aim of our study was not to provide an exhaustive characterisation of park¹/¹, but rather to compare key shared and distinct mechanisms underlying renal dysfunction. We have included several relevant readouts for park<sup>⁻/⁻</sup> tubules (e.g., Figure 7D and 8H: mito-Grx1-roGFP2; Figure 8J: lipid peroxidation using BODIPY 581/591). To expand our characterisation of park¹/¹ flies, we will express the cytosolic Grx1 reporter and the peroxisomal marker YFP::Pts.

      (6) Rapamycin treatment is shown to rescue several renal phenotypes in Gba1b mutants (water retention, RSC proliferation, FRUMS clearance, lipid peroxidation) but not in Parkin, and mitophagy is not restored in Gba1b. This provides strong evidence that the two models engage distinct pathogenic pathways. However, the therapeutic interpretation feels somewhat overstated. Human relevance should be framed more cautiously, and the conclusions would be stronger with mechanistic markers of autophagy (e.g., Atg8a, Ref(2)p flux in Malpighian tubules) or with experiments varying dose, timing, and duration (short-course vs chronic rapamycin).

      We will measure Atg8a, polyubiquitin, and Ref(2)P levels in Gba1b<sup>⁻/⁻</sup> and park<sup>¹/¹</sup> tubules following rapamycin treatment. In our previous study focusing on the gut (Atilano et al., 2023), we showed that rapamycin treatment increased lysosomal area, as assessed using LysoTracker<sup>TM</sup>. We will extend this analysis to the renal tubules following rapamycin exposure. Another reviewer requested that we adopt more cautious language regarding the clinical translatability of this work, and we will amend this in Version 2.

      (7) Several systemic readouts used to support renal dysfunction (FRUMS clearance, salt stress survival) could also be influenced by general organismal frailty. To ensure these phenotypes are kidney-intrinsic, it would be helpful to include controls such as tissue-specific genetic rescue in Malpighian tubules or nephrocytes, or timing rescue interventions before overt systemic decline. This would strengthen the causal link between renal impairment and the observed systemic phenotypes.

      As noted in our response to point 1, we currently lack reliable approaches to manipulate Gba1b in a tissue-specific manner. However, we agree that it is important to distinguish kidney-intrinsic dysfunction from generalised organismal frailty. In the park model, we have already performed renal cell-autonomous rescue: re-expression of Park specifically in Malpighian tubule principal cells (C42-Gal4) throughout adulthood partially normalises water retention, whereas brain-restricted Park expression has no effect on renal phenotypes. Because rescuing Park only in the renal tubules is sufficient to correct a systemic fluid-handling phenotype in otherwise mutant animals, these findings indicate that the systemic defects are driven, at least in part, by renal dysfunction rather than nonspecific organismal frailty.

      To strengthen this causal link, we will now extend this same tubule-specific Park rescue (C42-Gal4 and the high-fidelity Malpighian tubule driver CG31272-Gal4) to additional systemic readouts raised by the reviewer. Specifically, we will assay FRUMS clearance and salt stress survival in rescued versus non-rescued park mutants to determine whether renal rescue also mitigates these systemic phenotypes.

      Reviewer #2 (Public review):

      (1) The authors claim that: "renal system dysfunction negatively impacts both organismal and neuronal health in Gba1b-/- flies, including autophagic-lysosomal status in the brain." This statement implies that renal impairments drive neurodegeneration. However, there is no direct evidence provided linking renal defects to neurodegeneration in this model. It is worth noting that Gba1b-/- flies are a model for neuronopathic Gaucher disease (GD): they accumulate lipids in their brains and present with neurodegeneration and decreased survival, as shown by Kinghorn et al. (The Journal of Neuroscience, 2016, 36, 11654-11670) and by others, which the authors failed to mention (Davis et al., PLoS Genet. 2016, 12: e1005944; Cabasso et al., J Clin Med. 2019, 8:1420; Kawasaki et al., Gene, 2017, 614:49-55).

      With the caveats noted in the responses below, we show that driving Nrf2 expression using the renal tubular driver C42 results in decreased survival, more extensive renal defects, and increased brain pathology in Gba1b<sup>⁻/⁻</sup> flies, but not in healthy controls. This suggests that a healthy brain can tolerate renal dysfunction without severe pathological consequences. Our findings therefore indicate that in Gba1b<sup>⁻/⁻</sup> flies, there may be an interaction between renal defects and brain pathology. We do not explicitly claim that renal impairments drive neurodegeneration; rather, we propose that manipulations exacerbating renal dysfunction can have organism-wide effects, ultimately impacting the brain.

      The reviewer is correct that our Gba1b<sup>⁻/⁻</sup> fly model represents a neuronopathic GD model with age-related pathology. Indeed, we reproduce the autophagic-lysosomal defects previously reported (Kinghorn et al., 2016) in Figure 5. We agree that the papers cited by the reviewer merit inclusion, and in Version 2 we will incorporate them into the following pre-existing sentence in the Results:

      “The gut and brain of Gba1b<sup>⁻/⁻</sup> flies, similar to macrophages in GD patients, are characterised by enlarged lysosomes (Kinghorn et al., 2016; Atilano et al., 2023).”

      (2) The authors tested brain pathology in two experiments:

      (a) To determine the consequences of abnormal nephrocyte function on brain health, they measured lysosomal area in the brain of Gba1b-/-, Klf15LOF, or stained for polyubiquitin. Klf15 is expressed in nephrocytes and is required for their differentiation. There was no additive effect on the increased lysosomal volume (Figure 3D) or polyubiquitin accumulation (Figure 3E) seen in Gba1b-/- fly brains, implying that loss of nephrocyte viability itself does not exacerbate brain pathology.

      (b) The authors tested the consequences of overexpression of the antioxidant regulator Nrf2 in principal cells of the kidney on neuronal health in Gba1b-/- flies, using the c42-GAL4 driver. They claim that "This intervention led to a significant increase in lysosomal puncta number, as assessed by LysoTrackerTM staining (Figure 5D), and exacerbated protein dyshomeostasis, as indicated by polyubiquitin accumulation and increased levels of the ubiquitin-autophagosome trafficker Ref(2)p/p62 in Gba1b-/- fly brains (Figure 5E). Interestingly, Nrf2 overexpression had no significant effect on lysosomal area or ubiquitin puncta in control brains, demonstrating that the antioxidant response specifically in Gba1b-/- flies negatively impacts disease states in the brain and renal system." Notably, c42-GAL4 is a leaky driver, expressed in salivary glands, Malpighian tubules, and pericardial cells (Beyenbach et al., Am. J. Cell Physiol. 318: C1107-C1122, 2020). Expression in pericardial cells may affect heart function, which could explain deterioration in brain function.

      Taken together, the contribution of renal dysfunction to brain health remains debatable.

      Based on the above, I believe the title should be changed to: Redox Dyshomeostasis Links Renal and Neuronal Dysfunction in Drosophila Models of Gaucher disease. Such a title will reflect the results presented in the manuscript.

      We agree that C42-Gal4 is a leaky driver; unfortunately, this was true for all commonly used Malpighian tubule drivers available when we began the study. A colleague has recommended CG31272-Gal4 from the Perrimon lab’s recent publication (Xu et al., 2024) as a high-fidelity Malpighian tubule driver. If it proves to maintain principal-cell specificity throughout ageing in our hands, we will repeat key experiments using this driver.

      (3) The authors mention that Gba1b is not expressed in the renal system, which means that no renal phenotype can be attributed directly to any known GD pathology. They suggest that systemic factors such as circulating glycosphingolipids or loss of extracellular vesicle-mediated delivery of GCase may mediate renal toxicity. This raises a question about the validity of this model to test pathology in the fly kidney. According to Flybase, there is expression of Gba1b in renal structures of the fly.

      Our evidence suggesting that Gba1b is not substantially expressed in renal tissue is based on use of the Gba1b-CRIMIC-Gal4 line, which fails to drive expression of fluorescently tagged proteins in the Malpighian tubules; we have also previously shown that this driver line shows no expression in the nephrocytes (Atilano et al., 2023). This does not exclude the possibility that Gba1b functions within the tubules. Notably, Leo Pallanck has provided compelling evidence that Gba1b is present in extracellular vesicles (ECVs), and given the role of the Malpighian tubules in haemolymph filtration, these cells are likely exposed to circulating ECVs. The lysosomal defects observed in Gba1b<sup>⁻/⁻</sup> tubules therefore suggest a potential role for Gba1b in this tissue.

      John Vaughan and Thomas Clandinin have developed mCherry- and Lamp1.V5-tagged Gba1b constructs. We intend to express these in tissues shown by the Pallanck lab to release ECVs (e.g., neurons and muscle) and examine whether the protein can be detected in the tubules.

      (4) It is worth mentioning that renal defects are not commonly observed in patients with Gaucher disease. Relevant literature: Becker-Cohen et al., A Comprehensive Assessment of Renal Function in Patients With Gaucher Disease, J. Kidney Diseases, 2005, 46:837-844.

      We have identified five references indicating that renal involvement, while rare, does occur in association with GD. We agree that this is a valid citation and will include it in the revised introductory sentence:

      “However, renal dysfunction remains a rare symptom in GD patients (Smith et al., 1978; Chander et al., 1979; Siegel et al., 1981; Halevi et al., 1993).”

      (5) In the discussion, the authors state: "Together, these findings establish renal degeneration as a driver of systemic decline in Drosophila models of GD and PD..." and go on to discuss a brain-kidney axis in PD. However, since this study investigates a GD model rather than a PD model, I recommend omitting this paragraph, as the connection to PD is speculative and not supported by the presented data.

      Our position is that Gba1b<sup>⁻/⁻</sup> represents a neuronopathic Gaucher disease model with mechanistic relevance to PD. The severity of GBA1 mutations correlates with the extent of GBA1/GCase loss of function and, consequently, with increased PD risk. Likewise, biallelic Parkin mutations cause a severe and heritable form of PD, and the Drosophila park<sup>⁻/⁻</sup> model is a well-established and widely recognised system that has been instrumental in elucidating how Parkin and Pink1 mutations drive PD pathogenesis.

      We therefore see no reason to omit this paragraph. While some aspects are inherently speculative, such discussion is appropriate and valuable when addressing mechanisms underlying a complex and incompletely understood disease, provided interpretations remain measured. At no point do we claim that our work demonstrates a direct brain-renal axis. Rather, our data indicate that renal dysfunction is a disease-modifying feature in these models, aligning with emerging epidemiological evidence linking PD and renal impairment.

      (6) The claim: "If confirmed, our findings could inform new biomarker strategies and therapeutic targets for GBA1 mutation carriers and other at-risk groups. Maintaining renal health may represent a modifiable axis of intervention in neurodegenerative disease," extends beyond the scope of the experimental evidence. The authors should consider tempering this statement or providing supporting data.

      (7) The conclusion, "we uncover a critical and previously overlooked role for the renal system in GD and PD pathogenesis," is too strong given the data presented. As no mechanistic link between renal dysfunction and neurodegeneration has been established, this claim should be moderated.

      We agree that these sections may currently overstate our findings. In Version 2, we will revise them to ensure our claims remain balanced, while retaining the key points that arise from our data and clearly indicating where conclusions require confirmation (“if confirmed”) or additional study (“warrants further investigation”).

      “If confirmed, our findings could inform new biomarker strategies and therapeutic targets for patients with GD and PD. Maintaining renal health may represent a modifiable axis of intervention in these diseases.”

      “We uncover a notable and previously underappreciated role for the renal system in GD and PD, which now warrants further investigation.”

      (8) The relevance of Parkin mutant flies is questionable, and this section could be removed from the manuscript.

      We intend to include the data for the Parkin loss-of-function mutants, as these provide essential support for the PD-related findings discussed in our manuscript. To our knowledge, this represents the first demonstration that Parkin mutants display defects in Malpighian tubule function and water homeostasis. We therefore see no reason to remove these findings. Furthermore, as Reviewer 1 specifically requested additional experiments using the Park fly model, we plan to incorporate these analyses in the revised manuscript.

      Minor comments:

      (1)  Figure 1G: The FRUMS assay is not shown for Gba1b-/- flies.

      The images in Figure 1G illustrate representative stages of dye clearance. We have quantified the clearance time course for both genotypes. During this process, the tubules of Gba1b<sup>⁻/⁻</sup> flies, similar to controls, sequentially resemble each of the three example images. As the Gba1b<sup>⁻/⁻</sup> tubules appear morphologically identical to controls, differing only in population-level clearance dynamics, we do not feel that including additional example images would provide further informative value.

      (2) In panels D and F of Figure 2, survival of control and Gba1b-/- flies in the presence of 4% NaCl is presented. However, longevity is different (up to 10 days in D and ~3 days in F for control). The authors should explain this.

      We agree. In our experience, feeding-based stress survival assays show considerable variability between experiments, and we therefore interpret results only within individual experimental replicates. We have observed similar variability in oxidative stress, starvation, and xenobiotic survival assays, which may reflect batch-specific or environmental effects.

      (3) In Figure 7F, the representative image does not correspond to the quantification; the percentage of endosome-negative nephrocytes seems to be higher for the control than for the park1/1 flies. Please check this.

      The example images are correctly oriented. Typically, an endosome-negative nephrocyte shows no dextran uptake, whereas an endosome-positive nephrocyte displays a ring of puncta around the cell periphery. In park¹/¹ mutants, dysfunctional nephrocytes exhibit diffuse dextran staining throughout the cell, accompanied by diffuse DAPI signal, indicating a complete loss of membrane integrity and likely cell death. We have 63× images from the preparations shown in Figure 7F demonstrating this. In Version 2, we will include apical and medial z-slices of the nephrocytes to illustrate these findings (to be added as supplementary data).

      (4) In Figure 7H, the significance between control and park1/1 flies in the FRUMS assay is missing.

      We observe significant dye clearance from the haemolymph; however, the difference in complete clearance from the tubules does not reach statistical significance. This may speculatively reflect alterations in specific aspects of tubule function, where absorption and transcellular flux are affected, but subsequent clearance from the tubule lumen remains intact. We do not feel that our current data provide sufficient resolution to draw detailed conclusions about tubule physiology at this level.

      Reviewer #3 (Public review):

      Weaknesses:

      The paper relies mostly on the biallelic Gba1b mutant, which may reflect dysfunction in Gaucher's patients, though this has yet to be fully explored. The claims for the heterozygous allele and a role in Parkinson's are a little more tenuous, making assumptions that heterozygosity is a similar but milder phenotype than the full loss-of-function.

      We agree with the reviewer that studying heterozygotes may provide valuable insight into GBA1-associated PD. We will therefore assess whether subtle renal defects are detectable in Gba1b<sup>⁺/⁻</sup> heterozygotes. We clearly state that GBA1 mutations act as a risk factor for PD rather than a Mendelian inherited cause. Consistent with findings from Gba heterozygous mice, Gba1b<sup>⁺/⁻</sup> flies display minimal phenotypes (Kinghorn et al., 2016), and any observable effects are expected to be very mild and age dependent.

      (1) Figure 1c, the loss of stellate cells. What age are the MTs shown? Is this progressive or developmental?

      These experiments were conducted on flies that were three weeks of age, as were all manipulations unless otherwise stated. We will ensure that this information is clearly indicated in the figure legends in Version 2. We did not observe changes in stellate cell number at three days of age, and this result will be included in the supplementary material in Version 2. Our data therefore suggest that this is a progressive phenotype.

      (2) I might have missed this, but for Figure 3, do the mutant flies start with a similar average weight, or are they bloated?

      We will perform an age-related time course of water weight in response to Reviewer 1’s comments. For all experiments, fly eggs are age-matched and seeded below saturation density to ensure standardised conditions. Gba1b mutant flies do not exhibit any defects in body size or timing of eclosion.

      (3) On 2F, add to the graph that 4% NaCl (or if it is KCl) is present for all conditions, just to make the image self-sufficient to read.

      Many thanks for the suggestion. We agree that this will increase clarity and will make this amendment in Version 2 of the manuscript.

      (4) P13 - rephrase, 'target to either the mitochondria or the cytosol' (as it is phrased, it sounds as though you are doing both at the same time).

      We agree and we plan to revise the sentence as follows:

      Original:

      “To further evaluate the glutathione redox potential (E<sub>GSH</sub>) in MTs, we utilised the redox-sensitive green, fluorescent biosensor Grx1-roGFP2, targeted to both the mitochondria and cytosol (Albrecht et al., 2011).”

      Revised:

      “To further evaluate the glutathione redox potential (E<sub>GSH</sub>) in MTs, we utilised the redox-sensitive fluorescent biosensor Grx1-roGFP2, targeted specifically to either the mitochondria or the cytosol using mito- or cyto-tags, respectively (Albrecht et al., 2011).”

      (5) In 6F - the staining appears more intense in the Park mutant - perhaps add asterisks or arrowheads to indicate the nephrocytes so that the reader can compare the correct parts of the image?

      Reviewer 2 reached the same interpretation. Typically, an endosome-negative nephrocyte shows no dextran uptake, whereas an endosome-positive nephrocyte displays a ring of puncta around the cell periphery. In park¹/¹ mutants, dysfunctional nephrocytes exhibit diffuse dextran staining throughout the cell, accompanied by diffuse DAPI signal, indicative of a complete loss of membrane integrity and likely cell death. We have 63× images from the preparations shown in Figure 7F demonstrating this, and in Version 2 we will include apical and medial z-slices of the nephrocytes to illustrate these findings (to be added as supplementary data).

      (6) In the main results text - need some description/explanation of the SOD1 v SOD2 distribution (as it is currently understood) in the cell - SOD2 being predominantly mitochondrial. This helps arguments later on.

      Thank you for this suggestion. We plan to amend the text as follows:

      “Given that Nrf2 overexpression shortens lifespan in Gba1b<sup>⁻/⁻</sup> flies, we investigated the effects of overexpressing its downstream antioxidant targets, Sod1, Sod2, and CatA, both ubiquitously using the tub-Gal4 driver and with c42-Gal4, which expresses in PCs.”

      to:

      “Given that Nrf2 overexpression shortens lifespan in Gba1b<sup>⁻/⁻</sup> flies, we investigated the effects of overexpressing its downstream antioxidant targets, Sod1, Sod2, and CatA, both ubiquitously using the tub-Gal4 driver and with c42-Gal4, which expresses in PCs. Sod1 and CatA function primarily in the cytosol and peroxisomes, whereas Sod2 is localised to the mitochondria. Sod1 and Sod2 catalyse the dismutation of superoxide radicals to hydrogen peroxide, while CatA subsequently degrades hydrogen peroxide to water and oxygen.”

      (7) Figure 1G, what age are the flies? Same for 3D and E, 4C,D,E, 5B - please check the ages of flies for all of the imaging figures; this information appears to have been missed out.

      As stated above, all experiments were conducted on three-week-old flies unless otherwise specified. In Version 2 of the manuscript, we will ensure this information is included consistently in the figure legends to prevent any potential confusion.

    1. Author response:

      The following is the authors’ response to the original reviews.

      Reviewer #1:

      Specifically, the authors need to define the DFG conformation using criteria accepted in the field, for example, see https://klifs.net/index.php.

      We thank the reviewer for this suggestion. In the manuscript, we use pseudodihedral and bond angle-based DFG definitions that have been previously established by literature cited in the study (re-iterated below) to unambiguously define the side-chain conformational states of the DFG motif. As we are interested in the specific mechanics of DFG flips under different conditions, we’ve found that the descriptors defined below are sufficient to distinguish between DFG states and allow a more direct comparison with previously-reported results in the literature using different methods.

      We amended the text to be more clear as to those definitions and their choice:

      DFG angle definitions:

      Phe382/Cg, Asp381/OD2, Lys378/O

      Source: Structural Characterization of the Aurora Kinase B "DFG-flip" Using Metadynamics. Lakkaniga NR, Balasubramaniam M, Zhang S, Frett B, Li HY. AAPS J. 2019 Dec 18;22(1):14. doi: 10.1208/s12248-019-0399-6. PMID: 31853739; PMCID: PMC7905835.

      “Finally, we chose the angle formed by Phe382's gamma carbon, Asp381's protonated side chain oxygen (OD2), and Lys378's backbone oxygen as PC3 based on observations from a study that used a similar PC to sample the DFG flip in Aurora Kinase B using metadynamics \cite{Lakkaniga2019}. This angular PC3 should increase or decrease (based on the pathway) during the DFG flip, with peak differences at intermediate DFG configurations, and then revert to its initial state when the flip concludes.”

      DFG pseudodihedral definitions:

      Ala380/Cb, Ala380/Ca, Asp381/Ca, Asp381/Cg

      Ala380/Cb, Ala380/Ca, Phe382/Ca, Phe382/Cg

      Source: Computational Study of the “DFG-Flip” Conformational Transition in c-Abl and c-Src Tyrosine Kinases. Yilin Meng, Yen-lin Lin, and Benoît Roux The Journal of Physical Chemistry B 2015 119 (4), 1443-1456 DOI: 10.1021/jp511792a

      “For downstream analysis, we used two pseudodihedrals previously defined in the existing Abl1 DFG flip simulation literature \cite{Meng2015} to identify and discriminate between DFG states. The first (dihedral 1) tracks the flip state of Asp381, and is formed by the beta carbon of Ala380, the alpha carbon of Ala380, the alpha carbon of Asp381, and the gamma carbon of Asp381. The second (dihedral 2) tracks the flip state of Phe382, and is formed by the beta carbon of Ala380, the alpha carbon of Ala380, the alpha carbon of Phe382, and the gamma carbon of Phe382. These pseudodihedrals, when plotted in relation to each other, clearly distinguish between the initial DFG-in state, the target DFG-out state, and potential intermediate states in which either Asp381 or Phe382 has flipped.”
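
      The angle and pseudodihedral descriptors defined above reduce to simple vector geometry on atomic coordinates. The following Python sketch is illustrative only and is not taken from the manuscript: the function names are hypothetical, and the reference values and the 90° tolerance used by `is_flipped` are placeholder assumptions, not the thresholds used in the study.

```python
import math

def _sub(a, b):
    return (a[0] - b[0], a[1] - b[1], a[2] - b[2])

def _dot(a, b):
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2]

def _cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])

def angle_deg(p1, p2, p3):
    """Angle in degrees at the vertex p2, formed by the points p1-p2-p3."""
    v1, v2 = _sub(p1, p2), _sub(p3, p2)
    cos_t = _dot(v1, v2) / math.sqrt(_dot(v1, v1) * _dot(v2, v2))
    # Clamp to [-1, 1] to guard against floating-point round-off.
    return math.degrees(math.acos(max(-1.0, min(1.0, cos_t))))

def dihedral_deg(p1, p2, p3, p4):
    """Signed (pseudo)dihedral in degrees, in the range -180..180."""
    b1, b2, b3 = _sub(p2, p1), _sub(p3, p2), _sub(p4, p3)
    n1, n2 = _cross(b1, b2), _cross(b2, b3)
    # atan2 form: numerically stable near 0 and 180 degrees.
    y = math.sqrt(_dot(b2, b2)) * _dot(b1, n2)
    x = _dot(n1, n2)
    return math.degrees(math.atan2(y, x))

def is_flipped(value_deg, reference_deg, tol_deg=90.0):
    """Hypothetical criterion: a residue counts as flipped when its dihedral
    differs from the DFG-in reference by more than tol_deg (placeholder)."""
    diff = abs((value_deg - reference_deg + 180.0) % 360.0 - 180.0)
    return diff > tol_deg

def classify_dfg(asp_flipped, phe_flipped):
    """Label a frame from the flip state of Asp381 and Phe382."""
    if asp_flipped and phe_flipped:
        return "DFG-out"
    if asp_flipped:
        return "DFG-inter"  # Asp381 flipped, Phe382 not yet flipped
    if phe_flipped:
        return "Phe-flipped intermediate"
    return "DFG-in"
```

      Given coordinates from a trajectory frame, `dihedral_deg` applied to Ala380 Cb, Ala380 Ca, Asp381 Ca and Asp381 Cg would give dihedral 1, and substituting the Phe382 Ca and Cg atoms would give dihedral 2. The PC3 angle would follow from `angle_deg` with the three atoms passed in the order listed; treating Asp381 OD2 as the vertex is an assumption based on that ordering.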

      Convergence needs to be demonstrated for estimating the population difference between different conformational states.

      We agree that demonstrating convergence is important for accurate estimations of population differences between conformational states. However, as the DFG flip is a complex and concerted conformational change with an energy barrier of 30 kcal/mol [1], and considering the traditional limitations of methods like weighted ensemble molecular dynamics (WEMD), it would take an unrealistic amount of GPU time (months) to observe convergence in our simulations. As discussed in the text (see examples below), we caveat our energy estimations by explicitly mentioning that the state populations we report are not converged and are indicative of a much larger energy barrier in the mutant.

      “These relative probabilities qualitatively agree with the large expected free energy barrier for the DFG-in to DFG-out transition (~32 kcal/mol), and with our observation of a putative metastable DFG-inter state that is missed by NMR experiments due to its low occupancy.”

      “As an important caveat, it is unlikely that the DFG flip free energy barriers of over 70 kcal/mol estimated for the Abl1 drug-resistant variants quantitatively match the expected free energy barrier for their inactivation. Rather, our approximate free energy barriers are a symptom of the markedly increased simulation time required to sample the DFG flip in the variants relative to the wild-type, which is a strong indicator of the drastically reduced propensity of the variants to complete the DFG flip. Although longer WE simulations could allow us to access the timescales necessary for more accurately sampling the free energy barriers associated with the DFG flip in Abl1's drug-resistant compound mutants, the computational expense of running WE for 200 iterations is already large (three weeks with 8 NVIDIA RTX3900 GPUs for one replicate); this poses a logistical barrier to attempting to sample sufficient events to be able to fully characterize how the reaction path and free energy barrier change for the flip associated with the mutations. Regardless, the results of our WE simulations resoundingly show that the Glu255Lys/Val and Thr315Ile compound mutations drastically reduce the probability for DFG flip events in Abl1.”

      (1) Conformational states dynamically populated by a kinase determine its function. Tao Xie et al., Science 370, eabc2754 (2020). DOI:10.1126/science.abc2754

      The DFG flip needs to be sampled several times to establish free energy difference.

      Our simulations have captured thousands of correlated and dozens of uncorrelated DFG flip events. The per-replicate free energy differences are computed based on the correlated transitions. Please consult the WEMD literature (referenced below and in the manuscript, references 34 and 36) for more information on how WEMD allows the sampling of multiple such events and subsequent estimation of probabilities:

      Zuckerman et al (2017) 10.1146/annurev-biophys-070816-033834

      Chong et al (2021) 10.1021/acs.jctc.1c01154

      The free energy plots do not appear to show an intermediate state as claimed.

      Both the free energy plots and the representative/anecdotal trajectories analyzed in the study show a saddle point when Asp381 has flipped but Phe382 has not (which defines the DFG-inter state), and we observe a distinct change in probability when going from the pseudodihedral values associated with DFG-inter to those associated with DFG-up or DFG-out. We removed references to the putative state S1, as we agree with the reviewer that its presence is unlikely given the data we show.

      The trajectory length of 7 ns in both Figure 2 and Figure 4 needs to be verified, as it is extremely short for a DFG flip that has a high free energy barrier.

      We appreciate this point. To clarify, the 7 ns segments correspond to a collated trajectory extracted from the tens of thousands of walkers that compose the WEMD ensemble, and represent just the specific moment at which the dihedral flips occur rather than the entire flip process. On average, our WEMD simulations sample over 3 μs of aggregate simulation time before the first DFG flip event is observed, in line with a high energy barrier. This is made clear in the manuscript excerpt below: “Over an aggregate simulation time of over 20 $\mu$s, we have collected dozens of uncorrelated and unbiased inactivation events, starting from the lowest energy conformation of the Abl1 kinase core (PDB 6XR6) \cite{Xie2020}.”

      The free energy scale (100 kT) appears to be one order of magnitude too large.

      As discussed in the text and quoted in response to comment 2, the exponential splitting nature of WEMD simulations (where the probability of individual walkers is split upon crossing each bin threshold) often leads to unrealistically high energy barriers for rare events. This is not unexpected, and as discussed in the text, we consider that value to be a qualitative measurement of the decreased probability of a DFG flip in Abl1 mutants, and not a direct measurement of energy barriers.
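
      The link between a relative state probability and an apparent free energy can be made concrete with the standard relation ΔG = −k<sub>B</sub>T ln(p<sub>state</sub>/p<sub>ref</sub>). The Python sketch below is a generic illustration, not the analysis code used in the study; the function names are hypothetical and the 300 K temperature is an assumption, since the simulation temperature is not stated in this response.

```python
import math

# Boltzmann constant in kcal/(mol*K).
KB_KCAL_PER_MOL_K = 0.0019872041

def thermal_energy(temperature_k=300.0):
    """k_B * T in kcal/mol (300 K is an assumed temperature)."""
    return KB_KCAL_PER_MOL_K * temperature_k

def free_energy_difference(p_state, p_ref, temperature_k=300.0):
    """Apparent free energy of a state relative to a reference state,
    in kcal/mol, from their relative probabilities."""
    return -thermal_energy(temperature_k) * math.log(p_state / p_ref)

def kcal_to_kt(energy_kcal, temperature_k=300.0):
    """Express an energy given in kcal/mol in units of k_B * T."""
    return energy_kcal / thermal_energy(temperature_k)

if __name__ == "__main__":
    for barrier in (32.0, 70.0):
        print(f"{barrier} kcal/mol = {kcal_to_kt(barrier):.1f} kT at 300 K")
```

      Under the 300 K assumption, a barrier of about 32 kcal/mol corresponds to roughly 54 kT and one of about 70 kcal/mol to roughly 117 kT, which is the order of magnitude the reviewer refers to. Because the estimate depends logarithmically on the walker probabilities, very small unconverged weights translate directly into very large apparent barriers.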

      Setting the DFG-Asp to the protonated state is not justified, because in the DFG-in state, the DFG-Asp is clearly deprotonated.

      According to previous publications, DFG-Asp is frequently protonated in the DFG-in state of Abl1 kinase. For instance, as quoted from Hanson, Chodera, et al., Cell Chem Bio (2019), “Consistent with previous simulations on the DFG-Asp-out/in interconversion of Abl kinase we only observe the DFG flip with protonated Asp747 (Shan et al., 2009). We showed previously that the pKa for the DFG-Asp in Abl is elevated at 6.5.”

      Finally, the authors should discuss their work in the context of the enormous progress made in theoretical studies and mechanistic understanding of the conformational landscape of protein kinases in the last two decades, particularly with regard to the DFG flip. and The study is not very rigorous. The major conclusions do not appear to be supported. The claim that it is the first unbiased simulation to observe DFG flip is not true. For example, Hanson, Chodera et al (Cell Chem Biol 2019), Paul, Roux et al (JCTC 2020), and Tsai, Shen et al (JACS 2019) have also observed the DFG flip.

      We thank the reviewer for pointing out these issues. We have revised the manuscript to better contextualize our claims within the limitations of the method and to acknowledge previous work by Hanson, Chodera et al., Paul, Roux et al., and Tsai, Shen et al.

      The updated excerpt is shown below:

      “Through our work, we have simulated an ensemble of DFG flip pathways in a wild-type kinase and its variants with atomistic resolution and without the use of biasing forces, also reporting the effects of inhibitor-resistant mutations in the broader context of kinase inactivation likelihood with such level of detail.”

      Reviewer #2:

      I appreciated the discussion of the strengths/weaknesses of weighted ensemble simulations. Am I correct that this method doesn't do anything to explicitly enhance sampling along orthogonal degrees of freedom? Maybe a point worth mentioning if so.

      Yes, this is correct. We added a sentence addressing this point to the WEMD summary section of the Results and Discussion.

      “As a supervised enhanced sampling method, WE employs progress coordinates (PCs) to track the time-dependent evolution of a system from one or more basis states towards a target state. Although weighted ensemble simulations are unbiased in the sense that no biasing forces are added over the course of the simulations, the selection of progress coordinates and the bin definitions can potentially bias the results towards specific pathways \cite{Zuckerman2017}. Additionally, traditional WEMD simulations do not explicitly enhance sampling along orthogonal degrees of freedom (those not captured by the progress coordinates). In practice, this means that insufficient PC definitions can lead to poor sampling.”
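
The point about orthogonal degrees of freedom can be seen in a minimal sketch of a weighted ensemble resampling step. This is an illustrative toy, not the WEMD implementation used in the manuscript: walkers are reduced to a `(pc_value, weight)` pair, the bin edges and per-bin target are hypothetical parameters, and splitting and merging follow the usual weight-conserving rules.

```python
import bisect
import random


def bin_index(pc_value, edges):
    """Index of the bin that a progress-coordinate value falls into."""
    return bisect.bisect_right(edges, pc_value)


def resample_bin(walkers, target, rng=None):
    """Split or merge (pc_value, weight) walkers in one bin until `target` remain."""
    if target < 1:
        raise ValueError("target must be at least 1")
    rng = rng or random.Random()
    walkers = sorted(walkers, key=lambda w: w[1])
    while walkers and len(walkers) < target:
        # Split the heaviest walker into two copies, each with half the weight.
        pc, weight = walkers.pop()
        walkers += [(pc, weight / 2), (pc, weight / 2)]
        walkers.sort(key=lambda w: w[1])
    while len(walkers) > target:
        # Merge the two lightest walkers; the survivor is picked with
        # probability proportional to weight and inherits the summed weight.
        (pc1, w1), (pc2, w2) = walkers[0], walkers[1]
        survivor = pc1 if rng.random() < w1 / (w1 + w2) else pc2
        walkers = sorted([(survivor, w1 + w2)] + walkers[2:], key=lambda w: w[1])
    return walkers


def resample(walkers, edges, per_bin, rng=None):
    """Group walkers by progress-coordinate bin and resample each occupied bin."""
    bins = {}
    for walker in walkers:
        bins.setdefault(bin_index(walker[0], edges), []).append(walker)
    return [w for b in sorted(bins) for w in resample_bin(bins[b], per_bin, rng)]
```

Nothing in `resample` inspects any coordinate other than the progress coordinate, so two walkers that differ only along an orthogonal degree of freedom are treated identically. That is the sense in which sampling along such coordinates is not explicitly enhanced.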

      I don't understand Figure 3C. Could the authors instead show structures corresponding to each of the states in 3B, and maybe also a representative structure for pathways 1 and 2?

      We have remade Figure 3. We removed 3B and the accompanying discussion because, upon review, we were not confident in the significance of the LPATH results as they pertain to the probability of intermediate states. We replaced 3B with a summary of pathways 1 and 2 with regard to the Phe382 flip (which is the most contrasting difference).

      Why introduce S1 and DFG-inter? And why suppose that DFG-inter is what corresponds to the excited state seen by NMR?

      As a consequence of dropping the LPATH analysis, we also removed mentions of S1, as further analysis made it hard to distinguish from DFG-in. For DFG-inter, we mention that conformation because (a) it is shared by both flipping mechanisms that we have found, and (b) it seems relevant for pharmacology: it has been observed in other kinases such as Aurora B (PDB 2WTV), and Asp381 flipping before Phe382 creates space in the orthosteric kinase pocket that could potentially be targeted by an inhibitor.

      It would be nice to have error bars on the populations reported in Figure 3.

      Agreed. Upon review, we decided to drop the populations, as we were not confident in the significance of the LPATH results as they pertain to the probability of intermediate states.

      I'm confused by the attempt to relate the relative probabilities of states to the 32 kcal/mol barrier previously reported between the states. The barrier height should be related to the probability of a transition. The DFG-out state could be equiprobable with the DFG-in state and still have a 32 kcal/mol barrier separating them.

      Thank you for the correction. We agree with the reviewer and have amended the discussion to reflect this. Since we start our simulations in the DFG-in state, the probability of walkers arriving in DFG-out in our steady-state WEMD simulations should (assuming proper sampling) represent the probability of the transition. We had incorrectly associated the probability of the DFG-out state itself with the probability of the transition.

      How do the relative probabilities of the DFG-in/out states compare to experiments, like NMR?

      Previous NMR work has found the population of apo DFG-in (PDB 6XR6) in solution to be around 88% for wild-type ABL1, and 6% for DFG-out (PDB 6XR7). The remaining 6% represents a post-DFG-out state (PDB 6XRG) in which the activation loop has folded in near the hinge, which we did not simulate due to the associated computational cost. The same study estimates the barrier height from DFG-in to DFG-out at around 30 kcal/mol.

      (1) Conformational states dynamically populated by a kinase determine its function. Tao Xie et al., Science 370, eabc2754 (2020). DOI:10.1126/science.abc2754
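
To make the distinction between state populations and barrier height concrete, the NMR populations quoted above can be converted into an equilibrium free energy difference. The sketch below assumes a two-state Boltzmann relation and a temperature of 298.15 K; the temperature is an assumption for illustration, not a value taken from the NMR study.

```python
import math

R_KCAL = 0.0019872041  # gas constant in kcal/(mol*K)


def delta_g_from_populations(p_a, p_b, temperature=298.15):
    """Free energy of state B relative to state A, in kcal/mol: -RT ln(p_b / p_a)."""
    if p_a <= 0 or p_b <= 0:
        raise ValueError("populations must be positive")
    return -R_KCAL * temperature * math.log(p_b / p_a)


# DFG-in at 88% and DFG-out at 6% (populations quoted above).
dg_in_to_out = delta_g_from_populations(0.88, 0.06)
print(f"DFG-out relative to DFG-in: {dg_in_to_out:.2f} kcal/mol")
```

Under these assumptions the difference is on the order of 1 to 2 kcal/mol, far smaller than the roughly 30 kcal/mol barrier, which illustrates why relative populations alone say little about the height of the barrier separating the states.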

      (we already have that in the text, just need to quote here)

      “Do the staggered and concerted DFG flip pathways mentioned correspond to pathways 1 and 2 in Figure 3B, or is that a concept from previous literature?”

      Yes, we have amended Figure 3B to be clearer. In previous literature both pathways have been observed [1], although not specifically defined.

      Source: Computational Study of the “DFG-Flip” Conformational Transition in c-Abl and c-Src Tyrosine Kinases. Yilin Meng, Yen-lin Lin, and Benoît Roux The Journal of Physical Chemistry B 2015 119 (4), 1443-1456 DOI: 10.1021/jp511792a

    1. Author response:

      The following is the authors’ response to the original reviews

      Public Reviews:

      Reviewer #1 (Public review):

      Summary

      Query: In this manuscript, the authors introduce Gcoupler, a Python-based computational pipeline designed to identify endogenous intracellular metabolites that function as allosteric modulators at the G protein-coupled receptor (GPCR) - Gα protein interface. Gcoupler is comprised of four modules:

      I. Synthesizer - identifies protein cavities and generates synthetic ligands using LigBuilder3

      II. Authenticator - classifies ligands into high-affinity binders (HABs) and low-affinity binders (LABs) based on AutoDock Vina binding energies

      III. Generator - trains graph neural network (GNN) models (GCM, GCN, AFP, GAT) to predict binding affinity using synthetic ligands

      IV. BioRanker - prioritizes ligands based on statistical and bioactivity data

      The authors apply Gcoupler to study the Ste2p-Gpa1p interface in yeast, identifying sterols such as zymosterol (ZST) and lanosterol (LST) as modulators of GPCR signaling. Our review will focus on the computational aspects of the work. Overall, we found the Gcoupler approach interesting and potentially valuable, but we have several concerns with the methods and validation that need to be addressed prior to publication/dissemination.

      We express our gratitude to Reviewer #1 for their concise summary and commendation of our work. We sincerely apologize for the lack of sufficient detail in summarizing the underlying methods employed in Gcoupler, as well as its subsequent experimental validations using yeast, human cell lines, and primary rat cardiomyocyte-based assays.

      We wish to state that substantial improvements have been made in the revised manuscript, and every section has been elaborated upon to enhance clarity. Please refer to the point-by-point response below and the revised manuscript.

      Query: (1) The exact algorithmic advancement of the Synthesizer beyond being some type of application wrapper around LigBuilder is unclear. Is the grow-link approach mentioned in the methods already a component of LigBuilder, or is it custom? If it is custom, what does it do? Is the API for custom optimization routines new with the Synthesizer, or is this a component of LigBuilder? Is the genetic algorithm novel or already an existing software implementation? Is the cavity detection tool a component of LigBuilder or novel in some way? Is the fragment library utilized in the Synthesizer the default fragment library in LigBuilder, or has it been customized? Are there rules that dictate how molecule growth can occur? The scientific contribution of the Synthesizer is unclear. If there has not been any new methodological development, then it may be more appropriate to just refer to this part of the algorithm as an application layer for LigBuilder.

      We appreciate Reviewer #1's constructive suggestion. We wish to emphasize that

      (1) The LigBuilder software comprises various modules designed for distinct functions. The Synthesizer in Gcoupler strategically utilizes two of these modules: "CAVITY" for binding site detection and "BUILD" for de novo ligand design.

      (2) While both modules are integral to LigBuilder, the Synthesizer plays a crucial role in enabling their targeted, automated, and context-aware application for GPCR drug discovery.

      (3) The CAVITY module is a structure-based protein binding site detection program, which the Synthesizer employs for identifying ligand binding sites on the protein surface.

      (4) The Synthesizer also leverages the BUILD module for constructing molecules tailored to the target protein, implementing a fragment-based design strategy using its integrated fragment library.

      (5) The GROW and LINK methods represent two independent approaches encompassed within the aforementioned BUILD module.

      Author response image 1.

      Schematic representation of the key strategy used in the Synthesizer module of Gcoupler.

      Our manuscript details the "grow-link" hybrid approach, which was implemented using a genetic algorithm through the following stages:

      (1) Initial population generation based on a seed structure via the GROW method.

      (2) Selection of "parent" molecules from the current population for inclusion in the mating pool using the LINK method.

      (3) Transfer of "elite" molecules from the current population to the new population.

      (4) Population expansion through structural manipulations (mutation, deletion, and crossover) applied to molecules within the mating pool.
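
The four stages above follow the shape of a generic elitist genetic algorithm, which can be sketched as follows. This is a toy illustration only: molecules are represented as lists of integer fragment IDs, the fitness function is supplied by the caller, and none of the names correspond to LigBuilder's GROW, LINK, or BUILD routines.

```python
import random


def evolve(seed, fragments, fitness, pop_size=20, n_elite=2, generations=10, rng=None):
    """Toy elitist genetic algorithm over lists of fragment IDs."""
    rng = rng or random.Random()
    # (1) Initial population: grow the seed by one randomly chosen fragment.
    population = [seed + [rng.choice(fragments)] for _ in range(pop_size)]
    for _ in range(generations):
        ranked = sorted(population, key=fitness, reverse=True)
        # (2) Mating pool: the fitter half of the current population.
        pool = ranked[: max(2, pop_size // 2)]
        # (3) Elite molecules pass unchanged into the new population.
        new_population = [list(m) for m in ranked[:n_elite]]
        # (4) Expand the population by crossover, mutation, or deletion.
        while len(new_population) < pop_size:
            a, b = rng.sample(pool, 2)
            op = rng.choice(("crossover", "mutation", "deletion"))
            if op == "crossover":
                cut = rng.randint(1, min(len(a), len(b)))
                child = a[:cut] + b[cut:]
            elif op == "mutation":
                child = list(a)
                child[rng.randrange(len(child))] = rng.choice(fragments)
            else:
                child = list(a)
                if len(child) > 1:
                    del child[rng.randrange(len(child))]
            new_population.append(child)
        population = new_population
    return max(population, key=fitness)


# Hypothetical usage: fragments are integers and fitness is simply their sum.
best = evolve(seed=[0], fragments=[1, 2, 3], fitness=sum)
print(best)
```

Because the elite molecules are copied forward each generation, the best fitness in the population can never decrease, which is the usual reason for including an elitism step.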

      Please note, the outcome of this process is not fixed, as it is highly dependent on the target cavity topology and the constraint parameters employed for population evaluation. Synthesizer customizes generational cycles and optimization parameters based on cavity-specific constraints, with the objective of either generating a specified number of compounds or comprehensively exploring chemical diversity against a given cavity topology.

      While these components are integral to LigBuilder, Synthesizer's innovation lies in the following:

      (1) Its programmatic integration and dynamic adjustment of these modules.

      (2) Synthesizer distinguishes itself not by reinventing these algorithms, but by their automated coordination, fine-tuning, and integration within a cavity-specific framework.

      (3) It dynamically modifies generation parameters according to cavity topology and druggability constraints, a capability not inherently supported by LigBuilder.

      (4) This renders Synthesizer particularly valuable in practical scenarios where manual optimization is either inefficient or impractical.

      In summary, Synthesizer offers researchers a streamlined interface, abstracting the technical complexities of LigBuilder and thereby enabling more accessible and reproducible ligand generation pipelines, especially for individuals with limited experience in structural or cheminformatics tools.

      Query: (2) The use of AutoDock Vina binding energy scores to classify ligands into HABs and LABs is problematic. AutoDock Vina's energy function is primarily tuned for pose prediction and displays highly system-dependent affinity ranking capabilities. Moreover, the HAB/LAB thresholds of -7 kcal/mol or -8 kcal/mol lack justification. Were these arbitrarily selected cutoffs, or was benchmarking performed to identify appropriate cutoffs? It seems like these thresholds should be determined by calibrating the docking scores with experimental binding data (e.g., known binders with measured affinities) or through re-scoring molecules with a rigorous alchemical free energy approach.

      We again express our gratitude to Reviewer #1 for these inquiries. We sincerely apologize for the lack of sufficient detail in the original version of the manuscript. In the revised manuscript, we have ensured the inclusion of a detailed rationale for every threshold utilized to prioritize high-affinity binders. Please refer to the comprehensive explanation below, as well as the revised manuscript, for further details.

      We would like to clarify that:

      (1) The Authenticator module is not solely reliant on absolute binding energy values for classification. Instead, it calculates binding energies for all generated compounds and applies a statistical decision-making layer to define HAB and LAB classes.

      (2) Rather than using fixed thresholds, the module employs distribution-based methods, such as the Empirical Cumulative Distribution Function (ECDF), to assess the overall energy landscape of the compound set. We then applied multiple statistical tests to evaluate the HAB and LAB distributions and determine an optimal, data-specific cutoff that balances class sizes and minimizes overlap.

      (3) This adaptive approach avoids rigid thresholds and ensures context-sensitive classification, with safeguards in place to maintain adequate representation of both classes for downstream model training. In this way, the framework prioritizes robust statistical reasoning over arbitrary energy cutoffs and aims to reduce the risks associated with direct reliance on Vina scores alone.

      (4) To assess the necessity and effectiveness of the Authenticator module, we conducted a benchmarking analysis where we deliberately omitted the HAB and LAB class labels, treating the compound pool as a heterogeneous, unlabeled dataset. We then performed random train-test splits using the Synthesizer-generated compounds and trained independent models.

      (5) The results from this approach demonstrated notably poorer model performance, indicating that arbitrary or unstructured data partitioning does not effectively capture the underlying affinity patterns. These experiments highlight the importance of using the statistical framework within the Authenticator module to establish meaningful, data-driven thresholds for distinguishing High- and Low-Affinity Binders. The cutoff values are thus not arbitrary but emerge from a systematic benchmarking and validation process tailored to each dataset.
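
The distribution-based idea in point (2) can be sketched with a plain empirical CDF. The example below is a simplification under stated assumptions: it places the cutoff at a fixed ECDF fraction (`hab_fraction`, a hypothetical parameter), whereas the Authenticator is described as selecting the cutoff through multiple statistical tests. The energies are made-up values for illustration.

```python
def ecdf(values):
    """Sorted values and their empirical cumulative probabilities."""
    xs = sorted(values)
    n = len(xs)
    return xs, [(i + 1) / n for i in range(n)]


def cutoff_at_fraction(values, fraction):
    """Smallest value whose ECDF is at least `fraction`."""
    xs, ps = ecdf(values)
    for x, p in zip(xs, ps):
        if p >= fraction:
            return x
    return xs[-1]


def label_binders(energies, hab_fraction=0.25):
    """Label the most negative energies as HAB and the rest as LAB."""
    cutoff = cutoff_at_fraction(energies, hab_fraction)
    return cutoff, ["HAB" if e <= cutoff else "LAB" for e in energies]


# Hypothetical docking energies in kcal/mol (more negative = stronger binding).
energies = [-9.1, -8.4, -7.9, -7.2, -6.8, -6.1, -5.5, -5.0]
cutoff, labels = label_binders(energies)
print(cutoff, labels)
```

The cutoff here depends on the shape of each dataset's energy distribution rather than on a fixed kcal/mol value, which is the property the response emphasizes.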

      Please note: While calibrating docking scores with experimental binding affinities or using rigorous methods like alchemical free energy calculations can improve precision, these approaches are often computationally intensive and reliant on the availability of high-quality experimental data, a major limitation in many real-world screening scenarios.

      In summary, the primary goal of Gcoupler is to enable fast, scalable, and broadly accessible screening, particularly for cases where experimental data is sparse or unavailable. Incorporating such resource-heavy methods would not only significantly increase computational overhead but also undermine the framework’s intended usability and efficiency for large-scale applications. Instead, our workflow relies on statistically robust, data-driven classification methods that balance speed, generalizability, and practical feasibility.

      Query: (3) Neither the Results nor Methods sections provide information on how the GNNs were trained in this study. Details such as node features, edge attributes, standardization, pooling, activation functions, layers, dropout, etc., should all be described in detail. The training protocol should also be described, including loss functions, independent monitoring and early stopping criteria, learning rate adjustments, etc.

      We again thank Reviewer #1 for this suggestion. We would like to mention that in the revised manuscript, we have added all the requested details. Please refer to the points below for more information.

      (1) The Generator module of Gcoupler is designed as a flexible and automated framework that leverages multiple Graph Neural Network architectures, including Graph Convolutional Model (GCM), Graph Convolutional Network (GCN), Attentive FP, and Graph Attention Network (GAT), to build classification models based on the synthetic ligand datasets produced earlier in the pipeline.

      (2) By default, Generator tests all four models using standard hyperparameters provided by the DeepChem framework (https://deepchem.io/), offering a baseline performance comparison across architectures. This includes pre-defined choices for node features, edge attributes, message-passing layers, pooling strategies, activation functions, and dropout values, ensuring reproducibility and consistency. All models are trained with binary cross-entropy loss and support default settings for early stopping, learning rate, and batch standardization where applicable.

      (3) In addition, Generator supports model refinement through hyperparameter tuning and k-fold cross-validation (default: 3 folds). Users can either customize the hyperparameter grid or rely on Generator’s recommended parameter ranges to optimize model performance. This allows for robust model selection and stability assessment of tuned parameters.

      (4) Finally, the trained models can be used to predict binding probabilities for user-supplied compounds, making it a comprehensive and user-adaptive tool for ligand screening.
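
For readers unfamiliar with the cross-validation scheme mentioned in point (3), a k-fold split with the stated default of three folds can be sketched in plain Python. This is an illustration of the technique only; it does not reproduce Gcoupler's or DeepChem's own splitting code, and the function name and `seed` parameter are hypothetical.

```python
import random


def k_fold_indices(n_samples, k=3, seed=0):
    """Yield (train, test) index lists for k-fold cross-validation."""
    if not 2 <= k <= n_samples:
        raise ValueError("k must be between 2 and n_samples")
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Deal the shuffled indices into k folds of near-equal size.
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [j for m, fold in enumerate(folds) if m != i for j in fold]
        yield train, test


for fold, (train, test) in enumerate(k_fold_indices(10)):
    print(fold, len(train), len(test))
```

Each sample appears in exactly one test fold, so every compound contributes to both training and evaluation across the three runs.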

      Based on Reviewer #1's suggestion, we have now added a detailed description of the Generator module of Gcoupler and provided relevant citations regarding the DeepChem workflow.

      Query: (4) GNN model training seems to occur on at most 500 molecules per training run? This is unclear from the manuscript. That is a very small number of training samples if true. Please clarify. How was upsampling performed? What were the HAB/LAB class distributions? In addition, it seems as though only synthetically generated molecules are used for training, and the task is to discriminate synthetic molecules based on their docking scores. Synthetic ligands generated by LigBuilder may occupy distinct chemical space, making classification trivial, particularly in the setting of a random split k-folds validation approach. In the absence of a leave-class-out validation, it is unclear if the model learns generalizable features or exploits clear chemical differences. Historically, it was inappropriate to evaluate ligand-based QSAR models on synthetic decoys such as the DUD-E sets - synthetic ligands can be much more easily distinguished by heavily parameterized ligand-based machine learning models than by physically constrained single-point docking score functions.

      We thank reviewer #1 for these detailed technical queries. We would like to clarify that:

      (1) The recommended minimum for the training set is 500 molecules, but users can add as many synthesized compounds as needed to thoroughly explore the chemical space related to the target cavity.

      (2) Our systematic evaluation demonstrated that expanding the training set size consistently enhanced model performance, especially when compared to AutoDock docking scores. This observation underscores the framework's scalability and its ability to improve predictive accuracy with more training compounds.

      (3) The Authenticator module initially categorizes all synthesized molecules into HAB and LAB classes. These labeled molecules are then utilized for training the Generator module. To tackle class imbalance, the class with fewer data points undergoes upsampling. This process aims to achieve an approximate 1:1 ratio between the two classes, thereby ensuring balanced learning during GNN model training.

      (4) The Authenticator module's affinity scores are the primary determinant of the HAB/LAB class distribution, with a higher cutoff for HABs ensuring statistically significant class separation. This distribution is also indirectly shaped by the target cavity's topology and druggability, as the Synthesizer tends to produce more potent candidates for cavities with favorable binding characteristics.

      (5) While it's true that synthetic ligands may occupy distinct chemical space, our benchmarking exploration for different sites on the same receptor still showed inter-cavity specificity along with intra-cavity diversity of the synthesized molecules.

      (6) The utility of random k-fold validation shouldn't be dismissed outright; it provides a reasonable estimate of performance under practical settings where class boundaries are often unknown. Nonetheless, we agree that complementary validation strategies like leave-class-out could further strengthen the robustness assessment.

      (7) We agree that using synthetic decoys like those from the DUD-E dataset can introduce bias in ligand-based QSAR model evaluations if not handled carefully. In our workflow, the inclusion of DUD-E compounds is entirely optional and only considered as a fallback, specifically in scenarios where the number of low-affinity binders (LABs) synthesized by the Synthesizer module is insufficient to proceed with model training.

      (8) The primary approach relies on classifying generated compounds based on their derived affinity scores via the Authenticator module. However, in rare cases where this results in a heavily imbalanced dataset, DUD-E compounds are introduced not as part of the core benchmarking, but solely to maintain minimal class balance for initial model training. Even then, care is taken to interpret results with this limitation in mind. Ultimately, our framework is designed to prioritize data-driven generation of both HABs and LABs, minimizing reliance on synthetic decoys wherever possible.
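
The class balancing described in point (3) can be sketched as random upsampling of the smaller class. The response does not specify how upsampling is performed, so sampling with replacement is an assumption here, and the function name is hypothetical.

```python
import random


def upsample_minority(hab, lab, rng=None):
    """Duplicate random members of the smaller class until both classes match in size."""
    rng = rng or random.Random()
    hab, lab = list(hab), list(lab)
    hab_is_minority = len(hab) < len(lab)
    minority, majority = (hab, lab) if hab_is_minority else (lab, hab)
    if not minority:
        raise ValueError("cannot upsample an empty class")
    # Assumed strategy: draw the missing samples with replacement.
    extra = [rng.choice(minority) for _ in range(len(majority) - len(minority))]
    balanced = minority + extra
    return (balanced, lab) if hab_is_minority else (hab, balanced)


# Hypothetical usage with placeholder compound identifiers.
hab, lab = upsample_minority(["h1", "h2"], ["l1", "l2", "l3", "l4", "l5"])
print(len(hab), len(lab))
```

Upsampling of this kind should be applied to the training portion only; duplicating compounds before a train-test split would place copies of the same molecule on both sides of the split.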

      Author response image 2.

      Scatter plots depicting the segregation of High/Low-Affinity Metabolites (HAM/LAM) (indicated in green and red) identified using the Gcoupler workflow with 100% of the training data. Notably, models trained on smaller training sets (25%, 50%, and 75% of HAB/LAB) failed to segregate HAM and LAM (along the Y-axis). The X-axis represents the binding affinity calculated using IC4-specific docking with AutoDock.

      Based on Reviewer #1's suggestion, we have now added all these technical details to the revised version of the manuscript.

      Query: (5) Training QSAR models on docking scores to accelerate virtual screening is not in itself novel (see here for a nice recent example: https://www.nature.com/articles/s43588-025-00777-x), but can be highly useful to focus structure-based analysis on the most promising areas of ligand chemical space; however, we are perplexed by the motivation here. If only a few hundred or a few thousand molecules are being sampled, why not just use AutoDock Vina? The models are trained to try to discriminate molecules by AutoDock Vina score rather than experimental affinity, so it seems like we would ideally just run Vina? Perhaps we are misunderstanding the scale of the screening that was done here. Please clarify the manuscript methods to help justify the approach.

      We acknowledge the effectiveness of training QSAR models on docking scores for prioritizing chemical space, as demonstrated by the referenced study (https://www.nature.com/articles/s43588-025-00777-x) on machine-learning-guided docking screen frameworks.

      We would like to mention that:

      (1) Such protocols often rely on extensive pre-docked datasets across numerous protein targets or utilize a highly skewed input distribution, training on as little as 1-10% of ligand-protein complexes and testing on the remainder in iterative cycles.

      (2) While powerful for ultra-large libraries, this approach can introduce bias towards the limited training set and incur significant overhead in data curation, pre-computation, and infrastructure.

      (3) In contrast, Gcoupler prioritizes flexibility and accessibility, especially when experimental data is scarce and large pre-docked libraries are unavailable. Instead of depending on fixed docking scores from external pipelines, Gcoupler integrates target-specific cavity detection, de novo compound generation, and model training into a self-contained, end-to-end framework. Its QSAR models are trained directly on contextually relevant compounds synthesized for a given binding site, employing a statistical classification strategy that avoids arbitrary thresholds or precomputed biases.

      (4) Furthermore, Gcoupler is open-source, lightweight, and user-friendly, making it easily deployable without the need for extensive infrastructure or prior docking expertise. While not a complete replacement for full-scale docking in all use cases, Gcoupler aims to provide a streamlined and interpretable screening framework that supports both focused chemical design and broader chemical space exploration, without the computational burden associated with deep learning docking workflows.

      (5) Practically, even with computational resources, manually running AutoDock Vina on millions of compounds presents challenges such as format conversion, binding site annotation, grid parameter tuning, and execution logistics, all typically requiring advanced structural bioinformatics expertise.

      (6) Gcoupler's Authenticator module, however, streamlines this process. Users only need to input a list of SMILES and a receptor PDB structure, and the module automatically handles compound preparation, cavity mapping, parameter optimization, and high-throughput scoring. This automation reduces time and effort while democratizing access to structure-based screening workflows for users without specialized expertise.

      Ultimately, Gcoupler's motivation is to make large-scale, structure-informed virtual screening both efficient and accessible. The model serves as a surrogate to filter and prioritize compounds before deeper docking or experimental validation, thereby accelerating targeted drug discovery.

      Query: (6) The brevity of the MD simulations raises some concerns that the results may be over-interpreted. RMSD plots do not reliably compare the affinity behavior in this context because of the short timescales coupled with the dramatic topological differences between the ligands being compared; CoQ6 is long and highly flexible compared to ZST and LST. Convergence metrics, such as block averaging and time-dependent MM/GBSA energies, should be included over much longer timescales. For CoQ6, the authors may need to run multiple simulations of several microseconds, identify the longest-lived metastable states of CoQ6, and perform MM/GBSA energies for each state weighted by each state's probability.

      We appreciate Reviewer #1's suggestion regarding simulation length, as it is indeed crucial for interpreting molecular dynamics (MD) outcomes. We would like to mention that:

      (1) Our simulation strategy varied based on the analysis objective, ranging from short (~5 ns) runs for preliminary or receptor-only evaluations to intermediate (~100 ns) and extended (~550 ns) runs for receptor-ligand complex validation and stability assessment.

      (2) Specifically, we conducted three independent 100 ns MD simulations for each receptor-metabolite complex in distinct cavities of interest. This allowed us to assess the reproducibility and persistence of binding interactions. To further support these observations, a longer 550 ns simulation was performed for the IC4 cavity, which reinforced the 100 ns findings by demonstrating sustained interaction stability over extended timescales.

      (3) While we acknowledge that even longer simulations (e.g., in the microsecond range) could provide deeper insights into metastable state transitions, especially for highly flexible molecules like CoQ6, our current design balances computational feasibility with the goal of screening multiple cavities and ligands.

      (4) In our current workflow, MM/GBSA binding free energies were calculated by extracting 1000 representative snapshots from the final 10 ns of each MD trajectory. These configurations were used to compute time-averaged binding energies, incorporating contributions from van der Waals, electrostatic, polar, and non-polar solvation terms. This approach offers a more reliable estimate of ligand binding affinity compared to single-point molecular docking, as it accounts for conformational flexibility and dynamic interactions within the binding cavity.

      (5) Although we did not explicitly perform state-specific MM/GBSA calculations weighted by metastable state probabilities, our use of ensemble-averaged energy estimates from a thermally equilibrated segment of the trajectory captures many of the same benefits. We acknowledge, however, that a more rigorous decomposition based on metastable state analysis could offer finer resolution of binding behavior, particularly for highly flexible ligands like CoQ6, and we consider this a valuable direction for future refinement of the framework.
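
The reviewer's request for convergence metrics such as block averaging can be illustrated on a series of per-snapshot MM/GBSA energies. The sketch below is not part of the reported workflow; it assumes the snapshot energies are available as a time-ordered list, and the number of blocks is a hypothetical choice.

```python
import math
import statistics


def block_average(values, n_blocks=10):
    """Mean and standard error estimated from the means of contiguous blocks."""
    block_size = len(values) // n_blocks
    if n_blocks < 2 or block_size < 1:
        raise ValueError("need at least two blocks with one value each")
    block_means = [
        statistics.fmean(values[i * block_size:(i + 1) * block_size])
        for i in range(n_blocks)
    ]
    mean = statistics.fmean(block_means)
    # Standard error of the mean across blocks; blocks are treated as independent.
    sem = statistics.stdev(block_means) / math.sqrt(n_blocks)
    return mean, sem


# Hypothetical series: 1000 snapshot energies in kcal/mol, time-ordered.
energies = [-30.0 + 0.01 * (i % 7) for i in range(1000)]
print(block_average(energies))
```

If the standard error keeps growing as the blocks are made longer, successive snapshots are still correlated on that timescale and the averaged energy has not converged.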

      Reviewer #2 (Public review):

      Summary:

      Query: Mohanty et al. present a new deep learning method to identify intracellular allosteric modulators of GPCRs. This is an interesting field for, e.g., the design of novel small-molecule inhibitors of GPCR signalling. A key limitation, as mentioned by the authors, is the limited availability of data. The method presented, Gcoupler, aims to overcome this limitation, as shown by experimental validation of sterols in the inhibition of Ste2p, which have been shown to be relevant molecules in human and rat cardiac hypertrophy models. They have made their code available for download and installation, with instructions that can easily be followed to set up the software on a local machine.

      Strengths:

      Clear GitHub repository

      Extensive data on yeast systems

      We sincerely thank Reviewer #2 for their thorough review, summary, and appreciation of our work. We highly value their comments and suggestions.

      Weaknesses:

      Query: No assay to directly determine the affinity of the compounds to the protein of interest.

      We thank Reviewer #2 for raising these insightful questions. During the experimental design phase, we carefully accounted for validating the impact of metabolites in the rescue response by pheromone.

      We would like to mention that we performed an array of methods to validate our hypothesis and observed similar rescue effects. These assays include:

      a. Cell viability assay (FDA/PI Fluorometry-based)

      b. Cell growth assay

      c. FUN1™-based microscopy assessment

      d. Shmoo formation assays

      e. Mating assays

      f. Site-directed mutagenesis-based loss of function

      g. Transgenic reporter-based assay

      h. MAPK signaling assessment using Western blot.

      i. Computational techniques

      Concerning the in vitro interaction studies of Ste2p and metabolites, we made significant efforts to purify Ste2p by incorporating a His tag at the N-terminal. Despite dedicated attempts over the past year, we were unsuccessful in purifying the protein, primarily due to our limited expertise in protein purification for this specific system. As a result, we opted for genetic-based interventions (e.g., point mutants), which provide a more physiological and comprehensive approach to demonstrating the interaction between Ste2p and the metabolites.

      Author response image 3.

      (a) Affinity purification of Ste2p from Saccharomyces cerevisiae. Western blot analysis using anti-His antibody showing the distribution of Ste2p in various fractions during the affinity purification process. The fractions include pellet, supernatant, wash buffer, and sequential elution fractions (1–4). Wild-type and ste2Δ strains served as positive and negative controls, respectively. (b) Optimization of Ste2p extraction protocol. Ponceau staining (left) and Western blot analysis using anti-His antibody (right) showing Ste2p extraction efficiency. The conditions tested include lysis buffers containing different concentrations of CHAPS detergent (0.5%, 1%) and glycerol (10%, 20%).

      Furthermore, in addition to the clarification above, we have added the following statement in the discussion section to tone down our claims: “A critical limitation of our study is the absence of direct binding assays to validate the interaction between the metabolites and Ste2p. While our results from genetic interventions, molecular dynamics simulations, and docking studies strongly suggest that the metabolites interact with the Ste2p-Gpa1 interface, these findings remain indirect. Direct binding confirmation through techniques such as surface plasmon resonance, isothermal titration calorimetry, or co-crystallization would provide definitive evidence of this interaction. Addressing this limitation in future work would significantly strengthen our conclusions and provide deeper insights into the precise molecular mechanisms underlying the observed phenotypic effects.”

      We request Reviewer #2 to kindly refer to the assays conducted on the point mutants created in this study, as these experiments offer robust evidence supporting our claims.

      Query: In conclusion, the authors present an interesting new method to identify allosteric inhibitors of GPCRs, which can easily be employed by research labs. While the authors have made efforts to characterize the compounds in yeast cells to confirm their findings, it would be beneficial if they showed that their compounds are active in a simple binding assay.

      We express our gratitude and sincere appreciation for the time and effort dedicated by Reviewer #2 in reviewing our manuscript. We are confident that our clarifications address the reviewer's concerns.

      Reviewer #3 (Public review):

      Summary:

      Query: In this paper, the authors introduce the Gcoupler software, an open-source deep learning-based platform for structure-guided discovery of ligands targeting GPCR interfaces. Overall, this manuscript represents a field-advancing contribution at the intersection of AI-based ligand discovery and GPCR signaling regulation.

      Strengths:

      The paper presents a comprehensive and well-structured workflow combining cavity identification, de novo ligand generation, statistical validation, and graph neural network-based classification. Notably, the authors use Gcoupler to identify endogenous intracellular sterols as allosteric modulators of the GPCR-Gα interface in yeast, with experimental validations extending to mammalian systems. The ability to systematically explore intracellular metabolite modulation of GPCR signaling represents a novel and impactful contribution. This study significantly advances the field of GPCR biology and computational ligand discovery.

      We thank Reviewer #3 for investing time and effort in reviewing our manuscript and for appreciating our work.

      Recommendations for the authors:

      Reviewing Editor Comments:

      We encourage the authors to address the points raised during revision to elevate the assessment from "incomplete" to "solid" or ideally "convincing." In particular, we ask the authors to improve the justification for their methodological choices and to provide greater detail and clarity regarding each computational layer of the pipeline.

      We are grateful for the editors' suggestions. We have incorporated significant revisions into the manuscript, providing comprehensive technical details to prevent any misunderstandings. Furthermore, we meticulously explained every aspect of the computational workflow.

      Reviewer #2 (Recommendations for the authors):

      Query: Would it be possible to make the package itself pip installable?

      Yes. The package was already available on the TestPyPI repository, and we have now migrated it to the main PyPI index. It can be accessed here: https://pypi.org/project/gcoupler/
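
      As a minimal sketch of how a user might confirm the installation, the snippet below looks up the installed distribution by name. The distribution name "gcoupler" is taken from the PyPI URL above; the helper name `installed_version` is hypothetical, and nothing here relies on Gcoupler's own API.

```python
from importlib import metadata
from typing import Optional


def installed_version(dist_name: str) -> Optional[str]:
    """Return the installed version of a distribution, or None if it is absent."""
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return None


# "gcoupler" is the distribution name taken from the PyPI URL above.
version = installed_version("gcoupler")
if version is None:
    print("gcoupler is not installed; try: python -m pip install gcoupler")
else:
    print(f"gcoupler {version} is installed")
```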

      Query: I am confused by the binding free energies reported in Supplementary Figure 8. Is the total DG reported that of the protein-ligand complex? If that is the case, the affinities of the ligands would be extremely high. They are also very far off from the reported -7 kcal/mol active/inactive cut-off.

      We thank Reviewer #2 for this query. We would like to mention that we have provided a detailed explanation in the point-by-point response to Reviewer #2's original comment. Briefly, to clarify, the -7 kcal/mol active/inactive cutoff mentioned in the manuscript refers specifically to the docking-based binding free energies (ΔG) calculated using AutoDock or AutoDock Vina, which are used for compound classification or validation against the Gcoupler framework.

      In contrast, the binding free energies reported in Supplementary Figure 8 are obtained through the MM-GBSA method, which provides a more detailed and physics-based estimate of binding affinity by incorporating solvation and enthalpic contributions. It is well-documented in the literature that MM-GBSA tends to systematically underestimate absolute binding free energies when compared to experimental values (10.2174/1568026616666161117112604; Table 1).

      Author response image 4.

      Scatter plot comparing the predicted binding affinity calculated by Docking and MM/GBSA methods, against experimental ΔG (10.1007/s10822-023-00499-0)

      Our use of MM-GBSA is not intended to match experimental ΔG directly, but rather to assess relative binding preferences among ligands. Despite its limitations in predicting absolute affinities, MM-GBSA is known to perform better than docking for ranking compounds by their binding potential. In this context, a more negative MM-GBSA energy still indicates stronger predicted binding, even if the numerical values are far larger in magnitude than typical experimental or docking-derived cutoffs.

      Thus, the two energy values, docking-based and MM-GBSA, serve different purposes in our workflow. Docking scores are used for classification and thresholding, while MM-GBSA energies provide post hoc validation and a higher-resolution comparison of binding strength across compounds.
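
      The distinction between the two energy scales can be illustrated with a short sketch. It uses the standard relation ΔG = RT ln(Kd) (1 M standard state) to show why a docking score near the -7 kcal/mol cutoff maps to a plausible affinity, whereas an MM-GBSA total read as an absolute ΔG would not. The ligand names and energy values are invented for illustration and are not taken from the manuscript; the function names are hypothetical, and treating scores at or below the cutoff as active is an assumption about the convention.

```python
import math

R_KCAL = 1.987204e-3  # gas constant in kcal/(mol*K)


def dg_to_kd(delta_g_kcal: float, temperature_k: float = 298.15) -> float:
    """Convert a binding free energy (kcal/mol) to a dissociation constant (M)."""
    return math.exp(delta_g_kcal / (R_KCAL * temperature_k))


def is_active(docking_dg_kcal: float, cutoff: float = -7.0) -> bool:
    """Docking-based label; scores at or below the cutoff count as active (assumed)."""
    return docking_dg_kcal <= cutoff


# Hypothetical energies in kcal/mol, for illustration only.
ligands = {
    "ligand_A": {"docking": -8.1, "mmgbsa": -52.0},
    "ligand_B": {"docking": -6.4, "mmgbsa": -31.5},
    "ligand_C": {"docking": -7.3, "mmgbsa": -44.2},
}

for name, energies in ligands.items():
    label = "active" if is_active(energies["docking"]) else "inactive"
    kd_docking = dg_to_kd(energies["docking"])
    kd_mmgbsa = dg_to_kd(energies["mmgbsa"])  # not physically meaningful as an absolute Kd
    print(f"{name}: {label}, docking Kd ~ {kd_docking:.2e} M, "
          f"naive MM-GBSA Kd ~ {kd_mmgbsa:.2e} M")

# MM-GBSA is used only to order compounds: most negative energy first.
ranking = sorted(ligands, key=lambda n: ligands[n]["mmgbsa"])
print("MM-GBSA ranking:", ranking)
```

      At 298.15 K, -7 kcal/mol corresponds to a Kd in the low micromolar range, while a total of several tens of kcal/mol would imply an unrealistically tight complex, which is why such totals are better read as relative scores.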

      To corroborate their findings, can the authors include direct binding affinity assays for yeast and human Ste2p? This will help in establishing whether the observed phenotypic effects are indeed driven by binding of the metabolites.

      We thank Reviewer #2 for raising these insightful questions. During the experimental design phase, we carefully accounted for validating the impact of metabolites in the rescue response by pheromone.

      We would like to mention that we performed an array of methods to validate our hypothesis and observed similar rescue effects. These assays include:

      a. Cell viability assay (FDA/PI Fluorometry-based)

      b. Cell growth assay

      c. FUN1™-based microscopy assessment

      d. Shmoo formation assays

      e. Mating assays

      f. Site-directed mutagenesis-based loss of function

      g. Transgenic reporter-based assay

      h. MAPK signaling assessment using Western blot.

      i. And via computational techniques.

      Concerning the in vitro interaction studies of Ste2p and metabolites, we made significant efforts to purify Ste2p by incorporating a His tag at the N-terminal. Despite dedicated attempts over the past year, we were unsuccessful in purifying the protein, primarily due to our limited expertise in protein purification for this specific system. As a result, we opted for genetic-based interventions (e.g., point mutants), which provide a more physiological and comprehensive approach to demonstrating the interaction between Ste2p and the metabolites.

      Furthermore, in addition to the clarification above, we have added the following statement in the discussion section to tone down our claims: “A critical limitation of our study is the absence of direct binding assays to validate the interaction between the metabolites and Ste2p. While our results from genetic interventions, molecular dynamics simulations, and docking studies strongly suggest that the metabolites interact with the Ste2p-Gpa1 interface, these findings remain indirect. Direct binding confirmation through techniques such as surface plasmon resonance, isothermal titration calorimetry, or co-crystallization would provide definitive evidence of this interaction. Addressing this limitation in future work would significantly strengthen our conclusions and provide deeper insights into the precise molecular mechanisms underlying the observed phenotypic effects.”

      We request Reviewer #2 to kindly refer to the assays conducted on the point mutants created in this study, as these experiments offer robust evidence supporting our claims.

      Did the authors perform expression assays to make sure the mutant proteins were similarly expressed to wt?

      We thank Reviewer #2 for this comment. We would like to mention that:

      (1) In our assays based on the mutants (S75A, T155D, L289K), all mutants were generated by integration at the same chromosomal TRP1 locus under the GAL1 promoter and share the same C-terminal CYC1 terminator sequence used for the reconstituted wild-type (rtWT) construct, thus reducing the likelihood of strain-specific expression differences.

      (2) Furthermore, all strains were grown under identical conditions using the same media, temperature, and shaking parameters. Each construct underwent the same GAL1 induction protocol in YPGR medium for identical durations, ensuring uniform transcriptional activation across all strains and minimizing culture-dependent variability in protein expression.

      (3) Importantly, both the rtWT and two of the mutants (T155D, L289K) retained α-factor-induced cell death (PI and FUN1-based fluorometry and microscopy; Figure 4c-d) and MAPK activation (western blot; Figure 4e), demonstrating that the mutant proteins are expressed at levels sufficient to support signalling.

      Reviewer #3 (Recommendations for the authors):

      My comments that would enhance the impact of this method are:

      (1) While the authors have compared the accuracy and efficiency of Gcoupler to AutoDock Vina, one of the main points of Gcoupler is the neural network module. It would be beneficial to have it evaluated against other available deep learning ligand generative modules, such as the following: 10.1186/s13321-024-00829-w, 10.1039/D1SC04444C.

      Thank you for the observation. To clarify, our benchmarking of Gcoupler’s accuracy and efficiency was performed against AutoDock, not AutoDock Vina. This choice was intentional, as AutoDock is one of the most widely used classical techniques in computer-aided drug design (CADD) for obtaining high-resolution predictions of ligand binding energy, binding poses, and detailed atomic-level interactions with receptor residues. In contrast, AutoDock Vina is primarily optimized for large-scale virtual screening, offering faster results but typically with lower resolution and limited configurational detail.

      Since Gcoupler is designed to balance accuracy with computational efficiency in structure-based screening, AutoDock served as a more appropriate reference point for evaluating its predictions.

      We agree that benchmarking against other deep learning-based ligand generative tools is important for contextualizing Gcoupler’s capabilities. However, it's worth noting that only a few existing methods focus specifically on cavity- or pocket-driven de novo drug design using generative AI, and among them, most are either partially closed-source or limited in functionality.

      While PocketCrafter (10.1186/s13321-024-00829-w) offers a structure-based generative framework, it differs from Gcoupler in several key respects. PocketCrafter requires proprietary preprocessing tools, such as the MOE QuickPrep module, to prepare protein pocket structures, limiting its accessibility and reproducibility. In addition, PocketCrafter’s pipeline stops at the generation of cavity-linked compounds and does not support any further learning from the generated data.

      Similarly, DeepLigBuilder (10.1039/D1SC04444C) provides de novo ligand generation using deep learning, but the source code is not publicly available, preventing direct benchmarking or customization. Like PocketCrafter, it also lacks integrated learning modules, which limits its utility for screening large, user-defined libraries or compounds of interest.

      Additionally, tools like AutoDesigner from Schrödinger, while powerful, are not publicly accessible and hence fall outside the scope of open benchmarking.

      Author response table 1.

      Comparison of de novo drug design tools. SBDD refers to Structure-Based Drug Design, and LBDD refers to Ligand-Based Drug Design.

      In contrast, Gcoupler is a fully open-source, end-to-end platform that integrates both Ligand-Based and Structure-Based Drug Design. It spans from cavity detection and molecule generation to automated model training using GNNs, allowing users to evaluate and prioritize candidate ligands across large chemical spaces without the need for commercial software or advanced coding expertise.

      (2) In Figure 2, the authors mention that IC4 and IC5 potential binding sites are on the direct G protein coupling interface ("This led to the identification of 17 potential surface cavities on Ste2p, with two intracellular regions, IC4 and IC5, accounting for over 95% of the Ste2p-Gpa1p interface (Figure 2a-b, Supplementary Figure 4j-n)..."). Later, however, in Figure 4, when discussing which residues affect the binding of the metabolites the most, the authors didn't perform MD simulations of mutant STE2 and just Gpa1p (without metabolites present). It would be beneficial to compare the binding of G protein with and without metabolites present, as these interface mutations might be affecting the binding of G protein by itself.

      Thank you for this insightful suggestion. While we did not perform in silico MD simulations of the mutant Ste2-Gpa1 complex in the absence of metabolites, we conducted experimental validation to functionally assess the impact of interface mutations. Specifically, we generated site-directed mutants (S75A, L289K, T155D) and expressed them in a ste2Δ background to isolate their effects.

      As shown in the Supplementary Figure, these mutants failed to rescue cells from α-factor-induced programmed cell death (PCD) upon metabolite pre-treatment. This was confirmed through fluorometry-based viability assays, FUN1™ staining, and p-Fus3 signaling analysis, which collectively monitor MAPK pathway activation (Figure 4c–e).

      Importantly, the induction of PCD in response to α-factor in these mutants demonstrates that G protein coupling is still functionally intact, indicating that the mutations do not interfere with Gpa1 binding itself. However, the absence of rescue by metabolites strongly suggests that the mutated residues play a direct role in metabolite binding at the Ste2p–Gpa1p interface, thus modulating downstream signaling.

      While further MD simulations could provide structural insight into the isolated mutant receptor–G protein interaction, our experimental data support the functional relevance of metabolite binding at the identified interface.

      (3) While the experiments performed by the authors do support the hypothesis that metabolites regulate GPCR signaling, there are no experiments evaluating direct biophysical measurements (e.g., dissociation constants are measured only in silico).

      We thank Reviewer #3 for raising these insightful comments. We would like to mention that we performed an array of methods to validate our hypothesis and observed similar rescue effects. These assays include:

      a. Cell viability assay (FDA/PI Fluorometry-based)

      b. Cell growth assay

      c. FUN1™-based microscopy assessment

      d. Shmoo formation assays

      e. Mating assays

      f. Site-directed mutagenesis-based loss of function

      g. Transgenic reporter-based assay

      h. MAPK signaling assessment using Western blot.

      i. And via computational techniques.

      Concerning the direct biophysical measurements of Ste2p and metabolites, we made significant efforts to purify Ste2p by incorporating a His tag at the N-terminal, with the goal of performing Microscale Thermophoresis (MST) and Isothermal Titration Calorimetry (ITC) measurements. Despite dedicated attempts over the past year, we were unsuccessful in purifying the protein, primarily due to our limited expertise in protein purification for this specific system. As a result, we opted for genetic-based interventions (e.g., point mutants), which provide a more physiological and comprehensive approach to demonstrating the interaction between Ste2p and the metabolites.

      Furthermore, in addition to the clarification above, we have added the following statement in the discussion section to tone down our claims: “A critical limitation of our study is the absence of direct binding assays to validate the interaction between the metabolites and Ste2p. While our results from genetic interventions, molecular dynamics simulations, and docking studies strongly suggest that the metabolites interact with the Ste2p-Gpa1 interface, these findings remain indirect. Direct binding confirmation through techniques such as surface plasmon resonance, isothermal titration calorimetry, or co-crystallization would provide definitive evidence of this interaction. Addressing this limitation in future work would significantly strengthen our conclusions and provide deeper insights into the precise molecular mechanisms underlying the observed phenotypic effects.”

      (4) The authors do not discuss the effects of the metabolites at their physiological concentrations. Overall, this manuscript represents a field-advancing contribution at the intersection of AI-based ligand discovery and GPCR signaling regulation.

      We thank Reviewer #3 for this comment and for recognising the value of our work. Although direct quantification of intracellular free metabolite levels is challenging, several lines of evidence support the physiological relevance of our test concentrations.

      - Genetic validation supports endogenous relevance: Our genetic screen of 53 metabolic knockout mutants showed that deletions in biosynthetic pathways for these metabolites consistently disrupted α-factor-induced cell death. The vast majority of strains (94.4%) resisted it, and a subset even displayed accelerated growth in the presence of α-factor. This suggests that endogenous levels of these metabolites normally provide some degree of protection, supporting their physiological role in GPCR regulation.

      - Metabolomics confirms in vivo accumulation: Our untargeted metabolomics analysis revealed that α-factor-treated survivors consistently showed enrichment of CoQ6 and zymosterol compared to sensitive cells. This demonstrates that these metabolites naturally accumulate to protective levels during stress responses, validating their biological relevance.

    1. Author response:

      The following is the authors’ response to the original reviews.

      Reviewer #1 (Public review):

      This study investigates the sex determination mechanism in the clonal ant Ooceraea biroi, focusing on a candidate complementary sex determination (CSD) locus-one of the key mechanisms supporting haplodiploid sex determination in hymenopteran insects. Using whole genome sequencing, the authors analyze diploid females and the rarely occurring diploid males of O. biroi, identifying a 46 kb candidate region that is consistently heterozygous in females and predominantly homozygous in diploid males. This region shows elevated genetic diversity, as expected under balancing selection. The study also reports the presence of an lncRNA near this heterozygous region, which, though only distantly related in sequence, resembles the ANTSR lncRNA involved in female development in the Argentine ant, Linepithema humile (Pan et al. 2024). Together, these findings suggest a potentially conserved sex determination mechanism across ant species. However, while the analyses are well conducted and the paper is clearly written, the insights are largely incremental. The central conclusion - that the sex determination locus is conserved in ants - was already proposed and experimentally supported by Pan et al. (2024), who included O. biroi among the studied species and validated the locus's functional role in the Argentine ant. The present study thus largely reiterates existing findings without providing novel conceptual or experimental advances.

      Although it is true that Pan et al., 2024 demonstrated (in Figure 4 of their paper) that the synteny of the region flanking ANTSR is conserved across aculeate Hymenoptera (including O. biroi), Reviewer 1’s claim that that paper provides experimental support for the hypothesis that the sex determination locus is conserved in ants is inaccurate. Pan et al., 2024 only performed experimental work in a single ant species (Linepithema humile) and merely compared reference genomes of multiple species to show synteny of the region, rather than functionally mapping or characterizing these regions.

      Other comments:

      The mapping is based on a very small sample size: 19 females and 16 diploid males, and these all derive from a single clonal line. This implies a rather high probability for false-positive inference. In combination with the fact that only 11 out of the 16 genotyped males are actually homozygous at the candidate locus, I think a more careful interpretation regarding the role of the mapped region in sex determination would be appropriate. The main argument supporting the role of the candidate region in sex determination is based on the putative homology with the lncRNA involved in sex determination in the Argentine ant, but this argument was made in a previous study (as mentioned above).

      Our main argument supporting the role of the candidate region in sex determination is not based on putative homology with the lncRNA in L. humile. Instead, our main argument comes from our genetic mapping (in Fig. 2), and the elevated nucleotide diversity within the identified region (Fig. 4). Additionally, we highlight that multiple genes within our mapped region are homologous to those in mapped sex determining regions in both L. humile and Vollenhovia emeryi, possibly including the lncRNA.

      In response to the Reviewer’s assertion that the mapping is based on a small sample size from a single clonal line, we want to highlight that we used all diploid males available to us. Although the primary shortcoming of a small sample size is to increase the probability of a false negative, small sample sizes can also produce false positives. We used two approaches to explore the statistical robustness of our conclusions. First, we generated a null distribution by randomly shuffling sex labels within colonies and calculating the probability of observing our CSD index values by chance (shown in Fig. 2). Second, we directly tested the association between homozygosity and sex using Fisher’s Exact Test (shown in Supplementary Fig. S2). In both cases, the association of the candidate locus with sex was statistically significant after multiple-testing correction using the Benjamini-Hochberg False Discovery Rate. These approaches are clearly described in the “CSD Index Mapping” section of the Methods.
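
      The second approach described above can be sketched as follows. This is a minimal illustration and not the authors' pipeline: it builds a 2x2 table of sex by zygosity from the counts quoted in the review (19 females, assumed here to be all heterozygous at the candidate locus, and 16 diploid males of which 11 are homozygous), and the p-values for the other genomic windows are invented placeholders. The function names are hypothetical, and only the Python standard library is used.

```python
from math import comb


def fisher_exact_two_sided(a: int, b: int, c: int, d: int) -> float:
    """Two-sided Fisher's exact test for the 2x2 table [[a, b], [c, d]]."""
    row1, row2, col1 = a + b, c + d, a + c
    denom = comb(row1 + row2, col1)

    def prob(x: int) -> float:
        # Hypergeometric probability of x counts in the top-left cell.
        return comb(row1, x) * comb(row2, col1 - x) / denom

    p_obs = prob(a)
    lowest, highest = max(0, col1 - row2), min(row1, col1)
    # Sum every table that is no more probable than the observed one.
    total = sum(p for p in map(prob, range(lowest, highest + 1))
                if p <= p_obs * (1 + 1e-9))
    return min(1.0, total)


def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the original order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    for rank in range(m, 0, -1):  # walk from the largest p-value down
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Rows: females, diploid males. Columns: homozygous, heterozygous.
p_candidate = fisher_exact_two_sided(0, 19, 11, 5)

# Placeholder p-values for other windows (hypothetical, for illustration only).
window_p_values = [p_candidate, 0.20, 0.03, 0.75, 0.41]
adjusted = benjamini_hochberg(window_p_values)
print(f"candidate window: p = {p_candidate:.2e}, BH-adjusted = {adjusted[0]:.2e}")
```

      In a genome-wide scan the correction would run over every tested window rather than five, so the adjusted value printed here only illustrates the mechanics.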

      We also note that, because complementary sex determination loci are expected to evolve under balancing selection, our finding that the mapped region exhibits a peak of nucleotide diversity lends orthogonal support to the notion that the mapped locus is indeed a complementary sex determination locus.
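
      For readers unfamiliar with the statistic, nucleotide diversity is the average number of pairwise differences per site among a set of aligned sequences. The sketch below computes it for toy haplotypes; the sequences are invented, the function name is hypothetical, and the authors' actual windowed estimate from whole-genome data may handle missing data and unequal coverage differently.

```python
from itertools import combinations


def nucleotide_diversity(sequences):
    """Mean pairwise differences per site across aligned, equal-length sequences."""
    if len(sequences) < 2:
        raise ValueError("need at least two sequences")
    length = len(sequences[0])
    if length == 0 or any(len(s) != length for s in sequences):
        raise ValueError("sequences must be non-empty and aligned to the same length")
    pairs = list(combinations(sequences, 2))
    # Count mismatching sites for every pair of sequences.
    diffs = sum(sum(x != y for x, y in zip(s1, s2)) for s1, s2 in pairs)
    return diffs / (len(pairs) * length)


# Toy haplotypes (hypothetical): a conserved window and a more variable one.
conserved_window = ["ACGTACGT", "ACGTACGT", "ACGTACGA"]
variable_window = ["ACGTACGT", "TCGAACGA", "ACCTAGGT"]

print("conserved:", nucleotide_diversity(conserved_window))
print("variable:", nucleotide_diversity(variable_window))
```

      Under balancing selection, divergent alleles are maintained at the locus, so a window overlapping it is expected to resemble the more variable toy case.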

      The fourth paragraph of the results and the sixth paragraph of the discussion are devoted to explaining the possible reasons why only 11/16 genotyped males are homozygous in the mapped region. The revised manuscript will include an additional sentence (in what will be lines 384-388) in this paragraph that includes the possible explanation that this locus is, in fact, a false positive, while also emphasizing that we find this possibility to be unlikely given our multiple lines of evidence.

      In response to Reviewer 1’s suggestion that we carefully interpret the role of the mapped region in sex determination, we highlight our careful wording choices, nearly always referring to the mapped locus as a “candidate sex determination locus” in the title and throughout the manuscript. For consistency, the revised manuscript version will change the second results subheading from “The O. biroi CSD locus is homologous to another ant sex determination locus but not to honeybee csd” to “O. biroi’s candidate CSD locus is homologous to another ant sex determination locus but not to honeybee csd,” and will add the word “candidate” in what will be line 320 at the beginning of the Discussion, and will change “putative” to “candidate” in what will be line 426 at the end of the Discussion.

      In the abstract, it is stated that CSD loci have been mapped in honeybees and two ant species, but we know little about their evolutionary history. But CSD candidate loci were also mapped in a wasp with multi-locus CSD (study cited in the introduction). This wasp is also parthenogenetic via central fusion automixis and produces diploid males. This is a very similar situation to the present study and should be referenced and discussed accordingly, particularly since the authors make the interesting suggestion that their ant also has multi-locus CSD and neither the wasp nor the ant has tra homologs in the CSD candidate regions. Also, is there any homology to the CSD candidate regions in the wasp species and the studied ant?

      In response to Reviewer 1’s suggestion that we reference the (Matthey-Doret et al. 2019) study in the context of diploid males being produced via losses of heterozygosity during asexual reproduction, the revised manuscript will include (in what will be lines 123-126) the highlighted portion of the following sentence: “Therefore, if O. biroi uses CSD, diploid males might result from losses of heterozygosity at sex determination loci (Fig. 1C), similar to what is thought to occur in other asexual Hymenoptera that produce diploid males (Rabeling and Kronauer 2012; Matthey-Doret et al. 2019).”

      We note, however, that in their 2019 study, Matthey-Doret et al. did not directly test the hypothesis that diploid males result from losses of heterozygosity at CSD loci during asexual reproduction, because the diploid males they used for their mapping study came from inbred crosses in a sexual population of that species.

      We address this further below, but we want to emphasize that we do not intend to argue that O. biroi has multiple CSD loci. Instead, we suggest that the presence of additional, undetected CSD loci is one possible explanation for the absence of diploid males from any clonal line other than clonal line A. In response to Reviewer 1’s suggestion that we reference the (Matthey-Doret et al. 2019) study in the context of multilocus CSD, the revised manuscript version will include the following additional sentence in the fifth paragraph of the discussion (in what will be lines 372-374): “Multi-locus CSD has been suggested to limit the extent of diploid male production in asexual species under some circumstances (Vorburger 2013; Matthey-Doret et al. 2019).”

      Regarding Reviewer 1’s question about homology between the putative CSD loci from the (Matthey-Doret et al. 2019) study and O. biroi, we note that there is no homology. The revised manuscript version will have an additional Supplementary Table (which will be the new Supplementary Table S3) that will report the results of this homology search. The revised manuscript will also include the following additional sentence in the Results, in what will be lines 172-174: “We found no homology between the genes within the O. biroi CSD index peak and any of the genes within the putative L. fabarum CSD loci (Supplementary Table S3).”

      The authors used different clonal lines of O. biroi to investigate whether heterozygosity at the mapped CSD locus is required for female development in all clonal lines of O. biroi (L187-196). However, given the described parthenogenesis mechanism in this species conserves heterozygosity, additional females that are heterozygous are not very informative here. Indeed, one would need diploid males in these other clonal lines as well (but such males have not yet been found) to make any inference regarding this locus in other lines.

      We agree that a full mapping study including diploid males from all clonal lines would be preferable, but as stated earlier in that same paragraph, we have only found diploid males from clonal line A. We stand behind our modest claim that “Females from all six clonal lines were heterozygous at the CSD index peak, consistent with its putative role as a CSD locus in all O. biroi.” In the revised manuscript version, this sentence (in what will be lines 199-201) will be changed slightly in response to a reviewer comment below: “All females from all six clonal lines (including 26 diploid females from clonal line B) were heterozygous at the CSD index peak, consistent with its putative role as a CSD locus in all O. biroi.”

      Reviewer #2 (Public review):

      The manuscript by Lacy et al. is well written, with a clear and compelling introduction that effectively conveys the significance of the study. The methods are appropriate and well-executed, and the results, both in the main text and supplementary materials, are presented in a clear and detailed manner. The authors interpret their findings with appropriate caution.

      This work makes a valuable contribution to our understanding of the evolution of complementary sex determination (CSD) in ants. In particular, it provides important evidence for the ancient origin of a non-coding locus implicated in sex determination, and shows that, remarkably, this sex locus is conserved even in an ant species with a non-canonical reproductive system that typically does not produce males. I found this to be an excellent and well-rounded study, carefully analyzed and well contextualized.

      That said, I do have a few minor comments, primarily concerning the discussion of the potential 'ghost' CSD locus. While the authors acknowledge (line 367) that they currently have no data to distinguish among the alternative hypotheses, I found the evidence for an additional CSD locus presented in the results (lines 261-302) somewhat limited and at times a bit difficult to follow. I wonder whether further clarification or supporting evidence could already be extracted from the existing data. Specifically:

      We agree with Reviewer 2 that the evidence for a second CSD locus is limited. In fact, we do not intend to advocate for there being a second locus, but we suggest that a second CSD locus is one possible explanation for the absence of diploid males outside of clonal line A. In our initial version, we intentionally conveyed this ambiguity by titling this section “O. biroi may have one or multiple sex determination loci.” However, we now see that this leads to undue emphasis on the possibility of a second locus. In the revised manuscript, we will split this into two separate sections: “Diploid male production differs across O. biroi clonal lines” and “O. biroi lacks a tra-containing CSD locus.”

      (1) Line 268: I doubt the relevance of comparing the proportion of diploid males among all males between lines A and B to infer the presence of additional CSD loci. Since the mechanisms producing these two types of males differ, it might be more appropriate to compare the proportion of diploid males among all diploid offspring. This ratio has been used in previous studies on CSD in Hymenoptera to estimate the number of sex loci (see, for example, Cook 1993, de Boer et al. 2008, 2012, Ma et al. 2013, and Chen et al., 2021). The exact method might not be applicable to clonal raider ants, but I think comparing the percentage of diploid males among the total number of (diploid) offspring produced between the two lineages might be a better argument for a difference in CSD loci number.

      We want to re-emphasize here that we do not wish to advocate for there being two CSD loci in O. biroi. Rather, we want to explain that this is one possible explanation for the apparent absence of diploid males outside of clonal line A. We hope that the modifications to the manuscript described in the previous response help to clarify this.

      Reviewer 2 is correct that comparing the number of diploid males to diploid females does not apply to clonal raider ants. This is because males are vanishingly rare among the vast numbers of females produced. We do not count how many females are produced in laboratory stock colonies, and males are sampled opportunistically. Therefore, we cannot report exact numbers. However, we will add the highlighted portion of the following sentence (in what will be lines 268-270) to the revised manuscript: “Despite the fact that we maintain more colonies of clonal line B than of clonal line A in the lab, all the diploid males we detected came from clonal line A.”

      (2) If line B indeed carries an additional CSD locus, one would expect that some females could be homozygous at the ANTSR locus but still viable, being heterozygous only at the other locus. Do the authors detect any females in line B that are homozygous at the ANTSR locus? If so, this would support the existence of an additional, functionally independent CSD locus.

      We thank the reviewer for this suggestion, and again we emphasize that we do not want to argue in favor of multiple CSD loci. We just want to introduce it as one possible explanation for the absence of diploid males outside of clonal line A.

      The 26 sequenced diploid females from clonal line B are all heterozygous at the mapped locus, and the revised manuscript will clarify this in what will be lines 199-201. Previously, only six of those diploid females were included in Supplementary Table S2, and that will be modified accordingly.

      (3) Line 281: The description of the two tra-containing CSD loci as "conserved" between Vollenhovia and the honey bee may be misleading. It suggests shared ancestry, whereas the honey bee csd gene is known to have arisen via a relatively recent gene duplication from fem/tra (10.1038/nature07052). It would be more accurate to refer to this similarity as a case of convergent evolution rather than conservation.

      In the sentence that Reviewer 2 refers to, we are representing the assertion made in the (Miyakawa and Mikheyev 2015) paper in which, regarding their mapping of a candidate CSD locus that contains two linked tra homologs, they write in the abstract: “these data support the prediction that the same CSD mechanism has indeed been conserved for over 100 million years.” In that same paper, Miyakawa and Mikheyev write in the discussion section: “As ants and bees diverged more than 100 million years ago, sex determination in honey bees and V. emeryi is probably homologous and has been conserved for at least this long.”

      As noted by Reviewer 2, this appears to conflict with a previously advanced hypothesis: because fem and csd were found in Apis mellifera, Apis cerana, and Apis dorsata, but only fem was found in Melipona compressipes, Bombus terrestris, and Nasonia vitripennis, the csd gene is thought to have evolved after the honeybee (Apis) lineage diverged from other bees (Hasselmann et al. 2008). However, it remains possible that the csd gene evolved after ants and bees diverged from N. vitripennis, but before the divergence of ants and bees, and then was subsequently lost in B. terrestris and M. compressipes. This view was previously put forward based on bioinformatic identification of putative orthologs of csd and fem in bumblebees and in ants [(Schmieder et al. 2012), see also (Privman et al. 2013)]. However, subsequent work disagreed and argued that the duplications of tra found in ants and in bumblebees represented convergent evolution rather than homology (Koch et al. 2014). Distinguishing between these possibilities will be aided by additional sex determination locus mapping studies and functional dissection of the underlying molecular mechanisms in diverse Aculeata.

      Distinguishing between these competing hypotheses is beyond the scope of our paper, but the revised manuscript will include additional text to incorporate some of this nuance. We will include these modified lines below (in what will be lines 287-295), with the additions highlighted:

      “A second QTL region identified in V. emeryi (V.emeryiCsdQTL1) contains two closely linked tra homologs, similar to the closely linked honeybee tra homologs, csd and fem (Miyakawa and Mikheyev 2015). This, along with the discovery of duplicated tra homologs that undergo concerted evolution in bumblebees and ants (Schmieder et al. 2012; Privman et al. 2013) has led to the hypothesis that the function of tra homologs as CSD loci is conserved with the csd-containing region of honeybees (Schmieder et al. 2012; Miyakawa and Mikheyev 2015). However, other work has suggested that tra duplications occurred independently in honeybees, bumblebees, and ants (Hasselmann et al. 2008; Koch et al. 2014), and it remains to be demonstrated that either of these tra homologs acts as a primary CSD signal in V. emeryi.”

      (4) Finally, since the authors successfully identified multiple alleles of the first CSD locus using previously sequenced haploid males, I wonder whether they also observed comparable allelic diversity at the candidate second CSD locus. This would provide useful supporting evidence for its functional relevance.

      As is already addressed in the final paragraph of the results and in Supplementary Fig. S4, there is no peak of nucleotide diversity in any of the regions homologous to V.emeryiQTL1, which is the tra-containing candidate sex determination locus (Miyakawa and Mikheyev 2015). In the revised manuscript, the relevant lines will be 307-310. We want to restate that we do not propose that there is a second candidate CSD locus in O. biroi, but we simply raise the possibility that multi-locus CSD *might* explain the absence of diploid males from clonal lines other than clonal line A (as one of several alternative possibilities).

      Overall, these are relatively minor points in the context of a strong manuscript, but I believe addressing them would improve the clarity and robustness of the authors' conclusions.

      Reviewer #3 (Public review):

      Summary:

      The sex determination mechanism governed by the complementary sex determination (CSD) locus is one of the mechanisms that support the haplodiploid sex determination system evolved in hymenopteran insects. While many ant species are believed to possess a CSD locus, it has only been specifically identified in two species. The authors analyzed diploid females and the rarely occurring diploid males of the clonal ant Ooceraea biroi and identified a 46 kb CSD candidate region that is consistently heterozygous in females and predominantly homozygous in males. This region was found to be homologous to the CSD locus reported in distantly related ants. In the Argentine ant, Linepithema humile, the CSD locus overlaps with an lncRNA (ANTSR) that is essential for female development and is associated with the heterozygous region (Pan et al. 2024). Similarly, an lncRNA is encoded near the heterozygous region within the CSD candidate region of O. biroi. Although this lncRNA shares low sequence similarity with ANTSR, its potential functional involvement in sex determination is suggested. Based on these findings, the authors propose that the heterozygous region and the adjacent lncRNA in O. biroi may trigger female development via a mechanism similar to that of L. humile. They further suggest that the molecular mechanisms of sex determination involving the CSD locus in ants have been highly conserved for approximately 112 million years. This study is one of the few to identify a CSD candidate region in ants and is particularly noteworthy as the first to do so in a parthenogenetic species.

      Strengths:

      (1) The CSD candidate region was found to be homologous to the CSD locus reported in distantly related ant species, enhancing the significance of the findings.

      (2) Identifying the CSD candidate region in a parthenogenetic species like O. biroi is a notable achievement and adds novelty to the research.

      Weaknesses

      (1) Functional validation of the lncRNA's role is lacking, and further investigation through knockout or knockdown experiments is necessary to confirm its involvement in sex determination.

      See response below.

      (2) The claim that the lncRNA is essential for female development appears to reiterate findings already proposed by Pan et al. (2024), which may reduce the novelty of the study.

      We do not claim that the lncRNA is essential for female development in O. biroi, but simply mention the possibility that, as in L. humile, it is somehow involved in sex determination. We do not have any functional evidence for this, so this is purely based on its genomic position immediately adjacent to our mapped candidate region. We agree with the reviewer that the study by Pan et al. (2024) decreases the novelty of our findings. Another way of looking at this is that our study supports and bolsters previous findings by partially replicating the results in a different species.

      Recommendations for the authors:

      Reviewer #1 (Recommendations for the authors):

      L307-308 should state homozygous for either allele in THE MAJORITY of diploid males.

      This will be fixed in the revised manuscript, in what will be line 321.

      Reviewer #3 (Recommendations for the authors):

      The association between heterozygosity in the CSD candidate region and female development in O. biroi, along with the high sequence homology of this region to CSD loci identified in two distantly related ant species, is not sufficient to fully address the evolution of the CSD locus and the mechanisms of sex determination.

      Given that functional genetic tools, such as genome editing, have already been established in O. biroi, I strongly recommend that the authors investigate the role of the lncRNA through knockout or knockdown experiments and assess its impact on the sex-specific splicing pattern of the downstream tra gene.

      Although knockout experiments of the lncRNA would be illuminating, the primary signal of complementary sex determination is heterozygosity. As is clearly stated in our manuscript and that of (Pan et al. 2024), it does not appear to be heterozygosity within the lncRNA that induces female development, but rather heterozygosity in non-transcribed regions linked to the lncRNA. Therefore, future mechanistic studies of sex determination in O. biroi, L. humile, and other ants should explore how homozygosity or heterozygosity of this region impacts the sex determination cascade, rather than focusing (exclusively) on the lncRNA.

      With this in mind, we developed three sets of guide RNAs that cut only one allele within the mapped CSD locus, with the goal of producing deletions within the highly variable region within the mapped locus. This would lead to functional hemizygosity or homozygosity within this region, depending on how the cuts were repaired. We also developed several sets of PCR primers to assess the heterozygosity of the resultant animals. After injecting 1,162 eggs over several weeks and genotyping the hundreds of resultant animals with PCR, we confirmed that we could induce hemizygosity or homozygosity within this region, at least in ~1/20 of the injected embryos. Although it is possible to assess the sex-specificity of the splice isoform of tra as a proxy for sex determination phenotypes (as done by (Pan et al. 2024)), the ideal experiment would assess male phenotypic development at the pupal stage. Therefore, over several more weeks, we injected hundreds more eggs with these reagents and reared the injected embryos to the pupal stage. However, substantial mortality was observed, with only 12 injected eggs developing to the pupal stage. All of these were female, and none of them had been successfully mutated.

      In conclusion, we agree with the reviewer that functional experiments would be useful, and we made extensive attempts to conduct such experiments. However, these experiments turned out to be extremely challenging with the currently available protocols. Ultimately, we therefore decided to abandon these attempts.  

      We opted not to include these experiments in the paper itself because we cannot meaningfully interpret their results. However, we are pleased that, in this response letter, we can include a brief description for readers interested in attempting similar experiments.

      Since O. biroi reproduces parthenogenetically and most offspring develop into females, observing a shift from female- to male-specific splicing of tra upon early embryonic knockout of the lncRNA would provide much stronger evidence that this lncRNA is essential for female development. Without such functional validation, the authors' claim (lines 36-38) seems to reiterate findings already proposed by Pan et al. (2024) and, as such, lacks sufficient novelty.

      We have responded to the issue of “lack of novelty” above. But again, the actual CSD locus in both O. biroi and L. humile appears to be distinct from (but genetically linked to) the lncRNA, and we have no experimental evidence that the putative lncRNA in O. biroi is involved in sex determination at all. Because of this, and given the experimental challenges described above, we do not currently intend to pursue functional studies of the lncRNA.

      References

      Hasselmann M, Gempe T, Schiøtt M, Nunes-Silva CG, Otte M, Beye M. 2008. Evidence for the evolutionary nascence of a novel sex determination pathway in honeybees. Nature 454:519–522.

      Koch V, Nissen I, Schmitt BD, Beye M. 2014. Independent Evolutionary Origin of fem Paralogous Genes and Complementary Sex Determination in Hymenopteran Insects. PLOS ONE 9:e91883.

      Matthey-Doret C, van der Kooi CJ, Jeffries DL, Bast J, Dennis AB, Vorburger C, Schwander T. 2019. Mapping of multiple complementary sex determination loci in a parasitoid wasp. Genome Biology and Evolution 11:2954–2962.

      Miyakawa MO, Mikheyev AS. 2015. QTL mapping of sex determination loci supports an ancient pathway in ants and honey bees. PLOS Genetics 11:e1005656.

      Pan Q, Darras H, Keller L. 2024. LncRNA gene ANTSR coordinates complementary sex determination in the Argentine ant. Science Advances 10:eadp1532.

      Privman E, Wurm Y, Keller L. 2013. Duplication and concerted evolution in a master sex determiner under balancing selection. Proceedings of the Royal Society B: Biological Sciences 280:20122968.

      Rabeling C, Kronauer DJC. 2012. Thelytokous parthenogenesis in eusocial Hymenoptera. Annual Review of Entomology 58:273–292.

      Schmieder S, Colinet D, Poirié M. 2012. Tracing back the nascence of a new sex-determination pathway to the ancestor of bees and ants. Nature Communications 3:1–7.

      Vorburger C. 2013. Thelytoky and Sex Determination in the Hymenoptera: Mutual Constraints. Sexual Development 8:50–58.

    1. Author response:

      The following is the authors’ response to the previous reviews

      Reviewer #1 (Public review): 

      Jiang et al. present a measure of phenological lag by quantifying the effects of abiotic constraints on the differences between observed and expected phenological changes, using a combination of previously published phenology change data for 980 species, and associated climate data for study sites. They found that, across all samples, observed phenological responses to climate warming were smaller than expected responses for both leafing and flowering spring events. They also show that data from experimental studies included in their analysis exhibited increased phenological lag compared to observational studies, possibly as a result of reduced sensitivity to climatic changes. Furthermore, the authors present evidence that spatial trends in phenological responses to warming may differ from what would be expected from phenological sensitivity, due to the seasonal timing of when warming occurs. Thus, climate change may not result in geographic convergences of phenological responses. This study presents an interesting way to separate the individual effects of climate change and other abiotic changes on the phenological responses across sites and species. 

      Strengths: 

      A straightforward mathematical definition of phenological lag allows for this method to potentially be applied in different geographic contexts. Where data exists, other researchers can partition the effects of various abiotic forcings on phenological responses that differ from those expected from warming sensitivity alone. 

      Identifying phenological lag, and associated contributing factors, provides a method by which more nuanced predictions of phenological responses to climate change can be made. Thus, this study could improve ecological forecasting models. 

      Weaknesses: 

      The analysis here could be more robust. A more thorough examination of phenological lag would provide stronger evidence that the framework presented has utility. The differences in phenological lag by study approach, species origin, region, and growth form are interesting, and could be expanded. For example, the authors have the data to explore the relationships between phenological lag and the quantitative variables included in the final model (altitude, latitude, mean annual temperature) and other spatial or temporal variables. This would also provide stronger evidence for the authors' claims about potential mechanisms that contribute to phenological lag. 

      We did examine the relationships of phenological lag with geographic or climatic variables in our analyses. Other than the weak correlations with latitude and altitude cited in the discussion section (lines 292-293), phenological lag was not related to mean annual temperature or long-term precipitation for both leafing and flowering.  

      The authors include very few data visualizations, and instead report results and model statistics in tables. This is difficult to interpret and may obscure underlying patterns in the data. Including visual representations of variable distributions and between-variable relationships, in addition to model statistics, provides stronger evidence than model statistics alone. 

      Table 2 shows the influences of geographic or climatic variables, particularly those related to drivers of spring phenology, i.e., budburst temperature, forcing change, and phenological lag, on phenological changes. As quantitative contributions of these drivers have been extracted, the influences of remaining variables are either minor or insignificant. Thus, examination of variable distributions, which has been done in previous syntheses, is probably not necessary.         

      Some of the independent variables were apparently correlated (r <0.6), e.g., MAT with altitude and latitude, budburst temperature with forcing change and spring warming, and forcing change with spring warming.

      Reviewer #3 (Public review): 

      Summary: 

      The authors developed a new phenological lag metric and applied this analytical framework to a global dataset to synthesize shifts in spring phenology and assess how abiotic constraints influence spring phenology. 

      Strengths: 

      The dataset developed in this study is extensive, and the phenological lag metric is valuable. 

      Weaknesses: 

      The stability of the method used in this study needs improvement, particularly in the calculation of forcing requirements. In addition, the visualization of the results (such as Table 1) should be enhanced. 

      It is not clear to us how the calculation of forcing accumulation could be improved.

      Recommendations for the authors: 

      Editor (Recommendations for the authors): 

      To improve the robustness of the metric and the conclusions drawn, we recommend that the authors: 

      Test the sensitivity of their results to different base temperature thresholds and to nonlinear forcing response models, even for a subset of species. The proposed framework relies on an accurate understanding of species-specific thermal responses, which remain poorly resolved.

      Different above-zero base temperatures have been used previously, although the justifications mostly follow previous work. As we indicated in our first response, the use of above-zero base temperatures underestimates forcing from low temperatures, which has a greater impact on species with early spring phenology or in areas of cold climate, because a greater proportion of their forcing accumulation comes from low temperatures. The use of high base temperatures can lead to an interpretation that early season species require little or no forcing to break buds, which is biologically incorrect. Thus, the use of above-zero base temperatures would be more appropriate for particular locations or species than for meta-analysis across different spring phenology and climatic conditions. 
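
      The effect of the base temperature on accumulated forcing can be illustrated with a minimal degree-day sketch. It assumes forcing is the sum of daily mean temperatures above a base temperature; the function name, the temperature series, and the 0 °C and 5 °C thresholds are hypothetical values chosen for illustration and are not taken from the manuscript.

```python
def forcing_degree_days(daily_mean_temps, base_temp=0.0):
    """Accumulate forcing as degree-days above base_temp (illustrative definition)."""
    return sum(max(t - base_temp, 0.0) for t in daily_mean_temps)


# Hypothetical daily mean temperatures (deg C) for a cool early-spring period.
cool_spring = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]

# A 0 deg C base counts every day; a 5 deg C base ignores the first five days.
forcing_base0 = forcing_degree_days(cool_spring, base_temp=0.0)
forcing_base5 = forcing_degree_days(cool_spring, base_temp=5.0)

print(forcing_base0, forcing_base5)  # 21.0 1.0
```

      In this toy series nearly all of the forcing counted with a 0 °C base disappears with a 5 °C base, which is the kind of underestimation for early-season species or cold sites described above.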

      Few studies have applied multiple levels of warming (mostly one and occasionally two levels), which limits the assessment of non-linear forcing responses. This can be the focus of future work.  

      Our framework is based on drivers of spring phenology and is not dependent on “accurate understanding of species-specific thermal responses”. However, a good understanding of species- and site-specific responses to phenological constraints (e.g., insufficient winter chilling, photoperiod, and environmental stresses) does help determine the nature of phenological lag. All these are explained in our paper.     

      Analyze relationships between phenological lag and additional geographic or climatic gradients already available in the dataset (e.g., latitude, mean annual temperature, interannual variability). 

      We did examine the relationships of phenological lag with geographic or climatic variables in our analyses. Other than the weak correlations with latitude and altitude cited in the discussion section (lines 292-293), phenological lag was not related to mean annual temperature or long-term precipitation for both leafing and flowering.  

      Our objective is to understand changes in spring phenology and differences in plant phenological responses across different functional groups or climatic regions, although our approach can be used to study interannual variability of spring phenology. Our metadata are compiled for comparing warmer vs control treatments (often multiyear averages), not for assessing interannual variability.      

      Improve data visualization to better convey how phenological lag varies across environmental and biological contexts. 

      See responses above.

      Consider incorporating explicit uncertainty estimates around phenological lag calculations.  These steps would improve the interpretability and generalizability of the framework, helping it move from a useful heuristic to a more robust and empirically grounded analytical tool. 

      The calculation of phenological lag is based on drivers of spring phenology with uncertainty depending primarily on uncertainty in phenological observations. Previous uncertainty assessments can be used here (see a few selected studies below).   

      Alles, G.R., Comba, J.L., Vincent, J.M., Nagai, S. and Schnorr, L.M., 2020. Measuring phenology uncertainty with large scale image processing. Ecological Informatics, 59, p.101109.

      Liu, G., Chuine, I., Denéchère, R., Jean, F., Dufrêne, E., Vincent, G., Berveiller, D. and Delpierre, N., 2021. Higher sample sizes and observer intercalibration are needed for reliable scoring of leaf phenology in trees. Journal of Ecology, 109(6), pp.2461-2474.

      Tang, J., Körner, C., Muraoka, H., Piao, S., Shen, M., Thackeray, S.J. and Yang, X., 2016. Emerging opportunities and challenges in phenology: a review. Ecosphere, 7(8), p.e01436. 

      Nagai, S., Inoue, T., Ohtsuka, T., Yoshitake, S., Nasahara, K.N. and Saitoh, T.M., 2015. Uncertainties involved in leaf fall phenology detected by digital camera. Ecological Informatics, 30, pp.124-132.

    1. Author response:

      The following is the authors’ response to the previous reviews

      Reviewer#1 (Public Review):

      In the current article, Octavia Soegyono and colleagues study "The influence of nucleus accumbens shell D1 and D2 neurons on outcome-specific Pavlovian instrumental transfer", building on extensive findings from the same lab. While there is a consensus about the specific involvement of the Shell part of the Nucleus Accumbens (NAc) in specific stimulus-based actions in choice settings (and not in General Pavlovian instrumental transfer - gPIT, as opposed to the Core part of the NAc), mechanisms at the cellular and circuitry levels remain to be explored. In the present work, using sophisticated methods (rat Cre-transgenic lines from both sexes, optogenetics and the well-established behavioral paradigm outcome-specific PIT - sPIT), Octavia Soegyono and colleagues decipher the differential contribution of dopamine receptors D1 and D2 expressing-spiny projection neurons (SPNs).

      After validating the viral strategy and the specificity of the targeting (immunochemistry and electrophysiology), the authors demonstrate that while both NAc Shell D1- and D2-SPNs participate in mediating sPIT, NAc Shell D1-SPNs projections to the Ventral Pallidum (VP, previously demonstrated as crucial for sPIT), but not D2-SPNs, mediate sPIT. They also show that these effects were specific to stimulus-based actions, as value-based choices were left intact in all manipulations.

      This is a well-designed study and the results are well supported by the experimental evidence. The paper is extremely pleasant to read and adds to the current literature.

      We thank the Reviewer for their positive assessment.

      Comments on revisions:  

      We thank the authors for their detailed responses and for addressing our comments and concerns.

      To further improve consistency and transparency, we kindly request that the authors provide, for Supplemental Figures S1-S4, panels E (raw data for lever presses during the PIT test), the individual data points together with the corresponding statistical analyses in the figure legends.

      Panel E of Figures S1-S4 now includes the individual data points. The outcome-specific data have already been analysed, and we report these analyses in the main manuscript. These analyses are more informative than those requested by the Reviewer since they report the net effects of the stimuli on choice between actions while controlling for potential individual baseline instrumental performance. All data remain fully transparent and are publicly available on an online repository in accordance with eLife policies (see relevant section in Materials and Methods).  

      In addition, regarding Supplemental Figure S3, panel E, we note the absence of a PIT effect in the eYFP group under the ON condition, which appears to differ from the net response reported in the main Figure 5, panel B. Could the authors clarify this apparent discrepancy?

      We apologize for the error, which has now been corrected. 

      We also note a discrepancy between the authors' statement in their response ("40 rats excluded based on post-mortem analyses") and the number of excluded animals reported in the Materials and Methods section, which adds up to 47. We kindly ask the authors to clarify this point for consistency.

      We thank the Reviewer for identifying the error reported in our initial response. The total number of animals excluded was 47, as reported in the manuscript. 

      Finally, as a minor point, we suggest indicating the total number of animals used in the study in the Materials and Methods section.

      The total number of animals has been included in the Materials and Methods section.

      Reviewer #2 (Public Review):

      Summary:

      This manuscript by Soegyono et al. describes a series of experiments designed to probe the involvement of dopamine D1 and D2 neurons within the nucleus accumbens shell in outcome-specific Pavlovian-instrumental transfer (osPIT), a well-controlled assay of cue-guided action selection based on congruent outcome associations. They used an optogenetic approach to phasically silence NAc shell D1 (D1-Cre mice) or D2 (A2a-Cre mice) neurons during a subset of osPIT trials. Both manipulations disrupted cue-guided action selection but had no effects on negative control measures/tasks (concomitant approach behavior, separate value-guided choice task), nor were any osPIT impairments found in reporter only control groups. Separate experiments revealed that selective inhibition of NAc shell D1 but not D2 inputs to ventral pallidum were required for osPIT expression, thereby advancing understanding of the basal ganglia circuitry underpinning this important aspect of decision making.

      Strengths:

      The combinatorial viral and optogenetic approaches used here were convincingly validated through anatomical tract-tracing and ex vivo electrophysiology. The behavioral assays are sophisticated and well-controlled to parse cue and value guided action selection. The inclusion of reporter only control groups is rigorous and rules out nonspecific effects of the light manipulation. The findings are novel and address a critical question in the literature. Prior work using less decisive methods had implicated NAc shell D1 neurons in osPIT but suggested that D2 neurons may not be involved. The optogenetic manipulations used in the current study provide a more direct test of their involvement and convincingly demonstrate that both populations play an important role. Prior work had also implicated NAc shell connections to ventral pallidum in osPIT, but the current study reveals the selective involvement of D1 but not D2 neurons in this circuit. The authors do a good job of discussing their findings, including their nuanced interpretation that NAc shell D2 neurons may contribute to osPIT through their local regulation of NAc shell microcircuitry.

      We thank the Reviewer for their positive assessment.

      Weaknesses:

      The current study exclusively used an optogenetic approach to probe the function of D1 and D2 NAc shell neurons. Providing a complementary assessment with chemogenetics or other appropriate methods would strengthen conclusions, particularly the novel demonstration for D2 NAc shell involvement. Likewise, the null result of optically inhibiting D2 inputs to ventral pallidum leaves open the possibility that a more complete or sustained disruption of this pathway may have impaired osPIT.

      We acknowledge the reviewer's valuable suggestion that demonstrating NAc-S D1- and D2-SPNs engagement in outcome-specific PIT through another technique would strengthen our optogenetic findings. Several approaches could provide this validation. Chemogenetic manipulation, as the reviewer suggested, represents one compelling option. Alternatively, immunohistochemical assessment of phosphorylated histone H3 at serine 10 (P-H3) offers another promising avenue, given its established utility in reporting striatal SPNs plasticity in the dorsal striatum (Matamales et al., 2020). We hope to complete such an assessment in future work since it would address the limitations of previous work that relied solely on ERK1/2 phosphorylation measures in NAc-S SPNs (Laurent et al., 2014). The manuscript was modified to report these future avenues of research (page 12). 

      Regarding the null result from optical silencing of D2 terminals in the ventral pallidum, we agree with the reviewer's assessment. While we acknowledge this limitation in the current manuscript (page 13), we aim to address this gap in future studies to provide a more complete mechanistic understanding of the circuit.

      Conclusions:

      The research described here was successful in providing critical new insights into the contributions of NAc D1 and D2 neurons in cue-guided action selection. The authors' data interpretation and conclusions are well reasoned and appropriate. They also provide a thoughtful discussion of study limitations and implications for future research. This research is therefore likely to have a significant impact on the field.

      We thank the Reviewer for their positive assessment.

      Comments on revisions:

      I have reviewed the rebuttal and revised manuscript and have no remaining concerns.

      We are pleased to have addressed the Reviewer’s query.

      References

      Laurent, V., Bertran-Gonzalez, J., Chieng, B. C., & Balleine, B. W. (2014). δ-Opioid and Dopaminergic Processes in Accumbens Shell Modulate the Cholinergic Control of Predictive Learning and Choice. J Neurosci, 34(4), 1358-1369. https://doi.org/10.1523/JNEUROSCI.4592-13.2014

      Matamales, M., McGovern, A. E., Mi, J. D., Mazzone, S. B., Balleine, B. W., & Bertran-Gonzalez, J. (2020). Local D2- to D1-neuron transmodulation updates goal-directed learning in the striatum. Science, 367(6477), 549-555. https://doi.org/10.1126/science.aaz5751

    1. Author response:

      The following is the authors’ response to the original reviews.

      Reviewer #1 (Public review): 

      This paper describes a number of patterns of epistasis in a large fitness landscape dataset recently published by Papkou et al. The paper is motivated by an important goal in the field of evolutionary biology to understand the statistical structure of epistasis in protein fitness landscapes, and it capitalizes on the unique opportunities presented by this new dataset to address this problem. 

      The paper reports some interesting previously unobserved patterns that may have implications for our understanding of fitness landscapes and protein evolution. In particular, Figure 5 is very intriguing. However, I have two major concerns detailed below. First, I found the paper rather descriptive (it makes little attempt to gain deeper insights into the origins of the observed patterns) and unfocused (it reports what appears to be a disjointed collection of various statistics without a clear narrative). Second, I have concerns with the statistical rigor of the work.

      (1) I think Figures 5 and 7 are the main, most interesting, and novel results of the paper. However, I don't think that the statement "Only a small fraction of mutations exhibit global epistasis" accurately describes what we see in Figure 5. To me, the most striking feature of this figure is that the effects of most mutations at all sites appear to be a mixture of three patterns. The most interesting pattern noted by the authors is of course the "strong" global epistasis, i.e., when the effect of a mutation is highly negatively correlated with the fitness of the background genotype. The second pattern is a "weak" global epistasis, where the correlation with background fitness is much weaker or non-existent. The third pattern is the vertically spread-out cluster at low-fitness backgrounds, i.e., a mutation has a wide range of mostly positive effects that are clearly not correlated with fitness. What is very interesting to me is that all background genotypes fall into these three groups with respect to almost every mutation, but the proportions of the three groups are different for different mutations. In contrast to the authors' statement, it seems to me that almost all mutations display strong global epistasis in at least a subset of backgrounds. A clear example is C>A mutation at site 3. 

      (1a) I think the authors ought to try to dissect these patterns and investigate them separately rather than lumping them all together and declaring that global epistasis is rare. For example, I would like to know whether those backgrounds in which mutations exhibit strong global epistasis are the same for all mutations or whether they are mutation- or perhaps position-specific. Both answers could be potentially very interesting, either pointing to some specific site-site interactions or, alternatively, suggesting that the statistical patterns are conserved despite variation in the underlying interactions.

      (1b) Another rather remarkable feature of this plot is that the slopes of the strong global epistasis patterns seem to be very similar across mutations. Is this the case? Is there anything special about this slope? For example, does this slope simply reflect the fact that a given mutation becomes essentially lethal (i.e., produces the same minimal fitness) in a certain set of background genotypes? 

      (1c) Finally, how consistent are these patterns with some null expectations? Specifically, would one expect the same distribution of global epistasis slopes on an uncorrelated landscape? Are the pivot points unusually clustered relative to an expectation on an uncorrelated landscape? 

      (1d) The shapes of the DFE shown in Figure 7 are also quite interesting, particularly the bimodal nature of the DFE in high-fitness (HF) backgrounds. I think this bimodality must be a reflection of the clustering of mutation-background combinations mentioned above. I think the authors ought to draw this connection explicitly. Do all HF backgrounds have a bimodal DFE? What mutations occupy the "moving" peak? 

      (1e) In several figures, the authors compare the patterns for HF and low-fitness (LF) genotypes. In some cases, there are some stark differences between these two groups, most notably in the shape of the DFE (Figure 7B, C). But there is no discussion about what could underlie these differences. Why are the statistics of epistasis different for HF and LF genotypes? Can the authors at least speculate about possible reasons? Why do HF and LF genotypes have qualitatively different DFEs? I actually don't quite understand why the transition between bimodal DFE in Figure 7B and unimodal DFE in Figure 7C is so abrupt. Is there something biologically special about the threshold that separates LF and HF genotypes? My understanding was that this was just a statistical cutoff. Perhaps the authors can plot the DFEs for all backgrounds on the same plot and just draw a line that separates HF and LF backgrounds so that the reader can better see whether the DFE shape changes gradually or abruptly.

      (1f) The analysis of the synonymous mutations is also interesting. However, I think a few additional analyses are necessary to clarify what is happening here. I would like to know the extent to which synonymous mutations are more often neutral compared to non-synonymous ones. Then, do synonymous pairs interact in the same way as non-synonymous pairs (i.e., plot Figure 1 for synonymous pairs)? Do synonymous or non-synonymous mutations that are neutral exhibit less epistasis than non-neutral ones? Finally, do non-synonymous mutations alter epistasis among other mutations more often than synonymous mutations do? What about synonymous-neutral versus synonymous-non-neutral? Basically, I'd like to understand the extent to which a mutation that is neutral in a given background is more or less likely to alter epistasis between other mutations than a non-neutral mutation in the same background.

      (2) I have two related methodological concerns. First, in several analyses, the authors employ thresholds that appear to be arbitrary. And second, I did not see any account of measurement errors. For example, the authors chose the 0.05 threshold to distinguish between epistasis and no epistasis, but why this particular threshold was chosen is not justified. Another example: whether the product s12 × (s1 + s2) is greater or smaller than zero for any given pair of mutations is uncertain due to measurement errors. Presumably, how to classify each pair of mutations should depend on the precision with which the fitness of mutants is measured. These thresholds could well be different across mutants. We know, for example, that low-fitness mutants typically have noisier fitness estimates than high-fitness mutants. I think the authors should use a statistically rigorous procedure to categorize mutations and their epistatic interactions. I think it is very important to address this issue. I got very concerned about it when I saw on LL 383-388 that synonymous stop codon mutations appear to modulate epistasis among other mutations. This seems very strange to me and makes me quite worried that this is a result of noise in LF genotypes.

      Thank you for your review of the manuscript. In the revised version, we have addressed both major criticisms, as detailed below.

      When carefully examining the plots in Figure 5 independently, we indeed observe that the fitness effect of a mutation on different genetic backgrounds can be classified into three characteristic patterns. Our reasoning for these patterns is as follows:

      Strong correlation: Typically observed when the mutation is lethal across backgrounds. Linear regression of mutations exhibiting strong global epistasis shows slopes close to −1 and pivot points near −0.7 (Table S4). Since the reported fitness threshold is −0.508, these mutations push otherwise functional backgrounds into the non-functional range, consistent with lethal effects.

      Weak correlation: Observed when a mutation has no significant effect on fitness across backgrounds, consistent with neutrality.

      No correlation: Out of the 261,333 reported variants, 243,303 (93%) lie below the fitness threshold of −0.508, indicating that the low-fitness region is densely populated by nonfunctional variants. The “strong correlation” and “weak correlation” lines intersect in this zone. Most mutations in this region have little effect (neutral), but occasional abrupt fitness increases correspond to “resurrecting” mutations, the converse of lethal changes. For example, mutations such as X→G at locus 4 or X→A at locus 5 restore function, while the reverse changes (e.g. C→A at locus 3) are lethal.

      Thus, the “no-correlation” pattern is largely explained by mutations that reverse the effect of lethal changes, effectively resurrecting non-functional variants. In the revised manuscript, we highlight these nuances within the broader classification of fitness effect versus background fitness (pp. 10–13).
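
      The "strong correlation" pattern described above can be illustrated with a minimal sketch. It assumes that the pivot point is the background fitness at which the fitted mutational effect crosses zero, and that a lethal mutation sends every background to the same floor fitness. The function name `fit_global_epistasis`, the floor value, and the toy backgrounds are hypothetical and are not taken from the authors' code.

```python
import numpy as np

def fit_global_epistasis(background_fitness, mutation_effect):
    """Least-squares line of mutation effect against background fitness.

    Returns (slope, pivot, r_squared), where pivot is the background
    fitness at which the fitted effect equals zero (assumed definition).
    """
    x = np.asarray(background_fitness, dtype=float)
    y = np.asarray(mutation_effect, dtype=float)
    slope, intercept = np.polyfit(x, y, 1)
    r = np.corrcoef(x, y)[0, 1]
    pivot = -intercept / slope if slope != 0 else float("nan")
    return slope, pivot, r ** 2

# Toy data (hypothetical): a lethal mutation brings every background
# to the same floor fitness, so its effect is floor - background.
floor = -0.7
backgrounds = np.linspace(-0.5, 0.4, 10)
effects = floor - backgrounds

slope, pivot, r2 = fit_global_epistasis(backgrounds, effects)
print(f"slope={slope:.2f}, pivot={pivot:.2f}, R^2={r2:.2f}")
```

      Under these assumptions the fitted slope is −1 and the pivot coincides with the floor, which is one way to read slopes close to −1 together with pivot points near −0.7.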

      Additional analyses included in the revision:

      Synonymous vs. non-synonymous pairs: We repeated the Figure 1 analysis for synonymous–synonymous pairs. As expected, synonymous pairs exhibit lower overall frequencies of epistasis, consistent with their greater neutrality. However, the qualitative spectrum remains similar: positive and negative epistasis dominate, while sign epistasis is rare (Supplementary Figs. S6–S7, S9–S10).

      Fitness effect vs. epistasis change: We tested whether the mean fitness effect of a mutation correlates with the percent of cases in which it changes the nature of epistasis. No correlation was found (R² ≈ 0.11), and this analysis is now included in the revised manuscript.

      Epistasis-modulating ability: Non-synonymous mutations more frequently alter the interactions between other mutations than synonymous substitutions. Within synonymous substitutions, the subset with measurable fitness effects disproportionately contributes to epistasis modulation. Thus, the ability of synonymous substitutions to modulate epistasis arises primarily from the non-neutral subset.

      These analyses clarify the role of synonymous mutations in reshaping epistasis on the folA landscape.

      Revision of statistical treatment of epistasis:

      In our original submission, we used an arbitrary threshold of 0.05 to classify the presence or absence of epistasis, following Papkou et al., who based conclusions on a single experimental replicate. However, as the reviewer correctly noted, this does not adequately account for measurement variability across different genotypes.

      In the revised manuscript, we adopt a statistically rigorous framework that incorporates replicate-based error directly. Specifically, we now use the mean fitness across six independent replicates, together with the corresponding standard deviation, to classify fitness peaks and epistasis. This eliminates arbitrary thresholds and ensures that epistatic classifications reflect the precision of measurements for each genotype.

      This revision led to both quantitative and qualitative changes:

      For high-fitness genotypes, the core patterns of higher-order (“fluid”) epistasis remain robust (Figures 2–3).

      For low-fitness genotypes, incorporating replicate-based error removed spurious fluidity effects, yielding a more accurate characterization of epistasis (Figures 2–3; Supplementary Figs. S6–S7, S9–S10).

      We describe these methodological changes in detail in the revised Methods section and provide updated code.

      Together, these revisions directly address the reviewer’s concerns. They improve the statistical rigor of our analysis, strengthen the robustness of our conclusions, and underscore the importance of accounting for measurement error in large-scale fitness landscape studies—a point we now emphasize in the manuscript.

      Reviewer #2 (Public review): 

      Significance: 

      This paper reanalyzes an experimental fitness landscape generated by Papkou et al., who assayed the fitness of all possible combinations of 4 nucleotide states at 9 sites in the E. coli DHFR gene, which confers antibiotic resistance. The 9 nucleotide sites make up 3 amino acid sites in the protein, of which one was shown to be the primary determinant of fitness by Papkou et al. This paper sought to assess whether pairwise epistatic interactions differ among genetic backgrounds at other sites and whether there are major patterns in any such differences. They use a "double mutant cycle" approach to quantify pairwise epistasis, where the epistatic interaction between two mutations is the difference between the measured fitness of the double-mutant and its predicted fitness in the absence of epistasis (which equals the sum of individual effects of each mutation observed in the single mutants relative to the reference genotype). The paper claims that epistasis is "fluid," because pairwise epistatic effects often differ depending on the genetic state at the other sites. It also claims that this fluidity is "binary," because pairwise effects depend strongly on the state at nucleotide positions 5 and 6 but weakly on those at other sites. Finally, they compare the distribution of fitness effects (DFE) of single mutations for starting genotypes with similar fitness and find that despite the apparent "fluidity" of interactions this distribution is well-predicted by the fitness of the starting genotype.
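
      The double mutant cycle described above can be written out as a short sketch. It assumes fitness values are on an additive (for example, log) scale; the function names and the toy numbers are hypothetical and only illustrate the arithmetic.

```python
def pairwise_epistasis(f_ref, f_a, f_b, f_ab):
    """Epistasis from a double mutant cycle on an additive fitness scale.

    f_ref: fitness of the reference genotype
    f_a, f_b: fitness of the two single mutants
    f_ab: fitness of the double mutant
    """
    s_a = f_a - f_ref                 # effect of mutation A alone
    s_b = f_b - f_ref                 # effect of mutation B alone
    expected = f_ref + s_a + s_b      # double-mutant fitness without epistasis
    return f_ab - expected

def epistasis_shift(cycle_bg1, cycle_bg2):
    """Change in pairwise epistasis for the same pair between two backgrounds."""
    return pairwise_epistasis(*cycle_bg2) - pairwise_epistasis(*cycle_bg1)

# Toy numbers (hypothetical): the pair is additive in background 1
# but shows positive epistasis in background 2.
bg1 = (0.0, -0.1, -0.2, -0.3)
bg2 = (0.0, -0.1, -0.2, -0.1)
print(pairwise_epistasis(*bg1), pairwise_epistasis(*bg2), epistasis_shift(bg1, bg2))
```

      A non-zero shift for the same pair of mutations across backgrounds is what the paper calls "fluidity"; whether such a shift exceeds measurement noise is a separate question.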

      The paper addresses an important question for genetics and evolution: how complex and unpredictable are the effects and interactions among mutations in a protein? Epistasis can make the phenotype hard to predict from the genotype and also affect the evolutionary navigability of a genotype landscape. Whether pairwise epistatic interactions depend on genetic background - that is, whether there are important high-order interactions -- is important because interactions of order greater than pairwise would make phenotypes especially idiosyncratic and difficult to predict from the genotype (or by extrapolating from experimentally measured phenotypes of genotypes randomly sampled from the huge space of possible genotypes). Another interesting question is the sparsity of such high-order interactions: if they exist but mostly depend on a small number of identifiable sequence sites in the background, then this would drastically reduce the complexity and idiosyncrasy relative to a landscape on which "fluidity" involves interactions among groups of all sites in the protein. A number of papers in the recent literature have addressed the topics of high-order epistasis and sparsity and have come to conflicting conclusions. This paper contributes to that body of literature with a case study of one published experimental dataset of high quality. The findings are therefore potentially significant if convincingly supported. 

      Validity: 

      In my judgment, the major conclusions of this paper are not well supported by the data. There are three major problems with the analysis. 

      (1) Lack of statistical tests. The authors conclude that pairwise interactions differ among backgrounds, but no statistical analysis is provided to establish that the observed differences are statistically significant, rather than being attributable to error and noise in the assay measurements. It has been established previously that the methods the authors use to estimate high-order interactions can result in inflated inferences of epistasis because of the propagation of measurement noise (see PMID 31527666 and 39261454). Error propagation can be extreme because first-order mutation effects are calculated as the difference between the measured phenotype of a single-mutant variant and the reference genotype; pairwise effects are then calculated as the difference between the measured phenotype of a double mutant and the sum of the differences described above for the single mutants. This paper claims fluidity when this latter difference itself differs when assessed in two different backgrounds. At each step of these calculations, measurement noise propagates. Because no statistical analysis is provided to evaluate whether these observed differences are greater than expected because of propagated error, the paper has not convincingly established or quantified "fluidity" in epistatic effects. 
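
      The error propagation argued above can be made concrete with a small sketch. It assumes independent measurement errors with a common standard deviation for every genotype, and that the two backgrounds share no genotypes; the value 0.02 is a hypothetical noise level, not one taken from the dataset.

```python
import math

def propagated_sd(sigma, n_measurements):
    """SD of a sum or difference of n independent measurements,
    each with standard deviation sigma."""
    return sigma * math.sqrt(n_measurements)

sigma = 0.02  # hypothetical per-genotype measurement SD

# A single-mutant effect combines 2 measurements (mutant, reference).
sd_effect = propagated_sd(sigma, 2)
# A pairwise epistasis term combines 4 (double, two singles, reference).
sd_pairwise = propagated_sd(sigma, 4)
# The difference in epistasis between two backgrounds combines 8.
sd_shift = propagated_sd(sigma, 8)

print(f"{sd_effect:.3f} {sd_pairwise:.3f} {sd_shift:.3f}")
```

      With this hypothetical noise level, the standard deviation of a between-background difference in epistasis (about 0.057) already exceeds a fixed cutoff of 0.05, even if no true interaction exists.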

      (2) Arbitrary cutoffs. Many of the analyses involve assigning pairwise interactions into discrete categories, based on the magnitude and direction of the difference between the predicted and observed phenotypes for a pairwise mutant. For example, the authors categorize as a positive pairwise interaction if the apparent deviation of phenotype from prediction is >0.05, negative if the deviation is <-0.05, and no interaction if the deviation is between these cutoffs. Fluidity is diagnosed when the category for a pairwise interaction differs among backgrounds. These cutoffs are essentially arbitrary, and the effects are assigned to categories without assessing statistical significance. For example, an interaction of 0.06 in one background and 0.04 in another would be classified as fluid, but it is very plausible that such a difference would arise due to error alone. The frequency of epistatic interactions in each category as claimed in the paper, as well as the extent of fluidity across backgrounds, could therefore be systematically overestimated or underestimated, affecting the major conclusions of the study. 

      (3) Global nonlinearities. The analyses do not consider the fact that apparent fluidity could be attributable to the fact that fitness measurements are bounded by a minimum (the fitness of cells carrying proteins in which DHFR is essentially nonfunctional) and a maximum (the fitness of cells in which some biological factor other than DHFR function is limiting for fitness). The data are clearly bounded; the original Papkou et al. paper states that 93% of genotypes are at the low-fitness limit at which deleterious effects no longer influence fitness. Because of this bounding, mutations that are strongly deleterious to DHFR function will therefore have an apparently smaller effect when introduced in combination with other deleterious mutations, leading to apparent epistatic interactions; moreover, these apparent interactions will have different magnitudes if they are introduced into backgrounds that themselves differ in DHFR function/fitness, leading to apparent "fluidity" of these interactions. This is a well-established issue in the literature (see PMIDs 30037990, 28100592, 39261454). It is therefore important to adjust for these global nonlinearities before assessing interactions, but the authors have not done this. 

      This global nonlinearity could explain much of the fluidity claimed in this paper. It could explain the observation that epistasis does not seem to depend as much on genetic background for low-fitness backgrounds, and the latter is constant (Figure 2B and 2C): these patterns would arise simply because the effects of deleterious mutations are all epistatically masked in backgrounds that are already near the fitness minimum. It would also explain the observations in Figure 7. For background genotypes with relatively high fitness, there are two distinct peaks of fitness effects, which likely correspond to neutral mutations and deleterious mutations that bring fitness to the lower bound of measurement; as the fitness of the background declines, the deleterious mutations have a smaller effect, so the two peaks draw closer to each other, and in the lowest-fitness backgrounds, they collapse into a single unimodal distribution in which all mutations are approximately neutral (with the distribution reflecting only noise). Global nonlinearity could also explain the apparent "binary" nature of epistasis. Sites 4 and 5 change the second amino acid, and the Papkou paper shows that only 3 amino acid states (C, D, and E) are compatible with function; all others abolish function and yield lower-bound fitness, while mutations at other sites have much weaker effects. The apparent binary nature of epistasis in Figure 5 corresponds to these effects given the nonlinearity of the fitness assay. Most mutations are close to neutral irrespective of the fitness of the background into which they are introduced: these are the "non-epistatic" mutations in the binary scheme. For the mutations at sites 4 and 5 that abolish one of the beneficial mutations, however, these have a strong background-dependence: they are very deleterious when introduced into a high-fitness background but their impact shrinks as they are introduced into backgrounds with progressively lower fitness. 
      The apparent "binary" nature of global epistasis is likely to be a simple artifact of bounding and the bimodal distribution of functional effects: neutral mutations are insensitive to background, while the magnitude of the fitness effect of deleterious mutations declines with background fitness because they are masked by the lower bound. The authors' statement is that "global epistasis often does not hold." This is not established. A more plausible conclusion is that global epistasis imposed by the phenotype limits affects all mutations, but it does so in a nonlinear fashion.
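
      The bounding argument can be illustrated with a toy model. The sketch assumes a purely additive latent phenotype that is clipped to a measurable range; the floor, ceiling, and effect sizes are hypothetical values chosen only for illustration.

```python
def observed_fitness(latent, floor=-0.7, ceiling=0.5):
    """Measured fitness: the latent additive phenotype clipped to the
    measurable range (floor and ceiling are hypothetical)."""
    return min(max(latent, floor), ceiling)

def apparent_epistasis(background, s_a, s_b):
    """Double mutant cycle computed on observed (bounded) fitness,
    with latent effects that are strictly additive."""
    f_ref = observed_fitness(background)
    f_a = observed_fitness(background + s_a)
    f_b = observed_fitness(background + s_b)
    f_ab = observed_fitness(background + s_a + s_b)
    return f_ab - f_a - f_b + f_ref

# Two strongly deleterious mutations with no latent interaction,
# introduced into backgrounds of decreasing fitness.
for bg in (0.3, -0.5, -0.7):
    print(bg, round(apparent_epistasis(bg, -1.0, -1.0), 3))
```

      In this toy model the same pair of mutations shows apparent epistasis of 1.0, 0.2, and 0.0 across the three backgrounds, although the latent effects never interact. This is the kind of background dependence that a lower bound alone can generate.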

      In conclusion, most of the major claims in the paper could be artifactual. Much of the claimed pairwise epistasis could be caused by measurement noise, the use of arbitrary cutoffs, and the lack of adjustment for global nonlinearity. Much of the fluidity or higher-order epistasis could be attributable to the same issues. And the apparently binary nature of global epistasis is also the expected result of this nonlinearity. 

      We thank the reviewer for raising this important concern. We fully agree that the use of arbitrary thresholds in the earlier version of the manuscript, together with the lack of an explicit treatment of measurement error, could compromise the rigor of our conclusions. To address this, we have undertaken a thorough re-analysis of the folA landscape.

      (1)  Incorporating measurement error and avoiding noise-driven artifacts

      In the original version, we followed Papkou et al. in using a single experimental replicate and applying fixed thresholds to classify epistasis. As the reviewer correctly notes, this approach allows noise to propagate from single-mutant measurements to double-mutant effects, and ultimately to higher-order epistasis.

      In the revised analysis, we now:

      Use the mean fitness across all six independent replicates for each genotype.

      Incorporate the corresponding standard deviation as a measure of experimental error.

      Classify epistatic interactions only when differences between a genotype and its neighbors exceed combined error margins, rather than using a fixed cutoff.

      This ensures that observed changes in epistasis are statistically distinguishable from noise. Details are provided in the revised Methods section and updated code.
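
      A replicate-based criterion of this kind could look like the sketch below. It is not the code supplied with the manuscript: it assumes that the combined error margin is the root sum of squares of the four replicate standard deviations, and the function names and toy replicate values are hypothetical.

```python
import numpy as np

def mean_and_sd(replicates):
    """Mean and sample standard deviation across replicate measurements."""
    r = np.asarray(replicates, dtype=float)
    return r.mean(), r.std(ddof=1)

def classify_epistasis(ref, a, b, ab):
    """Classify pairwise epistasis from replicate fitness measurements.

    An interaction is called only when its magnitude exceeds the combined
    margin sqrt(sd_ref^2 + sd_a^2 + sd_b^2 + sd_ab^2) (assumed criterion).
    """
    (m0, s0), (ma, sa), (mb, sb), (mab, sab) = [
        mean_and_sd(x) for x in (ref, a, b, ab)
    ]
    eps = mab - ma - mb + m0
    margin = np.sqrt(s0**2 + sa**2 + sb**2 + sab**2)
    if eps > margin:
        return "positive"
    if eps < -margin:
        return "negative"
    return "none"

# Six hypothetical replicates per genotype, sharing the same scatter.
noise = np.array([0.00, 0.01, -0.01, 0.02, -0.02, 0.00])
ref, a, b = noise, noise - 0.1, noise - 0.2
clear = classify_epistasis(ref, a, b, noise + 0.1)    # epistasis of 0.4
within = classify_epistasis(ref, a, b, noise - 0.29)  # epistasis of 0.01
print(clear, within)
```

      Because the margin is computed per genotype, noisier low-fitness genotypes automatically require a larger deviation before an interaction is called.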

      (2) Replacing arbitrary thresholds with error-based criteria

      Previously, we used an arbitrary ±0.05 cutoff to define the presence/absence of epistasis. As the reviewer notes, this could misclassify interactions (e.g. labeling an effect as “fluid” when the difference lies within error). In the revised framework, these thresholds have been eliminated. Instead, interactions are classified based on whether their distributions overlap within replicate variance.

      This approach scales naturally with measurement precision, which differs between high-fitness and low-fitness genotypes, and removes the need for a universal cutoff.

      (3) Consequences of re-analysis

      Implementing this revised framework produced several important updates:

      High-fitness backgrounds: The qualitative picture of higher-order (“fluid”) epistasis remains robust. The patterns reported originally are preserved.

      Low-fitness backgrounds: Accounting for replicate variance revealed that part of the previously inferred “fluidity” arose from noise. These spurious effects are now removed, giving a more conservative but more accurate view of epistasis in non-functional regions.

      Fitness peaks: Our replicate-aware analysis identifies 127 peaks, compared to 514 in Papkou et al. Importantly, all 127 peaks occur in functional regions of the landscape. This difference highlights the importance of replicate-based error treatment: relying on a single run without demonstrating repeatability can yield artifacts.

      (4) Addressing bounding effects and terminology

      We also agree with the reviewer that bounding effects, arising from the biological limits of fitness, can create apparent nonlinearities in the genotype–phenotype map. To clarify this, we made the following changes:

      Terminology: We now use the term higher-order epistasis instead of fluid epistasis, emphasizing that the observed background-dependence involves more than two mutations and cannot be explained by global nonlinearities alone.

      We also clarify the definitions of sign-epistasis used in this work.

      By replacing arbitrary cutoffs with replicate-based error estimates and by explicitly considering bounding effects, we have substantially increased the rigor of our analysis. While this reanalysis led to both quantitative and qualitative changes in some regions, the central conclusion remains unchanged: higher-order epistasis is pervasive in the folA landscape, especially in functional backgrounds.

      All analysis scripts and code are provided as Supplementary Material.

      Reviewer #3 (Public review): 

      Summary: 

      The authors have studied a previously published large dataset on the fitness landscape of a 9 base-pair region of the folA gene. The objective of the paper is to understand various aspects of epistasis in this system, which the authors have achieved through detailed and computationally expensive exploration of the landscape. The authors describe epistasis in this system as "fluid", meaning that it depends sensitively on the genetic background, thereby reducing the predictability of evolution at the genetic level. However, the study also finds two robust patterns. The first is the existence of a "pivot point" for a majority of mutations, which is a fixed growth rate at which the effect of mutations switches from beneficial to deleterious (consistent with a previous study on the topic). The second is the observation that the distribution of fitness effects (DFE) of mutations is predicted quite well by the fitness of the genotype, especially for high-fitness genotypes. While the work does not offer a synthesis of the multitude of reported results, the information provided here raises interesting questions for future studies in this field. 

      Strengths: 

      A major strength of the study is its detailed and multifaceted approach, which has helped the authors tease out a number of interesting epistatic properties. The study makes a timely contribution by focusing on topical issues like the prevalence of global epistasis, the existence of pivot points, and the dependence of DFE on the background genotype and its fitness. The methodology is presented in a largely transparent manner, which makes it easy to interpret and evaluate the results. 

      The authors have classified pairwise epistasis into six types and found that the type of epistasis changes depending on background mutations. Switches happen more frequently for mutations at functionally important sites. Interestingly, the authors find that even synonymous mutations in stop codons can alter the epistatic interaction between mutations in other codons. Consistent with these observations of "fluidity", the study reports limited instances of global epistasis (which predicts a simple linear relationship between the size of a mutational effect and the fitness of the genetic background in which it occurs). Overall, the work presents some evidence for the genetic context-dependent nature of epistasis in this system. 

      Weaknesses: 

      Despite the wealth of information provided by the study, there are some shortcomings of the paper which must be mentioned. 

      (1) In the Significance Statement, the authors say that the "fluid" nature of epistasis is a previously unknown property. This is not accurate. What the authors describe as "fluidity" is essentially the prevalence of certain forms of higher-order epistasis (i.e., epistasis beyond pairwise mutational interactions). The existence of higher-order epistasis is a well-known feature of many landscapes. For example, in an early work, (Szendro et al., J. Stat. Mech., 2013), the presence of a significant degree of higher-order epistasis was reported for a number of empirical fitness landscapes. Likewise, (Weinreich et al., Curr. Opin. Genet. Dev., 2013) analysed several fitness landscapes and found that higher-order epistatic terms were on average larger than the pairwise term in nearly all cases. They further showed that ignoring higher-order epistasis leads to a significant overestimate of accessible evolutionary paths. The literature on higher-order epistasis has grown substantially since these early works. Any future versions of the present preprint will benefit from a more thorough contextual discussion of the literature on higher-order epistasis.

      (2) In the paper, the term 'sign epistasis' is used in a way that is different from its well-established meaning. (Pairwise) sign epistasis, in its standard usage, is said to occur when the effect of a mutation switches from beneficial to deleterious (or vice versa) when a mutation occurs at a different locus. The authors require a stronger condition, namely that the sum of the individual effects of two mutations should have the opposite sign from their joint effect. This is a sufficient condition for sign epistasis, but not a necessary one. The property studied by the authors is important in its own right, but it is not equivalent to sign epistasis.
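
      The two definitions contrasted above can be placed side by side in a short sketch. It assumes additive-scale effects measured relative to a common reference genotype, with s1 and s2 the single-mutant effects and s12 the joint effect of the double mutant; the function names and the example values are hypothetical.

```python
def has_sign_epistasis(s1, s2, s12):
    """Standard pairwise sign epistasis: the effect of at least one
    mutation changes sign when the other mutation is present."""
    s1_given_2 = s12 - s2   # effect of mutation 1 on the background carrying 2
    s2_given_1 = s12 - s1   # effect of mutation 2 on the background carrying 1
    return s1 * s1_given_2 < 0 or s2 * s2_given_1 < 0

def meets_stronger_condition(s1, s2, s12):
    """Condition described for the paper: the sum of the individual
    effects has the opposite sign from the joint effect."""
    return s12 * (s1 + s2) < 0

# Hypothetical example: both mutations are beneficial alone and the double
# mutant is still beneficial, yet mutation 1 is deleterious once 2 is present.
s1, s2, s12 = 0.3, 0.2, 0.1
print(has_sign_epistasis(s1, s2, s12), meets_stronger_condition(s1, s2, s12))
```

      This pair shows sign epistasis in the standard sense without meeting the stronger condition, which illustrates why the stronger condition is not a necessary one.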

      (3) The authors have looked for global epistasis in all 108 (9x12) mutations, out of which only 16 showed a correlation of R^2 > 0.4. 14 out of these 16 mutations were in the functionally important nucleotide positions. Based on this, the authors conclude that global epistasis is rare in this landscape, and further, that mutations in this landscape can be classified into one of two binary states - those that exhibit global epistasis (a small minority) and those that do not (the majority). I suspect, however, that a biologically significant binary classification based on these data may be premature. Unsurprisingly, mutational effects are stronger at the functional sites as seen in Figure 5 and Figure 2, which means that even if global epistasis is present for all mutations, a statistical signal will be more easily detected for the functionally important sites. Indeed, the authors show that the means of DFEs decrease linearly with background fitness, which hints at the possibility that a weak global epistatic effect may be present (though hard to detect) in the individual mutations. Given the high importance of the phenomenon of global epistasis, it pays to be cautious in interpreting these results. 

      (4) The study reports that synonymous mutations frequently change the nature of epistasis between mutations in other codons. However, it is unclear whether this should be surprising, because, as the authors have already noted, synonymous mutations can have an impact on cellular functions. The reader may wonder if the synonymous mutations that cause changes in epistatic interactions in a certain background also tend to be non-neutral in that background. Unfortunately, the fitness effect of synonymous mutations has not been reported in the paper. 

      (5) The authors find that DFEs of high-fitness genotypes tend to depend only on fitness and not on genetic composition. This is an intriguing observation, but unfortunately, the authors do not provide any possible explanation or connect it to theoretical literature. I am reminded of work by (Agarwala and Fisher, Theor. Popul. Biol., 2019) as well as (Reddy and Desai, eLife, 2023) where conditions under which the DFE depends only on the fitness have been derived. Any discussion of possible connections to these works could be a useful addition.  

      We thank the reviewer for the summary of our work and for highlighting both its strengths and areas for improvement. We have carefully considered the points raised and revised the manuscript accordingly. The revised version:

      (1) Clarifies the conceptual framework. We emphasize the distinction between background-dependent, higher-order epistasis and global nonlinearities. To avoid ambiguity, we have replaced the term “fluid” epistasis with higher-order epistasis throughout, in line with prior literature (e.g. Szendro et al., 2013; Weinreich et al., 2013). We now explicitly situate our results in the context of these studies and clarify our definitions of epistasis, correcting the earlier error where “strong sign epistasis” was used in place of “sign epistasis.”

      (2) Improves statistical rigor. We now incorporate replicate variance and statistical error criteria in place of arbitrary thresholds. This ensures that classification of epistasis reflects experimental precision rather than fixed, arbitrary cutoffs.

      (3) Expands treatment of synonymous mutations. We now explicitly analyze synonymous mutations, separating those that are neutral from those that are non-neutral. Our results show that non-neutral synonymous mutations are disproportionately responsible for altering epistatic interactions, while neutral synonymous mutations rarely do so. We also report the fitness effects of synonymous mutations directly and include new analyses showing that there is no correlation between the mean fitness effect of a synonymous mutation and the frequency with which it alters epistasis (Supplementary Fig. S11).

      These revisions strengthen both the rigor and the clarity of the manuscript. We hope they address the reviewer’s concerns and make the significance of our findings, particularly the site-resolved quantification of higher-order epistasis in the folA landscape, including in synonymous mutations, more apparent.

      Reviewing Editor Comments: 

      Key revision suggestions: 

      (1) Please quantify the impact of measurement noise on your conclusions, and perform statistical analysis to determine whether the observed differences of epistasis due to different backgrounds are statistically significant. 

      (2) Please investigate how your conclusions depend on the cutoffs, and consider choosing them based on statistical criteria. 

      (3) Please reconsider the possible role of global epistasis. In particular, the effect of bounds on fitness values. All reviewers are concerned that all claims, including about global epistasis, may be consistent with a simple null model where most low fitness genotypes are non-functional and variation in their fitness is simply driven by measurement noise. Please provide a convincing argument rejecting this model. 

      More generally, we recommend that you consider all suggestions by reviewers, including those about results, but also those about terminology and citing relevant works. 

      Thank you for your guidance. We have substantially revised the manuscript to incorporate the reviewers’ suggestions. In addition to addressing the three central issues raised, we have refined terminology, expanded the discussion of prior work, and clarified the presentation of our main results. We believe these changes significantly strengthen both the rigor and the impact of the study. We are grateful to the Reviewing Editor and reviewers for their constructive feedback.

      In the revised manuscript, we address the three major points as follows:

      (1) Quantifying measurement noise and statistical significance. We now use the average of six independent experimental runs for each genotype, together with the corresponding standard deviations, to explicitly quantify measurement uncertainty. Pairwise and higher-order epistasis are assessed relative to these error estimates, rather than against fixed thresholds. This ensures that differences across genetic backgrounds are statistically distinguishable from noise.

      (2) Replacing arbitrary cutoffs with statistical criteria. We have eliminated the use of arbitrary thresholds. Instead, classification of interactions (positive, negative, or neutral epistasis) is based on whether fitness differences exceed replicate variance. This approach scales naturally with measurement precision. While some results change quantitatively for high-fitness backgrounds and qualitatively for low-fitness backgrounds, our central conclusions remain robust.

      (3) Analysis of synonymous mutations. We now separately analyze synonymous mutations to test their role in altering epistasis. Our results show that there is no correlation between the average fitness effect of a synonymous mutation and the frequency with which it changes epistatic interactions.

      We have revised terminology for clarity (replacing “fluid” with higher-order epistasis) and updated the Discussion to place our work in the broader context of the literature on higher-order epistasis.

      Finally, we have rewritten the entire manuscript to improve clarity, refine the narrative flow, and ensure that the presentation more crisply reflects the subject of the study.

      Reviewer #1 (Recommendations for the authors): 

      MINOR COMMENTS 

      (1) Lines 102-107. Papkou's definition of non-functional genotypes makes sense since it is based on the fact that some genotypes are statistically indistinguishable in terms of fitness from mutants with premature stop codons in folA. It doesn't really matter whether to call them low fitness or non-functional, but it would be helpful to explain the basis for this distinction. 

      Thank you for raising this point. To maintain consistency with the original dataset and analysis, we retain Papkou et al.’s nomenclature and refer to these genotypes as “functional” or “non-functional.” 

      (2) Lines 111-112. I think the authors need to briefly explain here how they define the absence of epistasis. They do so in the Methods, but this information is essential and needs to be conveyed to the reader in the Results as well. 

      Thank you for the suggestion. We agree that this definition is essential for readers to follow the Results. In the revised manuscript, we have added a brief explanation at the start of the Results section clarifying how we define the absence of epistasis. Specifically, we now state that two mutations are considered non-epistatic when the observed fitness of the double mutant is statistically indistinguishable (within error of six replicates) from the additive expectation based on the single mutants. This ensures that the Results section is self-contained, while full details remain in the Methods.

      (3) Lines 142 and elsewhere. The authors introduce the qualifier "fluid" to describe the fact that the value or sign of pairwise epistasis changes across genetic backgrounds. I don't see a need for this new terminology, since it is already captured adequately by the term "higher-order epistasis". The epistasis field is already rife with jargon, and I would prefer if new terms were introduced only when absolutely necessary. 

      Thank you for this helpful suggestion. We agree that introducing new terminology is unnecessary here. In the revised manuscript, we have replaced the term “fluid” epistasis with “higher-order epistasis” throughout, to align with established usage and avoid adding jargon.

      (4) Figure 6. I don't think this is the best way of showing that the pivot points are clustered. A histogram would be more appropriate and would take less space. It would also allow the authors to display a null distribution to demonstrate that this clustering is indeed surprising. 

      (5) Lines 320-321. Mann-Whitney U tests whether one distribution is systematically shifted up or down relative to the other. Please change the language here. It looks like the authors also performed the Kolmogorov-Smirnov test, which is appropriate, but it doesn't look like the results are reported anywhere. Please report. 

      (6) Lines 330-334. The fact that HF genotypes seem to have more similar DFEs than LF genotypes is somewhat counterintuitive. Could this be an artifact of the fact that any two random HF genotypes are more similar to each other than any two randomly sampled LF genotypes? 

      (7) Line 427. The sentence "The set of these selected variants are assigned their one hamming distance neighbours to construct a new 𝑛-base sequence space" is confusing. I think it is pretty clear how to construct an n-base sequence space, and this sentence adds more confusion than it removes. 

      We now start the results section of the manuscript with a brief description of how each type of epistasis is defined. Specifically, we now state that two mutations are considered non-epistatic when the observed fitness of the double mutant is statistically indistinguishable (within the error of six replicates) from the additive expectation based on the single mutants. This ensures that the Results section is self-contained, while full details remain in the Methods.
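
      A minimal sketch of the criterion described above, not the manuscript's actual code: the double-mutant fitness is compared with the additive expectation, and the pair is called non-epistatic when the deviation lies within a multiple of the propagated standard error. The function name `classify_epistasis`, the threshold `z`, the use of a propagated standard error, and the replicate values are all assumptions made for illustration; full details of the actual procedure remain in the Methods.

```python
from math import sqrt
from statistics import mean, stdev


def classify_epistasis(wt, a, b, ab, z=2.0):
    """Classify pairwise epistasis from replicate fitness measurements.

    wt, a, b, ab: replicate fitness values for the background, the two
    single mutants, and the double mutant. z is a hypothetical threshold
    (in units of the propagated standard error), not a value from the paper.
    """
    # Deviation of the double mutant from the additive expectation.
    eps = mean(ab) - mean(a) - mean(b) + mean(wt)
    # Standard error of eps, assuming independent replicate measurements.
    se = sqrt(sum(stdev(x) ** 2 / len(x) for x in (wt, a, b, ab)))
    if abs(eps) <= z * se:
        return "none"
    return "positive" if eps > 0 else "negative"


# Hypothetical replicate data (six runs per genotype).
noise = [0.00, 0.01, -0.01, 0.00, 0.02, -0.02]
wt = [0.0 + n for n in noise]
a = [0.1 + n for n in noise]
b = [0.2 + n for n in noise]
additive_ab = [0.3 + n for n in noise]     # matches the additive expectation
synergistic_ab = [0.5 + n for n in noise]  # fitter than expected
antagonistic_ab = [0.1 + n for n in noise] # less fit than expected

print(classify_epistasis(wt, a, b, additive_ab))
print(classify_epistasis(wt, a, b, synergistic_ab))
print(classify_epistasis(wt, a, b, antagonistic_ab))
```

      Because the threshold is expressed in units of the standard error, the classification tightens automatically as replicate precision improves, which is the property a fixed cutoff lacks.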

      We also agree that introducing new terminology is unnecessary. In the revised manuscript, we have replaced the term “fluid” epistasis with “higher-order epistasis” throughout, to align with established usage and avoid adding jargon. Finally, we concur that the identified sentence was unnecessary and potentially confusing; it has been removed from the revised manuscript to improve clarity. In fact, we have rewritten the entire manuscript for better flow and readability. 

      Reviewer #2 (Recommendations for the authors): 

      (1) Supplementary Figure S2A and S3 seem to be the same. 

      (3) The classification scheme for reciprocal sign/single sign/other sign epistasis differs from convention and should be made more explicit or renamed. 

      (4) Re the claim that high and low fitness backgrounds have different frequencies of the various types of epistasis: 

      Are the differences between high- and low-fitness backgrounds in the frequency distributions of the different types of epistasis statistically significant? It seems that they follow similar general patterns, and the sample size is much smaller for high-fitness backgrounds, so more variance in their distributions is expected. 

      Does bounding of fitness measurements play a role in generating the differences in types of epistasis seen in high- vs. low-fitness backgrounds? If many variants are at the lower bound of the fitness assay, then positive epistasis might simply be less detectable for these backgrounds (which seems to be the biggest difference between high/low fitness backgrounds). 

      (5) In Figure 4B, points are not independent, because the mutation effects are calculated for all mutations in all backgrounds, rather than with reference to a single background or fluorescence value. The same mutations are therefore counted many times. 

      (6) It is not clear how the "pivot growth rate" was calculated or what the importance of this metric is. 

      (7) In the introduction, the justification for reanalyzing the Papkou et al dataset in particular is not clear. 

      (8) Epistasis at the nucleotide level is expected because of the genetic code: fitness and function are primarily affected by amino acid changes, and nucleotide mutations will affect amino acids depending on the state at other nucleotide sites in the same codon. For the most part, this is not explicitly taken account of in the paper. I recommend separating apparent epistasis due to the genetic code from that attributable to dependence among codons. 

      Thank you for noting this. Figure S2A shows results for high-fitness peaks only, whereas Figure S3 shows results for all peaks across the landscape. We have now made this distinction explicit in the figure legends and main text of the revised manuscript. 

      In the revised analysis, peaks are defined using the average fitness across six experimental replicates along with the corresponding standard deviation. Each genotype is compared with all single-step neighbors, and it is classified as a peak only if its mean fitness is significantly higher than all neighbors (p < 0.05). This procedure explicitly accounts for measurement error and replaces the arbitrary thresholding used previously. Full details are now described in the Methods.
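
      The peak criterion described above can be sketched as follows. This is an illustrative reconstruction under stated assumptions, not the analysis code: the helper names (`neighbors`, `p_greater`, `is_peak`), the toy two-base landscape, and in particular the one-sided Welch-type statistic with a normal approximation are all hypothetical choices; the test actually used is described in the Methods.

```python
from math import sqrt
from statistics import NormalDist, mean, stdev

BASES = "ACGT"


def neighbors(genotype):
    """Yield all genotypes one nucleotide substitution away."""
    for i, base in enumerate(genotype):
        for alt in BASES:
            if alt != base:
                yield genotype[:i] + alt + genotype[i + 1:]


def p_greater(x, y):
    """One-sided p-value for mean(x) > mean(y).

    Uses a Welch-type statistic with a normal approximation; this is an
    assumption for illustration, not the test used in the manuscript.
    """
    se = sqrt(stdev(x) ** 2 / len(x) + stdev(y) ** 2 / len(y))
    if se == 0:
        return 0.0 if mean(x) > mean(y) else 1.0
    return 1.0 - NormalDist().cdf((mean(x) - mean(y)) / se)


def is_peak(genotype, landscape, alpha=0.05):
    """True if the genotype is significantly fitter than every measured neighbor."""
    return all(
        p_greater(landscape[genotype], landscape[n]) < alpha
        for n in neighbors(genotype)
        if n in landscape
    )


# Toy landscape: 16 two-base genotypes, six hypothetical replicates each.
noise = [0.00, 0.01, -0.01, 0.00, 0.02, -0.02]
landscape = {x + y: [0.0 + n for n in noise] for x in BASES for y in BASES}
landscape["AA"] = [1.0 + n for n in noise]  # a single clearly fitter genotype

print([g for g in landscape if is_peak(g, landscape)])
```

      Genotypes whose fitness is merely equal to, or not distinguishable from, that of a neighbor are not counted as peaks, which is how an error-aware criterion can reduce the peak count relative to a single-run analysis.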

      To avoid confusion, we now state our definitions explicitly at the start of the analysis and have corrected our definition in the text. We define sign epistasis as the case in which at least one mutation switches from being beneficial to deleterious. 
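
      For concreteness, the definition of sign epistasis can be written as a check on the four fitness values of a two-mutation square. The sketch below is illustrative only: the function names and the example fitness values are hypothetical, and measurement error is ignored here, whereas the manuscript assesses effects relative to replicate error.

```python
def mutation_effects(f00, f10, f01, f11):
    """Effects of mutations A and B in each other's absence and presence.

    f00: background, f10: A only, f01: B only, f11: both mutations.
    """
    a_alone, a_with_b = f10 - f00, f11 - f01
    b_alone, b_with_a = f01 - f00, f11 - f10
    return a_alone, a_with_b, b_alone, b_with_a


def has_sign_epistasis(f00, f10, f01, f11):
    """True if at least one mutation's effect changes sign across backgrounds."""
    a0, a1, b0, b1 = mutation_effects(f00, f10, f01, f11)
    return a0 * a1 < 0 or b0 * b1 < 0


def has_reciprocal_sign_epistasis(f00, f10, f01, f11):
    """True if both mutations' effects change sign across backgrounds."""
    a0, a1, b0, b1 = mutation_effects(f00, f10, f01, f11)
    return a0 * a1 < 0 and b0 * b1 < 0


# Hypothetical fitness values (f00, f10, f01, f11).
single_sign = (0.0, 1.0, 0.5, 0.8)   # only B flips: +0.5 alone, -0.2 with A
reciprocal = (0.0, 1.0, 1.0, 0.5)    # both flip: +1.0 alone, -0.5 together
magnitude = (0.0, 1.0, -1.0, 0.5)    # effects change size but not sign

for square in (single_sign, reciprocal, magnitude):
    print(has_sign_epistasis(*square), has_reciprocal_sign_epistasis(*square))
```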

      We have clarified our motivation in the Introduction. The Papkou et al. dataset is the most comprehensive experimental map of a complete 9-bp region of folA and provides six independent replicates, making it uniquely suited for testing hypotheses about background-dependent epistasis. Importantly, Papkou et al. based their conclusions on a single run, whereas our reanalysis incorporates replicate means and variances, leading to substantive differences; for example, a reduction in reported peaks from 514 to 127. By recalibrating the analysis, we provide a more rigorous account of this landscape and highlight how methodological choices affect conclusions.

      We also agree that some nucleotide-level epistasis reflects the structure of the genetic code (i.e., codon degeneracy and context-dependence of amino acid substitutions). In the revised manuscript, we explicitly separate epistasis attributable to codon structure from epistasis arising among codons. For example, synonymous mutations that alter epistasis within codons are treated separately from those affecting interactions across codons, and this distinction is now clearly indicated in the Results.

      Reviewer #3 (Recommendations for the authors): 

      (1) The analysis of peak density and accessibility in the paragraph starting on line 96 seems a bit out of context. Its connection with the various forms of epistasis treated in the rest of the paper is unclear. 

      (2) As mentioned in the Public Review, the term 'sign epistasis' has been used in a non-standard way. My suggestion would be to use a different term. Even a slightly modified term, such as "strong sign epistasis", should help to avoid any confusion. 

      (3) I mentioned in the Public Review that it is not clear whether the synonymous mutations that change the type of epistasis also tend to be non-neutral. This issue could be addressed by computing, for example, the fitness effects of all synonymous mutations for backgrounds and mutation pairs where a switch in epistasis occurs, and comparing it with fitness effects where no such switch occurs. 

      (4) Do the authors have any proposal for why synonymous mutations seem to cause more frequent changes in epistasis in low-fitness backgrounds? Related to this, is there any systematic difference between the types of switch caused by synonymous mutations in the low- versus high-fitness backgrounds? 

      (5) It is unclear exactly how the pivot points were determined, especially since the data for many mutations is noisy. The protocol should be provided in the Methods section. 

      (6) Line 303: possible typo, "accurate" --> "inaccurate". 

      (7) The value of Delta used for the "phenotypic DFE" has not been mentioned in the main text (including Methods).

      We agree that the connection needed to be clearer. In the revised manuscript, we (i) relocate and retitle this material as a brief “Landscape overview” preceding the epistasis analyses, (ii) explicitly link multi-peakedness and path accessibility to epistasis (e.g., multi-peak structure implies the presence of sign/reciprocal-sign epistasis; accessibility is shaped by background-dependent effects), and (iii) move derivations to the Supplement. We also recomputed peak density and accessibility using replicate-averaged fitness with replicate SDs, so the overview and downstream epistasis sections now use a single, error-aware landscape (updated in Figs. 1–3, with cross-references in the text).

      We have aligned our terminology and now state definitions upfront. 

      After replacing fixed cutoffs with replicate-based error criteria, switches are more frequent in high-fitness backgrounds (Fig. 3). Mechanistically, near the lower fitness bound, deleterious effects are masked (global nonlinearity), reducing apparent switching. Functional/high-fitness backgrounds allow both beneficial and deleterious outcomes, so background-dependent (higher-order) interactions manifest more readily. Switch types also vary by background fitness: high-fitness backgrounds show more sign/strong-sign switches, whereas low-fitness backgrounds show mostly magnitude reclassifications (Fig. 3C; Supplement Fig. Sx).

      Finally, we corrected a typo by replacing “accurate” with “inaccurate” and now define Δ (equal to 0.05) in the main text (in Results and Figure 8 caption).

    1. Author response:

      The following is the authors’ response to the original reviews.

      Reviewer #1 (Public review):

      Summary:

      Dendrotweaks provides its users with a solid tool to implement, visualize, tune, validate, understand, and reduce single-neuron models that incorporate complex dendritic arbors with differential distribution of biophysical mechanisms. The visualization of dendritic segments and biophysical mechanisms therein provide users with an intuitive way to understand and appreciate dendritic physiology.

      Strengths:

      (1) The visualization tools are simplified, elegant, and intuitive.

      (2) The ability to build single-neuron models using simple and intuitive interfaces.

      (3) The ability to validate models with different measurements.

      (4) The ability to systematically and progressively reduce morphologically-realistic neuronal models.

      Weaknesses:

      (1) Inability to account for neuron-to-neuron variability in structural, biophysical, and physiological properties in the model-building and validation processes.

      We agree with the reviewer that it is important to account for neuron-to-neuron variability. The core approach of DendroTweaks, and its strongest aspect, is the interactive exploration of how morpho-electric parameters affect neuronal activity. In light of this, variability can be achieved through the interactive updating of the model parameters with widgets. In a sense, by adjusting a widget (e.g., channel distribution or kinetics), a user ends up with a new instance of a cell in the parameter space and receives almost real-time feedback on how this change affected neuronal activity. This approach is much simpler than implementing complex optimization protocols for different parameter sets, which would detract from the interactivity aspect of the GUI. In its revised version, DendroTweaks also accounts for neuron-to-neuron morphological variability, as channel distributions are now based on morphological domains (rather than the previous segment-specific approach). This makes it possible to apply the same biophysical configuration across various morphologies. Overall, both biophysical and morphological variability can be explored within DendroTweaks. 

      (2) Inability to account for the many-to-many mapping between ion channels and physiological outcomes. Reliance on hand-tuning provides a single biased model that does not respect pronounced neuron-to-neuron variability observed in electrophysiological measurements.

      We acknowledge the challenge of accounting for degeneracy in the relation between ion channels and physiological outcomes and the importance of capturing neuron-to-neuron variability. One possible way to address this, as we mention in the Discussion, is to integrate automated parameter optimization algorithms alongside the existing interactive hand-tuning with widgets. In its revised version, DendroTweaks can integrate with Jaxley (Deistler et al., 2024) in addition to NEURON. The models created in DendroTweaks can now be run with Jaxley (although not all types of models, see the limitations in the Discussion), and their parameters can be optimized via automated and fast gradient-based parameter optimization, including optimization of heterogeneous channel distributions. In particular, a key advantage of integrating Jaxley with DendroTweaks was its NMODL-to-Python converter, which significantly reduced the need to manually re-implement existing ion channel models for Jaxley (see here: https://dendrotweaks.readthedocs.io/en/latest/tutorials/convert_to_jaxley.html).

      (1) Michael Deistler, Kyra L. Kadhim, Matthijs Pals, Jonas Beck, Ziwei Huang, Manuel Gloeckler, Janne K. Lappalainen, Cornelius Schröder, Philipp Berens, Pedro J. Gonçalves, Jakob H. Macke. Differentiable simulation enables large-scale training of detailed biophysical models of neural dynamics. bioRxiv 2024.08.21.608979; doi: https://doi.org/10.1101/2024.08.21.608979

      (3) Lack of a demonstration on how to connect reduced models into a network within the toolbox.

      Building a network of reduced models is an exciting direction, yet beyond the scope of this manuscript, whose primary goal is to introduce DendroTweaks and highlight its capabilities. DendroTweaks is designed for single-cell modeling, aiming to cover its various aspects in great detail. Of course, we expect refined single-cell models, both detailed and simplified, to be further integrated into networks. But this does not need to occur within DendroTweaks. We believe this network-building step is best handled by dedicated network simulation platforms. To facilitate the network-building process, we extended the exporting capabilities of DendroTweaks. To enable the export of reduced models in DendroTweaks’s modular format, as well as in plain simulator code, we implemented a method to fit the resulting parameter distributions to analytical functions (e.g., polynomials). This approach provided a compact representation, requiring a few coefficients to be stored in order to reproduce a distribution, independently of the original segmentation. The reduced morphologies can be exported as SWC files, standardized ion channel models as MOD files, and channel distributions as JSON files. Moreover, plain NEURON code (Python) to instantiate a cell class can be automatically generated for any model, including the reduced ones. Finally, to demonstrate how these exported models can be integrated into larger simulations, we implemented a "toy" network model in a Jupyter notebook included as an example in the GitHub repository. We believe that these changes greatly facilitate the integration of DendroTweaks-produced models into networks while also allowing users to run these networks on their favorite platforms.
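
      The idea of storing a parameter distribution as a few polynomial coefficients, independently of the segmentation, can be illustrated with a short sketch. This is not DendroTweaks code: the conductance profile, the JSON keys (`function`, `coeffs`), and the variable names are hypothetical, and only standard NumPy polynomial routines are used.

```python
import json

import numpy as np
from numpy.polynomial import polynomial as P

# Hypothetical conductance profile sampled at 31 segment centers.
distance = np.linspace(0.0, 300.0, 31)  # path distance from the soma (µm)
gbar = 1e-4 + 2e-6 * distance - 4e-9 * distance**2

# Fit a degree-2 polynomial; coefficients are ordered from low to high degree.
coeffs = P.polyfit(distance, gbar, deg=2)

# Compact, segmentation-independent record (hypothetical schema).
record = json.dumps({"function": "polynomial", "coeffs": coeffs.tolist()})

# Reproduce the distribution on a different, coarser segmentation.
restored = np.array(json.loads(record)["coeffs"])
new_distance = np.linspace(0.0, 300.0, 7)
new_gbar = P.polyval(new_distance, restored)

print(len(coeffs), new_gbar)
```

      Three coefficients are enough here to regenerate the profile at any set of locations, so the same record can be applied after the morphology has been re-segmented or reduced.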

      (4) Lack of a set of tutorials, which is common across many "Tools and Resources" papers, that would be helpful in users getting acquainted with the toolbox.

      This is an important point that we believe has been addressed fully in the revised version of the tool and manuscript. As previously mentioned, the lack of documentation was due to the software's early stage. We have now added comprehensive documentation, which is available at https://dendrotweaks.readthedocs.io. This extensive material includes API references, 12 tutorials, 4 interactive Jupyter notebooks, and a series of video tutorials, and it is regularly updated with new content. Moreover, the toolbox's GUI with example models is available through our online platform at https://dendrotweaks.dendrites.gr.  

      Reviewer #2 (Public review):

      The paper by Makarov et al. describes the software tool called DendroTweaks, intended for the examination of multi-compartmental biophysically detailed neuron models. It offers extensive capabilities for working with very complex distributed biophysical neuronal models and should be a useful addition to the growing ecosystem of tools for neuronal modeling.

      Strengths

      (1) This Python-based tool allows for visualization of a neuronal model's compartments.

      (2) The tool works with morphology reconstructions in the widely used .swc and .asc formats.

      (3) It can support many neuronal models using the NMODL language, which is widely used for neuronal modeling.

      (4) It permits one to plot the properties of linear and non-linear conductances in every compartment of a neuronal model, facilitating examination of the model's details.

      (5) DendroTweaks supports manipulation of the model parameters and morphological details, which is important for the exploration of the relations of the model composition and parameters with its electrophysiological activity.

      (6) The paper is very well written - everything is clear, and the capabilities of the tool are described and illustrated with great attention to detail.

      Weaknesses

      (1) Not a really big weakness, but it would be really helpful if the authors showed how the performance of their tool scales. This can be done for an increasing number of compartments - how long does it take to carry out typical procedures in DendroTweaks, on a given hardware, for a cell model with 100 compartments, 200, 300, and so on? This information will be quite useful to understand the applicability of the software.

      DendroTweaks functions as a layer on top of a simulator. As a result, its performance scales in the same way as for a given simulator. The GUI currently displays the time taken to run a simulation (e.g., in NEURON) at the bottom of the Simulation tab in the left menu. While Bokeh-related processing and rendering also consume time, this is not as straightforward to measure. It is worth noting, however, that this time is short and approximately equivalent to rendering the corresponding plots elsewhere (e.g., in a Jupyter notebook), and thus adds negligible overhead to the total simulation time. 

      (2) Let me also add here a few suggestions (not weaknesses, but something that can be useful, and if the authors can easily add some of these for publication, that would strongly increase the value of the paper).

      (3) It would be very helpful to add functionality to read major formats in the field, such as NeuroML and SONATA.

      We agree with the reviewer that support for major formats will substantially improve the toolbox, ensuring the reproducibility and reusability of the models. While integration with these formats has not been fully implemented, we have taken several steps to ensure elegant and reproducible model representation. Specifically, we have increased the modularity of model components and developed a custom compact data format tailored to single-cell modeling needs. We used a JSON representation inspired by the Allen Cell Types Database schema, modified to account for non-constant distributions of the model parameters. We have transitioned from a representation of parameter distributions dependent on specific segmentation graphs and sections to a more generalized domain-based distribution approach. In this revised methodology, segment groups are no longer explicitly defined by segment identifiers, but rather by specification of anatomical domains and conditional expressions (e.g., “select all segments in the apical domain with the maximum diameter < 0.8 µm”). Additionally, we have implemented the export of experimental protocols into CSV and JSON files, where the JSON files contain information about the stimuli (e.g., synaptic conductance, time constants), and the CSV files store locations of recording sites and stimuli. These features contribute toward a higher-level, structured representation of models, which we view as an important step toward eventual compatibility with standard formats such as NeuroML and SONATA. We have also initiated a two-way integration between DendroTweaks and SONATA. We developed a converter from DendroTweaks to SONATA that automatically generates SONATA files to reproduce models created in DendroTweaks. Additionally, support for the DendroTweaks JSON representation of biophysical properties will be added to the SONATA data format ecosystem, enabling models with complex dendritic distributions of channels. This integration is still in progress and will be included in the next version of DendroTweaks. While full integration with these formats is a goal for future releases, we believe the current enhancements to modularity and exportability represent a significant step forward, providing immediate value to the community.
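
      To illustrate what a domain-based group with a conditional expression amounts to, the sketch below selects segments by anatomical domain and a diameter condition rather than by explicit segment identifiers. The `Segment` class, the `select` helper, and the example values are hypothetical and do not reflect the DendroTweaks API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Segment:
    idx: int
    domain: str   # anatomical domain, e.g. "soma", "apical", "basal"
    diam: float   # diameter in µm


def select(segments, domain, condition=lambda seg: True):
    """Return segments in the given domain that satisfy the condition."""
    return [s for s in segments if s.domain == domain and condition(s)]


# Hypothetical morphology with four segments.
segments = [
    Segment(0, "soma", 12.0),
    Segment(1, "apical", 1.5),
    Segment(2, "apical", 0.6),
    Segment(3, "basal", 0.5),
]

# "Select all segments in the apical domain with diameter < 0.8 µm".
thin_apical = select(segments, "apical", lambda s: s.diam < 0.8)
print([s.idx for s in thin_apical])
```

      Because the rule refers only to domains and geometric properties, the same rule can be re-evaluated on a different morphology without editing any list of segment identifiers.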

      (4) Visualization is available as a static 2D projection of the cell's morphology. It would be nice to implement 3D interactive visualization.

      We offer an option to rotate a cell around the Y axis using a slider under the plot. This is a workaround, as implementing a true 3D visualization in Bokeh would require custom Bokeh elements, along with external JavaScript libraries. It's worth noting that there are already specialized tools available for 3D morphology visualization. In light of this, while a 3D approach is technically feasible, we advocate for a different method. The core idea of DendroTweaks’ morphology exploration is that each section is “clickable”, allowing its geometric properties to be examined in a 2D "Section" view. Furthermore, we believe the "Graph" view presents the overall cell topology and distribution of channels and synapses more clearly.

      (5) It is nice that DendroTweaks can modify the models, such as revising the radii of the morphological segments or ionic conductances. It would be really useful then to have the functionality for writing the resulting models into files for subsequent reuse.

      This functionality is fully available in local installations. Users can export JSON files with channel distributions and SWC files after morphology reduction through the GUI. Please note that for resource management purposes, file import/export is disabled on the public online demo. However, it can be enabled upon local installation by modifying the configuration file (app/default_config.json). In addition, it is now possible to generate plain NEURON (Python) code to reproduce a given model outside the toolbox (e.g., for network simulations). Moreover, it is now possible to export the simulation protocols as CSV files for locations of stimuli and recordings and JSON files for stimuli parameters.

      (6) If I didn't miss something, it seems that DendroTweaks supports the allocation of groups of synapses, where all synapses in a group receive the same type of Poisson spike train. It would be very useful to provide more flexibility. One option is to leverage the SONATA format, which has ample functionality for specifying such diverse inputs.

      Currently, each population of “virtual” neurons that form synapses on the detailed cell shares the same set of parameters for both biophysical properties of synapses (e.g., reversal potential, time constants) and presynaptic "population" activity (e.g., rate, onset). The parameter that controls an incoming Poisson spike train is the rate, which is indeed shared across all synapses in a population. Unfortunately, the current implementation lacks the capability to simulate complex synaptic inputs with heterogeneous parameters across individual synapses or those following non-uniform statistical distributions (the present implementation is limited to random uniform distributions). We have added this information in the Discussion (3. Discussion - 3.3 Limitations and future directions - ¶.5) to make users aware of the limitations. As it requires a substantial amount of additional work, we plan to address such limitations in future versions of the toolbox.
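
      The shared-rate input described above can be illustrated with a minimal sketch. This is not DendroTweaks code: the function names `poisson_spike_train` and `population_input`, their parameters, and the numerical values are hypothetical and chosen only to show how every synapse in a population can receive an independent Poisson train drawn with the same rate and onset.

```python
import random


def poisson_spike_train(rate_hz, duration_ms, onset_ms=0.0, rng=None):
    """Spike times (ms) of a homogeneous Poisson process.

    Spikes fall between onset_ms and duration_ms (the end of the simulation).
    """
    rng = rng or random.Random()
    if rate_hz <= 0:
        return []
    spikes = []
    t = onset_ms
    while True:
        # Inter-spike intervals of a Poisson process are exponentially
        # distributed with mean 1 / rate; convert seconds to milliseconds.
        t += rng.expovariate(rate_hz) * 1000.0
        if t >= duration_ms:
            break
        spikes.append(t)
    return spikes


def population_input(n_synapses, rate_hz, duration_ms, onset_ms=0.0, seed=0):
    """One independent train per synapse; rate and onset are shared."""
    rng = random.Random(seed)
    return [
        poisson_spike_train(rate_hz, duration_ms, onset_ms, rng)
        for _ in range(n_synapses)
    ]


if __name__ == "__main__":
    trains = population_input(n_synapses=5, rate_hz=20.0,
                              duration_ms=1000.0, onset_ms=100.0)
    for i, train in enumerate(trains):
        print(f"synapse {i}: {len(train)} spikes")
```

      Heterogeneous input of the kind the reviewer asks for would amount to passing a different rate (or onset) to each call, which is the capability the response lists as future work.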

      (7) "Each session can be saved as a .json file and reuploaded when needed" - do these files contain the whole history of the session or the exact snapshot of what is visualized when the file is saved? If the latter, which variables are saved, and which are not? Please clarify.

      In the previous implementation, these files captured the exact snapshot of the model's latest state. In the new version, we adopted a modular approach where the biophysical configuration (e.g., channel distributions) and stimulation protocols are exported to separate files. This allows the user to easily load and switch the stimulation protocols for a given model. In addition, the distribution of parameters (e.g., channel conductances) is now based on the morphological domains and is agnostic of the exact morphology (i.e., sections and segments), which allows the same JSON files with biophysical configurations to be reused across multiple similar morphologies. This also allows for easy file exchange between the GUI and the standalone version.

      Joint recommendations to Authors:

      The reviewers agreed that the paper is well written and that DendroTweaks offers a useful collection of tools to explore models of single-cell biophysics. However, the tooling as provided with this submission has critical limitations in the capabilities, accessibility, and documentation that significantly limit the utility of DendroTweaks. While we recognize that it is under active development and features may have changed already, we can only evaluate the code and documentation available to us here.

      We thank the reviewers for their positive evaluation of the manuscript and express our sincere appreciation for their feedback. We acknowledge the limitations they have pointed out and have addressed most of these concerns in our revised version.

      In particular, we would emphasize:

      (1) While the features may be rich, the documentation for either a user of the graphical interface or the library is extremely sparse. A collection of specific tutorials walking a GUI user through simple and complex model examples would be vital for genuine uptake. As one category of the intended user is likely to be new to computational modeling, it would be particularly good if this documentation could also highlight known issues that can arise from the naive use of computational techniques. Similarly, the library aspect needs to be documented in a more standard manner, with docstrings, an API function list, and more didactic tutorials for standard use cases.

      DendroTweaks now features comprehensive documentation. The standalone Python library code is well-documented with thorough docstrings. The overall code modularity and readability have improved. The documentation is created using the widely adopted Sphinx generator, making it accessible for external contributors, and it is available via ReadTheDocs https://dendrotweaks.readthedocs.io/en/latest/index.html. The documentation provides a comprehensive set of tutorials (6 basic, 6 advanced) covering all key concepts and workflows offered by the toolbox. Interactive Jupyter notebooks are included in the documentation, along with the quick start guide. All example models also have corresponding notebooks that allow users to build the model from scratch.

      The toolbox has its own online platform, where a quick-start guide for the GUI is available https://dendrotweaks.dendrites.gr/guide.html. We have created video tutorials for the GUI covering the basic use cases. Additionally, we have added tips and instructions alongside widgets in the GUI, as well as a status panel that displays application status, warnings, and other information. Finally, we plan to familiarize the community with the toolbox by organizing online and in-person tutorials, such as the one recently held at the CNS*2025 conference (https://cns2025florence.sched.com/event/25kVa/building-intuitive-and-efficient-biophysicalmodels-with-jaxley-and-dendrotweaks). Moreover, the toolbox has already been used successfully for training young researchers during the Taiwan NeuroAI 2025 Summer School, founded by Ching-Lung Hsu. The feedback was very positive.

      (2) The paper describes both a GUI web app and a Python library. However, the code currently mixes these two in a way that largely makes sense for the web app but makes it very difficult to use the library aspect. Refactoring the code to separate apps and libraries would be important for anyone to use the library as well as allowing others to host their own DendroTweaks servers. Please see the notes from the reviewing editor below for more details.

      The code in the previous `app/model` folder, responsible for the core functionality of the toolbox, has been extensively refactored and extended, and separated into a standalone library. The library is included in the Python Package Index (PyPI, https://pypi.org/project/dendrotweaks).

      Notes from the Reviewing Editor Comments (Recommendations for the authors):

      (1) While one could import morphologies and use a collection of ion channel models, details of synapse groups and stimulation approaches appeared to be only configurable manually in the GUI. The ability to save and load full neuron and simulation states would be extremely useful for reproducibility and sharing data with collaborators or as an interactive data product with a publication. There is a line in the text about saving states as json files (also mentioned by Reviewer #2), but I could see no such feature in the version currently online.

      We decided to reserve the online version for demonstration and educational purposes, with more example models being added over time. However, this functionality is available upon local installation of the app (and after enabling it in the ‘default_config.json’ in the root directory of the app). We’ve adopted a modular model representation that stores morphology, channel models, biophysical parameters, and stimulation protocols separately.

      (2) Relatedly, GUI exploration of complex data is often a precursor to a more automated simulation run. An easy mechanism to go from a user configuration to scripting would be useful to allow the early strength of GUIs to feed into the power of large-scale scripting.

      Any model can be easily exported to a modular DendroTweaks representation and later imported either into the GUI or, programmatically, into the standalone version. This ensures a seamless transition between the two use cases.

      (3) While the paper discusses DendroTweaks as both a GUI and a python library, the zip file of code in the submission is not in good form as a library. Back-end library code is intermingled with front-end web app code, which limits the ability to install the library from a standard python interface like PyPI. API documentation is also lacking. Functions tend to not have docstrings, and the few that do, do not follow typical patterns describing parameters and types.

      As stated above, all these issues have been resolved in the new version of the toolbox. The library code is now housed in a separate repository https://github.com/Poirazi-Lab/DendroTweaks and included in PyPI https://pypi.org/project/dendrotweaks. The classes and public methods follow NumPy-style docstrings, and the API reference is available in the documentation: https://dendrotweaks.readthedocs.io/en/latest/genindex.html.

      (4) Library installation is very difficult. The requirements are currently a lockfile, fully specifying exact versions of all dependencies. This is exactly correct for web app deployment to maintain consistency, but is not feasible in the context of libraries where you want to have minimal impact on a user's environment. Refactoring the library from the web app is critical for making DendroTweaks usable in both forms described in the paper.

      The lockfile makes installation more or less impossible on computer setups other than that of the author. Needless to say, this is not acceptable for a tool, and I would encourage the authors to ask other people to attempt to install their code as they describe in the text. For example, attempting to create a conda environment from the environment.yml file on an M1 MacBook Pro failed because it could not find several requirements. I was able to get it to install within a Linux docker image with the x86 platform specified, but this is not generally viable. To make this be the tool it is described as in text, this must be resolved. A common pattern that would work well here is to have a requirements lockfile and Docker image for the web app that imports a separate, more minimally restrictive library package that could be hosted on PyPI or, less conveniently, through conda-forge.

      The installation of the standalone library is now straightforward via `pip install dendrotweaks`. On the Windows platform, however, manual installation of NEURON is required as described in the official NEURON documentation https://nrn.readthedocs.io/en/8.2.6/install/install_instructions.html#windows.

      (5) As an aside, to improve potential uptake, the authors might consider an MIT-style license rather than the GNU Public License unless they feel strongly about the GPL. Many organizations are hesitant to build on GPL software because of the wide-ranging demands it places on software derived from or using GPL code.

      We thank the editor for this suggestion. We are considering changing the license to MPL 2.0. It will maintain copyleft restrictions only on the package files while allowing end-users to freely choose their own license for any derived work, including the models, generated data files, and code that simply imports and uses our package.

      Reviewer #1 (Recommendations for the authors):

      (1) Abstract: Neurons rely on the interplay between dendritic morphology and ion channels to transform synaptic inputs into a sequence of somatic spikes. Technically, this would have to be morphology, ion channels, pumps, transporters, exchangers, buffers, calcium stores, and other molecules. For instance, if the calcium buffer concentration is large, then there would be less free calcium for activating the calcium-activated potassium channels. If there are different chloride co-transporters - NKCC vs. KCC - expressed in the neuron or different parts of the neuron, that would alter the chloride reversal for all the voltage- or ligand-gated chloride channels in the neuron. So, while morphology and ion channels are two important parts of the transformation, it would be incorrect to ignore the other components that contribute to the transformation. The statement might be revised to make these two components as two critical components.

      The phrase “Two critical components” was added, as suggested by the reviewer.

      (2) Section 2.1 - The overall GUI looks intuitive and simple.

      (3) Section 2.2

      (a) The Graph view of morphology, especially accounting for the specific d_lambda is useful.

      (b) "Note that while microgeometry might not significantly affect the simulation at a low spatial resolution (small number of segments) due to averaging, it can introduce unexpected cell behavior at a higher level of spatial discretization."

      It might be good to warn the users that the compartmentalization and error analyses are with reference to the electrical lambda. If users have to account for calcium microdomains, these analyses wouldn't hold given the 2 orders of magnitude differences between the electrical and the calcium lambdas (e.g., Zador and Koch, J Neuroscience, 1994). Please sensitize users that the impact of active dendrites in regulating calcium microdomains and signaling is critical when it comes to plasticity models in morphologically realistic structures.

      We thank the reviewer for this important point. We have clarified in the text that our spatial discretization specifically refers to the electrical length constant. We acknowledge that electrical and chemical processes operate on fundamentally different spatial and temporal scales, which requires special consideration when modeling phenomena like synaptic plasticity. We have sensitized users about this distinction. However, we do not address such examples in the manuscript, thus leaving the detailed discussion of non-electrical compartmentalization beyond the scope of this work.
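
      To make the electrical discretization criterion concrete, the sketch below implements the d_lambda rule as it is commonly used with NEURON: the number of segments in a section is chosen so that each segment is at most a fraction `d_lambda` of the AC length constant at 100 Hz. Whether DendroTweaks uses exactly this expression internally is an assumption, and the function names and example values are illustrative only. The rule is purely electrical and says nothing about calcium diffusion length scales.

```python
import math


def lambda_f(diam_um, ra_ohm_cm, cm_uf_cm2, freq_hz=100.0):
    """AC length constant (um) of a cylinder at frequency freq_hz."""
    return 1e5 * math.sqrt(
        diam_um / (4.0 * math.pi * freq_hz * ra_ohm_cm * cm_uf_cm2)
    )


def nseg_d_lambda(length_um, diam_um, ra_ohm_cm, cm_uf_cm2,
                  d_lambda=0.1, freq_hz=100.0):
    """Odd number of segments so that each is <= d_lambda * lambda_f long."""
    lam = lambda_f(diam_um, ra_ohm_cm, cm_uf_cm2, freq_hz)
    return int((length_um / (d_lambda * lam) + 0.9) / 2) * 2 + 1


if __name__ == "__main__":
    # Illustrative dendrite: 200 um long, 1 um thick, Ra = 100 Ohm*cm, Cm = 1 uF/cm2
    print(lambda_f(1.0, 100.0, 1.0))
    print(nseg_d_lambda(200.0, 1.0, 100.0, 1.0))
```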

      (c) I am not very sure if the "smooth" tool for diameters that is illustrated is useful. Users shouldn't consider real variability in morphology as artifacts of reconstruction. As mentioned above, while this might not be an issue with electrical compartmentalization, calcium compartmentalization will severely be affected by small changes in morphology. Any model that incorporates calcium-gated channels should appropriately compartmentalize calcium. Without this, the spread of activation of calcium-dependent conductances would be an overestimate. Even small changes in cellular shape and curvature can have large impacts when it comes to signaling in terms of protein aggregation and clustering.

      Although this functionality is still available in the toolbox, we have removed the emphasis from it in the manuscript. Nevertheless, for the purpose of addressing the reviewer’s comment, we provide an example of when this “smoothing” might be needed: please see Figure S1 from Tasciotti et al. 2025.

      (2) Simone Tasciotti, Daniel Maxim Iascone, Spyridon Chavlis, Luke Hammond, Yardena Katz, Attila Losonczy, Franck Polleux, Panayiota Poirazi. From Morphology to Computation: How Synaptic Organization Shapes Place Fields in CA1 Pyramidal Neurons bioRxiv 2025.05.30.657022; doi: https://doi.org/10.1101/2025.05.30.657022

      (4) Section 2.3

      (a) The graphical representation of channel gating kinetics is very useful.

      (b) Please warn the users that experimental measurements of channel gating kinetics are extremely variable. Taking the average of the sigmoids or the activation/deactivation/inactivation kinetics provides an illusion that each channel subtype in a given cell type has fixed values of V_1/2, k, delta, and tau, but it is really a range obtained from several experiments. The heterogeneity is real and reflects cell-to-cell variability in channel gating kinetics, not experimental artifacts. Please sensitize the readers that there is not a single value for these channel parameters.

      This is a fair comment, and it refers to a general problem in neuronal modeling. In DendroTweaks, we follow the approach widely used in the community that indeed doesn't account for heterogeneity. We added a paragraph in the revised manuscript's Discussion (3. Discussion - 3.3 Limitations and future directions - ¶.3) to address this issue.

      (5) Section 2.4

      (a) Same as above: Please sensitize users that the gradients in channel conductances are measured as an average of measurements from several different cells. This gradient need not be present in each neuron, as there could be variability in location-dependent measurements across cells. The average following a sigmoid doesn't necessarily mean that each neuron will have the channel distributed with that specific sigmoid (or even a sigmoid!) with the specific parametric values that the average reported. This is extremely important because there is an illusion that the gradient is fixed across cells and follows a fixed functional form.

      We added this information to our Discussion in the same paragraph mentioned above.

      (b) Please provide an example where the half-maximal voltage of a channel varies as a function of distance (such as Poolos et al., Nature Neuroscience, 2002 or Migliore et al., 1999; Colbert and Johnston, 1997). This might require a step-like function in some scenarios. An illustration would be appropriate because people tend to assume that channel gating kinetics are similar throughout the dendrite. Again, please mention that these shifts are gleaned from the average and don't really imply that each neuron must have that specific gradient, given neuron-to-neuron variability in these measurements.

      We thank the reviewer for the provided literature, which we now cite when describing parameter distributions (2. Results - 2.4 Distributing ion channels - ¶.1). Please note that DendroTweaks' programming interface and data format natively support non-linear distribution of kinetic parameters alongside the channel conductances. As for the step-like function, users can either directly apply the built-in step-like distribution function or create it by combining two constant distributions.
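
      The idea of building a step-like distribution from two constant distributions can be sketched in a few lines. The helper names `constant` and `step`, the 100 µm threshold, and the half-activation values below are hypothetical and are not taken from the DendroTweaks API or from the cited studies; the sketch only shows the composition.

```python
def constant(value):
    """Distribution that returns the same value at any distance from the soma."""
    return lambda distance_um: value


def step(threshold_um, proximal_value, distal_value):
    """Step-like distribution composed of two constant distributions."""
    proximal = constant(proximal_value)
    distal = constant(distal_value)

    def distribution(distance_um):
        # Use the proximal constant below the threshold, the distal one beyond it.
        if distance_um < threshold_um:
            return proximal(distance_um)
        return distal(distance_um)

    return distribution


if __name__ == "__main__":
    # Illustrative only: a half-activation voltage (mV) that shifts at 100 um.
    v_half = step(threshold_um=100.0, proximal_value=-82.0, distal_value=-90.0)
    for d in (0.0, 50.0, 100.0, 300.0):
        print(d, v_half(d))
```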

      (6) Section 2.5

      (a) It might be useful to provide a mechanism for implementing the normalization of unitary conductances at the cell body, (as in Magee and Cook, 2000; Andrasfalvy et al., J Neuroscience, 2001). Specifically, users should be able to compute AMPAR conductance values at each segment which would provide a somatic EPSP value of 0.2 mV.

      This functionality is indeed useful and will be added in future releases. Currently, it has been mentioned in the list of known limitations when working with synaptic inputs (3. Discussion - 3.3 Limitations and future directions - ¶.5).

      (b) Users could be sensitized about differences in decay time constants of GABA_A receptors that are associated with parvalbumin vs. somatostatin neurons. As these have been linked to slow and fast gamma oscillations and different somatodendritic locations along different cell types, this might be useful (e.g., 10.1016/j.neuron.2017.11.033;10.1523/jneurosci.0261-20.2020; 10.7554/eLife.95562.1; 10.3389/fncel.2023.1146278).

      We thank the reviewer for highlighting this important biological detail. DendroTweaks enables users to define model parameters specific to their cell type of interest. For practical reasons, we leave the selection of biologically relevant parameters to the users. However, we will consider adding an explicit example in our tutorials to showcase the toolbox's flexibility in this regard.

      (7) Section 2.6

      While reducing the morphological complexity has its advantages, users of this tool should be sensitized in this section about how the reduction does not capture all the complexity of the dendritic computation. For instance, the segregation/amplification properties of Polsky et al., 2004, Larkum et al., 2009 would not be captured by a fully reduced model. An example across different levels of reductions, implementing simulations in Figure 7F (but for synapses on the same vs. different branches), would be ideal. Demonstrate segregation/amplification in the full model for the same set of synapses - coming on the same branch/different branch (linear integration of synapses on different branches and nonlinear integration of synapses on the same branch). Then, show that with different levels of reduction, this segregation/amplification vanishes in the reduced model. In addition, while impedance-based approaches account for electrical computation, calcium-based computation is not something that is accountable with reduced models, given the small lambda_calcium values. Given the importance of calcium-activated conductances in electrical behaviour, this becomes extremely important to account for and sensitize users to. The lack of such sensitization results in presumptuous reductions that assume that all dendritic computation is accounted for by reduced models!

      We agree with the reviewer that reduction leads to a loss in the complexity of dendritic computation. This has been stated in both the original algorithm paper (Amsalem et al., 2020) and in our manuscript (e.g., 3. Discussion - 3.2 Comparison to existing modeling software - ¶.6). In fact, to address this problem, we extended the functionality of neuron_reduce to allow for multiple levels of morphology reduction. Our motivation for integrating morphology reduction in the toolbox was to leverage the exploratory power of DendroTweaks to assess how different degrees of reduction alter cell integrative properties, determining which computations are preserved, which are lost, and at what specific reduction level these changes occur. Nevertheless, to address this comment, we've made it more explicit in the Discussion that reduction inevitably alters integrative properties and, at a certain level, leads to loss of dendritic computations.

      (8) Section 2.7

      (a) The validation process has two implicit assumptions:

      (i) There is only one value of physiological measurements that neurons and dendrites are endowed with. The heterogeneity in these measurements even within the same cell type is ignored. The users should be allowed to validate each measurement over a range rather than a single value. Users should be sensitized about the heterogeneity of physiological measurements.

      (ii) The validation process is largely akin to hand-tuning models where a one-to-one mapping of channels to measurements is assumed. For instance, input resistance can be altered by passive properties, by Ih, and by any channel that is active under resting conditions. Firing rate and patterns can be changed by pretty much every single ion channel that expresses along the somatodendritic axis.

      An updated validation process that respects physiological heterogeneities in measurements and accounts for global dependencies would be more appropriate. Please update these to account for heterogeneities and many-to-many mappings between channels and measurements. An ideal implementation would be to incorporate randomized search procedures (across channel parameters spanning neuron-to-neuron variability in channel conductances/gating properties) to find a population of models that satisfy all physiological constraints (including neuron-to-neuron variability in each physiological measurement), rather than reliance on procedures that are akin to hand-tuning models. Such population-based approaches are now common across morphologically-realistic models for different cell types (e.g., Rathour and Narayanan, PNAS, 2014; Basak and Narayanan, J Physiology, 2018; Migliore et al., PLoS Computational Biology, 2018; Basak and Narayanan, Brain Structure and Function, 2020; Roy and Narayanan, Neural Networks, 2021; Roy and Narayanan, J Physiology, 2023; Arnaudon et al., iScience, 2023; Reva et al., Patterns, 2023; Kumari and Narayanan, J Neurophysiology, 2024) and do away with the biases introduced by hand-tuning as well as the assumption of one-to-one mapping between channels and measurements.

      We appreciate the reviewer’s comment and the suggested alternatives to our validation process. We have extended the discussion on these alternative approaches (3. Discussion - 3.2 Comparison to existing modeling software - ¶.5). However, it is important to note that neither the one-value nor the one-to-one mapping assumption is imposed in our approach. It is true that validation is performed on a given model instance with fixed single-value parameters. However, users can discover heterogeneity and degeneracy in their models via interactive exploration. In the GUI, a given parameter can be changed, and the influence of this change on model output can be observed in real time. Validation can be run after each change to see whether the model output still falls within a biologically plausible regime or not. This is, of course, time-consuming and less efficient than any automated parameter optimization.

      However, and importantly, this is the niche of DendroTweaks. The approach we provide here can indeed be referred to as model hand-tuning. This is intentional: we aim to complement black-box optimization by exposing the relationship between parameters and model outputs. DendroTweaks is not aimed at automated parameter optimization and is not meant to provide the user with parameter ranges automatically. The built-in validation in DendroTweaks is intended as a lightweight, fast feedback tool to guide manual tuning of dendritic model parameters so as to enhance intuitive understanding and assess the plausibility of outputs, not as a substitute for comprehensive model validation or optimization. The latter can be done using existing frameworks, designed for this purpose, as mentioned by the reviewer. 

      (b) Users could be asked to wait for RMP to reach steady state. For instance, in some of the traces in Figure 7, the current injection is provided before RMP reaches steady-state. In the presence of slow channels (HCN or calcium-activated channels), the RMP can take a while to settle down. Users might be sensitized about this. This would also bring to attention the ability of several resting channels in modulating RMP, and the need to wait for steady-state before measurements are made.

      We agree with the observation and updated the validation process accordingly. We have added functionality for simulation stabilization, allowing users to pre-run a simulation before the main simulation time. For example, model.run(duration=1000, prerun_time=300) could be used to stabilize the model for a period of 300 ms before running the main simulation for 1 s.
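
      Why a pre-run matters can be shown with a toy model. The sketch below is not DendroTweaks code: `simulate`, its parameters, and the two-variable model (a leaky membrane coupled to one slow, HCN-like gating variable) are hypothetical and serve only to show that discarding an initial settling period removes the drift in resting potential that would otherwise contaminate measurements.

```python
import math


def simulate(duration_ms, prerun_ms=0.0, dt=0.1):
    """Forward-Euler integration; returns the voltage trace of the main window.

    The pre-run period is simulated but not recorded, so the recorded trace
    starts from a state that has had prerun_ms to settle.
    """
    v, w = -65.0, 0.0              # arbitrary initial state, not at rest
    e_leak, e_h = -70.0, -30.0     # reversal potentials (mV)
    tau_m, tau_w = 10.0, 100.0     # fast membrane and slow gating time constants (ms)
    g_h = 0.5                      # slow conductance relative to leak
    n_pre = int(round(prerun_ms / dt))
    n_main = int(round(duration_ms / dt))
    trace = []
    for i in range(n_pre + n_main):
        # Steady-state activation increases with hyperpolarization.
        w_inf = 1.0 / (1.0 + math.exp((v + 80.0) / 8.0))
        dv = (-(v - e_leak) - g_h * w * (v - e_h)) / tau_m
        dw = (w_inf - w) / tau_w
        v += dt * dv
        w += dt * dw
        if i >= n_pre:
            trace.append(v)
    return trace


if __name__ == "__main__":
    for prerun in (0.0, 500.0):
        trace = simulate(200.0, prerun_ms=prerun)
        drift = max(trace) - min(trace)
        print(f"prerun = {prerun:5.0f} ms, drift in main window = {drift:.4f} mV")
```

      In this toy case the slow variable needs several hundred milliseconds to settle, so the appropriate pre-run length depends on the slowest mechanism in the model.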

      (c) Strictly speaking, it is incorrect to obtain membrane time constant by fitting a single exponential to the initial part of the sag response (Figure 7A). This may be confirmed in the model by setting HCN to zero (strictly all active channel conductances to zero), obtaining the voltage-response to a pulse current, fitting a double exponential (as Rall showed, for a finite cable or for a real neuron, a single exponential would yield incorrect values for the tau) to the voltage response, and mapping membrane time constant to the slower of the two time-constants (in the double exponential fit). This value will be very different from what is obtained in Figure 7A. Please correct this, with references to Rall's original papers and to electrophysiological papers that use this process to assess membrane properties of neurons and their dendrites (e.g., Stuart and Spruston, J Neurosci, 1998; Golding and Spruston, J Physiology, 2005).

      We updated the algorithm for calculating the membrane time constant based on the reviewer's suggestions and added the suggested references. The time constant is now obtained in a model with blocked HCN channels (setting maximal conductance to 0) via a double exponential fit, taking the slowest component.
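
      The reason for taking the slowest component can be illustrated on synthetic data. The sketch below does not reproduce the toolbox's fitting routine: instead of a full double-exponential fit it uses the simpler exponential "peeling" idea, fitting a straight line to the logarithm of the late part of the transient, where the fast component has already decayed. The function name, the synthetic amplitudes, and the time constants (30 ms and 3 ms) are assumptions made for illustration.

```python
import math


def slow_time_constant(times_ms, v_mv, v_inf_mv, tail_start_ms):
    """Estimate the slowest time constant from the tail of a transient.

    Fits a least-squares line to log|V - V_inf| for t >= tail_start_ms;
    the time constant is -1 / slope.
    """
    xs, ys = [], []
    for t, v in zip(times_ms, v_mv):
        deviation = abs(v - v_inf_mv)
        if t >= tail_start_ms and deviation > 1e-9:
            xs.append(t)
            ys.append(math.log(deviation))
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    slope = (
        sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
        / sum((x - mean_x) ** 2 for x in xs)
    )
    return -1.0 / slope


def synthetic_response(times_ms, v_inf_mv=-75.0, tau_slow=30.0, tau_fast=3.0):
    """Two-exponential relaxation toward v_inf_mv (synthetic, noise-free)."""
    return [
        v_inf_mv + 4.0 * math.exp(-t / tau_slow) + 1.5 * math.exp(-t / tau_fast)
        for t in times_ms
    ]


if __name__ == "__main__":
    times = [0.1 * i for i in range(1501)]   # 0 to 150 ms
    voltage = synthetic_response(times)
    tau = slow_time_constant(times, voltage, v_inf_mv=-75.0, tail_start_ms=30.0)
    print(f"estimated slow time constant: {tau:.2f} ms")
```

      With noisy recordings a nonlinear two-exponential fit is generally more robust than this log-linear shortcut, which is sensitive to the choice of steady-state voltage and tail window.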

      (9) Section 3

      (a) May be good to emphasize the many-to-many mapping between ion channels and neuronal functions here in detail, and on how to explore this within the DendroTweaks framework.

      We have added a paragraph in the Discussion that addresses both the problems of heterogeneity and degeneracy in biological neurons and neuronal models (3. Discussion - 3.3 Limitations and future directions - ¶.3).

      (b) May be good to have a specific section either here or in results about how the different reduced models can actually be incorporated towards building a network.

      As mentioned earlier, building a network of reduced models is a promising new direction. However, it is beyond the scope of this manuscript, whose primary goal is to introduce DendroTweaks and highlight its capabilities. DendroTweaks is designed for single-cell modeling and provides export capabilities that allow integrating it into broader workflows, including network modeling. We have added a paragraph in the manuscript (3. Discussion - 3.1 Conceptual and implementational accessibility - ¶.2) that addresses how DendroTweaks could be used alongside other software, in particular for scaling up single-cell models to the network level.

      (10) Section 4

      (a) Section 4.3: In the second sentence (line 568), the "first Kirchhoff's law" within parentheses immediately after Q=CV gives an illusion that Q=CV is the first Kirchhoff's law! Please state that this is with reference to the algebraic sum of currents at a node.

      We have corrected the equations and apologize for this oversight. 

      (b) Table 1: In the presence of active ion channels, input resistance, membrane time constant, and voltage attenuation are not passive properties. Input resistance is affected by any active channel that is active at rest (HCN, Kir, A-type K+ through the window current, etc). The same holds for membrane time constant and voltage attenuation as well. This could be made clear by stating if these measurements are obtained in the presence or absence of active ion channels. In real neurons, all these measurements are affected by active ion channels; so, ideally, these are also active properties, not passive! Also, please mention that in the presence of resonating channels (e.g., HCN, M-type K+), a single exponential fit won't be appropriate to obtain tau, given the presence of sag.

      We thank the reviewer for pointing out this ambiguity. What the term “Passive” means in Table 1 (e.g., for the input resistance, R_in) is that the minimal set of parameters needed to validate R_in are the passive ones (i.e., Cm, Ra, and Leak). We have changed the table listing to reflect this.

      Reviewer #2 (Recommendations for the authors):

      (1) Figure 2B and the caption to Figure 2F show and describe the diameter of the sections, whereas the image in Figure 2F shows the radius. Which is the correct one?

      The reason for this is that Figure 2B shows the sections' geometry as it is represented in NEURON, i.e., with diameters, while Figure 2F shows the geometry as it is represented in an SWC file (as these changes are made based on the SWC file). Nevertheless, as mentioned earlier, we decided to remove panel F from the figure in the new version, to present a more important panel on tree graph representations.

      (2) "Each segment can be viewed as an equivalent RC circuit representing a part of the membrane". The example in Figure 2B is perhaps a relatively simple case. For more complex cases where multiple nonlinear conductances are present in each section, would it be possible to show each of these conductances explicitly? If yes, it would be nice to illustrate that.

      We would like to clarify that "can be viewed" here was intended to mean "can be considered," and we have updated the text accordingly. The schematic RC circuits were added to the corresponding figure for illustration purposes only and are not present in the GUI, as this would indeed be impractical for multiple conductances.

      (3) Some extra citations could be added. For example, it is a little strange that BRIAN2 is mentioned, but NEST is not. It might be worth mentioning and citing it. Also, the Allen Cell Types Database is mentioned, but no citation for it is given. It could be useful to add such citations (https://doi.org/10.1038/s41593-019-0417-0, https://doi.org/10.1038/s41467-017-02718-3).

      Brian 2 is extensively used in our lab on its own and as a foundation of the Dendrify library (Pagkalos et al., 2023). As stated in the discussion, we are considering bridging reduced Hodgkin-Huxley-type models to Dendrify leaky integrate-and-fire type models. For these reasons, Brian 2 is mentioned in the discussion. However, we acknowledge that our previous overview omitted references to some key software, which have now been added to the updated manuscript. We appreciate the reviewer providing references that we had overlooked.

      (3) Pagkalos, M., Chavlis, S. & Poirazi, P. Introducing the Dendrify framework for incorporating dendrites to spiking neural networks. Nat Commun 14, 131 (2023). https://doi.org/10.1038/s41467-022-35747-8

    1. Author response:

      The following is the authors’ response to the original reviews.

      Reviewing Editor Comments:

      The study design used reversal learning (i.e. the CS+ becomes the CS- and vice versa), while the title mentions 'fear learning and extinction'. In my opinion, the paper does not provide insight into extinction and the title should be changed.

      Thank you for this important point. We agree that our paradigm focuses more directly on reversal learning than on standard extinction, as the test phases represent extinction in the absence of a US but follow a reversal phase. To better reflect the core of our investigation, we have changed the title.

      Proposed change in manuscript (Title): Original Title: Distinct representational properties of cues and contexts shape fear learning and extinction 

      New Title: Distinct representational properties of cues and contexts shape fear and reversal learning

      Secondly, the design uses 'trace conditioning', whereas the neuroscientific research and synaptic/memory models are rather based on 'delay conditioning'. However, given the limitations of this design, it would still be possible to make the implications of this paper relevant to other areas, such as declarative memory research.

      This is an excellent point, and we thank you for highlighting it. Our design, where a temporal gap exists between the CS offset and US onset, is indeed a form of trace conditioning. We also agree that this feature, particularly given the known role of the hippocampus in trace conditioning, strengthens the link between our findings and the broader field of episodic memory.

      Proposed change in manuscript (Methods, Section "General procedure and stimuli"): We inserted the following text (lines 218-220): "It is important to note that the temporal gap between the CS offset and potential US delivery (see Figure 1A) indicates that our paradigm employs a trace conditioning design. This form of learning is known to be hippocampus-dependent and has been distinguished from delay conditioning."

      Proposed change in manuscript (Discussion): We added the following to the discussion (lines 774-779): "Furthermore, our use of a trace conditioning paradigm, which is known to engage the hippocampus more than delay conditioning does, may have facilitated the detection of item-specific, episodic-like memory traces and their interaction with context. This strengthens the relevance of our findings for understanding the interplay between aversive learning and mechanisms of episodic memory."

      The strength of the evidence at this point would be described as 'solid'. In order to increase the strength (to convincing), analyses including FWE correction would be necessary. I think exploratory (and perhaps some FDR-based) analyses have their valued place in papers, but I agree that these should be reported as such. The issue of testing multiple independent hypotheses also needs to be addressed to increase the strength of evidence (to convincing). Evaluating the design with 4 cues could lead to false positives if, for example, current valence, i.e. (CS++ and CS-+) > (CS+- and CS--), and past valence (CS++ > CS+-) > (CS-+ > CS--) are tested as independent tests within the same data set. Authors need to adjust their alpha threshold.

      We fully agree. As summarized in our general response, we have implemented two major changes to our statistical approach to address these concerns comprehensively. These, as stated above, are the following:

      (1) Correction for Multiple Hypotheses: We previously used FWER-corrected p-values that were obtained through permutation testing. We have now applied a Bonferroni adjustment to the FWER-corrected threshold (previously 0.05) used in our searchlight analyses. For instance, in the acquisition phase, since 2 independent tests (contrasts) were conducted, the significance threshold of each of these searchlight maps was set to p < 0.025 (after FWE-correction estimated through non-parametric permutation testing); in reversal, 4 tests were conducted, hence the significance threshold was set to p < 0.0125. This change is now clearly described in the Methods section (section “Searchlight approach”, lines 477-484). This change had no impact on our searchlight results, given that all clusters that were previously significant at the FWER alpha of 0.05 were also significant at the new, Bonferroni-adjusted thresholds; we also now report the cluster-specific corrected p-values in the cluster tables in Supplementary Material.

      (2) ROI Analyses: Our ROI-based analyses used FDR-based correction within each item reinstatement/generalized reinstatement pair of each ROI. We now explicitly state in the abstract, methods and results sections that these ROI-based analyses are exploratory and secondary to the primary whole-brain results, given that the correction method used is more liberal, in accordance with the exploratory character of these analyses.

      We are confident that these changes ensure both the robustness and transparency of our reported findings.
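
      As a minimal sketch of the threshold adjustment described in point (1), the Bonferroni-adjusted alpha is the default alpha divided by the number of contrasts tested in a phase. The function name and the cluster p-values below are hypothetical placeholders for illustration, not code or values from the study.

```python
def bonferroni_threshold(alpha, n_tests):
    """Return the per-test significance threshold: alpha / n_tests."""
    if n_tests < 1:
        raise ValueError("n_tests must be at least 1")
    return alpha / n_tests


# Number of contrasts per phase, as stated in the response above.
contrasts_per_phase = {"acquisition": 2, "reversal": 4}
thresholds = {
    phase: bonferroni_threshold(0.05, n) for phase, n in contrasts_per_phase.items()
}

# Hypothetical FWER-corrected cluster p-values, for illustration only.
example_cluster_p = {"acquisition": [0.001, 0.030], "reversal": [0.004, 0.020]}
surviving = {
    phase: [p for p in p_values if p < thresholds[phase]]
    for phase, p_values in example_cluster_p.items()
}
```

      With these placeholder values, a cluster at p = 0.030 would have passed the original 0.05 threshold but not the adjusted acquisition threshold of 0.025.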

      Reviewer #1 (Public Review):

      (1) I had a difficult time unpacking lines 419-420: "item stability represents the similarity of the neural representation of an item to other representations of this same item."

      We thank the reviewer for pointing out this lack of clarity. We have revised the definition to be more intuitive and have ensured it is introduced earlier in the manuscript.

      Proposed change in manuscript (Introduction, lines 144-150): We introduced the concept earlier and more clearly: "Furthermore, we can measure the consistency of a neural pattern for a given item across multiple presentations. This metric, which we refer to as “item stability”, quantifies how consistently a specific stimulus (e.g., the image of a kettle) is represented in the brain across multiple repetitions of the same item. Higher item stability has been linked to successful episodic memory encoding (Xue et al., 2010)."

      Proposed change in manuscript (Methods, Section "Item stability and generalization of cues"): Original text: "Thus, item stability represents the similarity of the neural representation of an item to other representations of this same item (Xue, 2018), or the consistency of neural activity across repetitions (Sommer et al., 2022)."

      Revised text (lines 434-436): "Item stability is defined as the average similarity of neural patterns elicited by multiple presentations of the same item (e.g., the kettle). It therefore measures the consistency of an item's neural representation across repeated encounters."
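
      To make the revised definition concrete, the following sketch computes item stability as the mean pairwise similarity between the patterns evoked by repeated presentations of one item. Pearson correlation is assumed as the similarity measure, and the function names and toy patterns are hypothetical; the manuscript's actual similarity metric and preprocessing may differ.

```python
import math
from itertools import combinations


def pearson(x, y):
    """Pearson correlation between two equal-length activity patterns."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    ss_x = sum((a - mean_x) ** 2 for a in x)
    ss_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(ss_x * ss_y)


def item_stability(repetitions):
    """Average similarity over all pairs of presentations of the same item."""
    pairs = list(combinations(repetitions, 2))
    if not pairs:
        raise ValueError("need at least two presentations of the item")
    return sum(pearson(a, b) for a, b in pairs) / len(pairs)


# Toy voxel patterns for three presentations of one item (e.g., the kettle).
kettle_repetitions = [
    [1.0, 2.0, 3.0, 4.0],
    [2.0, 4.0, 6.0, 8.0],
    [1.5, 2.5, 3.5, 4.5],
]
stability = item_stability(kettle_repetitions)
```

      Because the toy patterns are linear transforms of one another, their pairwise correlations are all 1, which corresponds to a maximally stable representation under this assumed metric.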

      (2) The authors use the phrase "representational geometry" several times in the paper without clearly defining what they mean by this.

      We apologize for this omission. We have now added a clear and concise definition of "representational geometry" in the Introduction, citing the foundational work by Kriegeskorte et al. (2008).

      Proposed change in manuscript (Introduction): We inserted the following text (lines 117-125): " By contrast, multivariate pattern analyses (MVPA), such as representational similarity analysis (RSA; Kriegeskorte et al., 2008) has emerged as a powerful tool to investigate the content and structure of these representations (e.g., Hennings et al., 2022). This approach allows us to characterize the “representational geometry” of a set of items – that is, the structure of similarities and dissimilarities between their associated neural activity patterns. This geometry reveals how the brain organizes information, for instance, by clustering items that are conceptually similar while separating those that are distinct."

      (3) The abstract is quite dense and will likely be challenging to decipher for those without a specialized knowledge of both the topic (fear conditioning) and the analytical approach. For instance, the goal of the study is clearly articulated in the first few sentences, but then suddenly jumps to a sentence stating "our data show that contingency changes during reversal induce memory traces with distinct representational geometries characterized by stable activity patterns across repetitions..." this would be challenging for a reader to grok without having a clear understanding of the complex analytical approach used in the paper.

      We agree with your assessment. We have rewritten it to be more accessible to a general scientific audience, by focusing on the conceptual findings rather than methodological jargon.

      Proposed change in manuscript (Abstract): We revised the abstract to be clearer. It now reads: " When we learn that something is dangerous, a fear memory is formed. However, this memory is not fixed and can be updated through new experiences, such as learning that the threat is no longer present. This process of updating, known as extinction or reversal learning, is highly dependent on the context in which it occurs. How the brain represents cues, contexts, and their changing threat value remains a major question. Here, we used functional magnetic resonance imaging and a novel fear learning paradigm to track the neural representations of stimuli across fear acquisition, reversal, and test phases. We found that initial fear learning creates generalized neural representations for all threatening cues in the brain’s fear network. During reversal learning, when threat contingencies switched for some of the cues, two distinct representational strategies were observed. On the one hand, we still identified generalized patterns for currently threatening cues, whereas on the other hand, we observed highly stable representations of individual cues (i.e., item-specific) that changed their valence, particularly in the precuneus and prefrontal cortex. Furthermore, we observed that the brain represents contexts more distinctly during reversal learning. Furthermore, additional exploratory analyses showed that the degree of this context specificity in the prefrontal cortex predicted the subsequent return of fear, providing a potential neural mechanism for fear renewal. Our findings reveal that the brain uses a flexible combination of generalized and specific representations to adapt to a changing world, shedding new light on the mechanisms that support cognitive flexibility and the treatment of anxiety disorders via exposure therapy."

      (4) Minor: I believe it is STM200 not the STM2000.

      Thank you for pointing this out. We have corrected it in the Methods section.

      Proposed change in manuscript (Methods, Page 5, Line 211): Original: STM2000 -> Corrected: STM200

      (5) Line 146: "...could be particularly fruitful as a means to study the influence of fear reversal or extinction on context representations, which have never been analyzed in previous fear and extinction learning studies." I direct the authors to Hennings et al., 2020, Contextual reinstatement promotes extinction generalization in healthy adults but not PTSD, as an example of using MVPA to decipher reinstatement of the extinction context during test.

      Thank you for pointing us towards this relevant work. We have revised the sentence to reflect the state of the literature more accurately.

      Proposed change in manuscript (Introduction, Page 3): Original text: "...which have never been analyzed in previous fear and extinction learning studies." 

      Revised text (lines 154-157): "...which, despite some notable exceptions (e.g., Hennings et al., 2020), have been less systematically investigated than cue representations across different learning stages."

      (6) This is a methodological/conceptual point, but it appears from Figure 1 that the shock occurs 2.5 seconds after the CS (and context) goes off the screen. This would seem to be more like a trace conditioning procedure than a standard delay fear conditioning procedure. This could be a trivial point, but there have been numerous studies over the last several decades comparing differences between these two forms of fear acquisition, both behaviorally and neurally, including differences in how trace vs delay conditioning is extinguished.

      Thank you for this pertinent observation; this was also pointed out by the editor. As detailed in our response to the editor, we now explicitly acknowledge that our paradigm uses a trace conditioning design, and have added statements to this effect in the Methods and Discussion sections (lines 218-220, and 774-779).

      (7) In Figure 4, it would help to see the individual data points derived from the model used to test significance between the different conditions (reinstatement between Acq, reversal, and test-new).

      We agree that this would improve the transparency of our results. We have revised Figure 4 to include individual data points, which are now plotted over the bar graphs. 

      Reviewer #2 (Public Review & Recommendations)

      Use a more stringent method of multiple comparison correction: voxel-wise FWE instead of FDR; Holm-Bonferroni across multiple hypothesis tests. If FDR is chosen then the exploratory character of the results should be transparently reported in the abstract.

      Thank you for these critical comments regarding our statistical methods. As detailed in the general response and response to the editor (Comment 3), we have thoroughly revised our approach to ensure its rigor. We now clarify that our whole-brain analyses consistently use FWER-corrected p-values. Additionally, these FWER-corrected p-values (obtained through permutation testing), which were previously evaluated against a default threshold of 0.05, are now compared with a Bonferroni-adjusted threshold, i.e., 0.05 divided by the number of tested contrasts in each experimental phase. We have modified the revised manuscript accordingly, in the methods section (lines 473-484) and in the supplementary material, where we added the p-values (FWER-corrected) of each cluster, evaluated against the new Bonferroni-adjusted thresholds. Of note, this had no impact on our searchlight results, given that all clusters that were previously reported as significant with the alpha threshold of 0.05 were also significant at the new, corrected thresholds.

      Proposed change in manuscript (Methods): We revised the relevant paragraphs (lines 473-484): "Significance corresponding to the contrast between conditions of the maps of interest was FWER-corrected using nonparametric permutation testing at the cluster level (10,000 permutations) to estimate significant cluster size. Additionally, we adjusted the alpha threshold against which we assessed the significance of the cluster-specific FWER-corrected p-values using Bonferroni correction. To this end, we divided the default corrected alpha threshold of 0.05 by the number of statistical comparisons that were conducted in each experimental phase. For example, for fear acquisition, we compared the CS+>CS- contrast for both item stability and cue generalization, resulting in 2 comparisons and hence a corrected alpha threshold of 0.025. Only clusters that had a FWER-corrected p-value below the Bonferroni-adjusted threshold were deemed significant. All searchlight analyses were restricted within a gray matter mask."
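
      The general logic of permutation-based FWER correction can be illustrated with a simplified sketch. It uses a sign-flip max-statistic test over a handful of locations rather than the cluster-level procedure quoted above, runs far fewer permutations, and operates on made-up data; all names are hypothetical and the code is not the study's pipeline.

```python
import random


def max_stat_permutation_pvalues(diffs, n_perm=2000, seed=0):
    """One-sided FWER-corrected p-values from a sign-flip max-statistic test.

    diffs[s][j] is the condition difference (e.g., CS+ minus CS-) for
    subject s at location j.
    """
    rng = random.Random(seed)
    n_sub, n_loc = len(diffs), len(diffs[0])

    def group_means(signs):
        return [
            sum(sign * row[j] for sign, row in zip(signs, diffs)) / n_sub
            for j in range(n_loc)
        ]

    observed = group_means([1] * n_sub)
    exceed = [0] * n_loc
    for _ in range(n_perm):
        # Under the null, the sign of each subject's difference is arbitrary.
        signs = [rng.choice((-1, 1)) for _ in range(n_sub)]
        null_max = max(group_means(signs))
        for j in range(n_loc):
            if null_max >= observed[j]:
                exceed[j] += 1
    # Add one to numerator and denominator so p is never exactly zero.
    return [(count + 1) / (n_perm + 1) for count in exceed]


# Toy data: 12 subjects, 3 locations; only location 0 carries an effect.
toy_diffs = [
    [1.0 + 0.1 * (s % 3), 0.5 * (-1) ** s, -0.3 * (-1) ** s] for s in range(12)
]
p_corrected = max_stat_permutation_pvalues(toy_diffs)

# Compare against a Bonferroni-adjusted threshold for 2 contrasts.
flags = [p < 0.05 / 2 for p in p_corrected]
```

      Comparing each observed statistic with the distribution of the maximum across locations is what controls the family-wise error rate within a map; the division by the number of contrasts then accounts for testing several maps in the same phase.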

      The authors report fMRI results from line 96 onwards; all of these refer exclusively to mass-univariate fMRI which could be mentioned more transparently... The authors contrast "activation fMRI" with "RSA" (line 112). Again, I would suggest mentioning "mass-univariate fMRI", and contrasting this with "multivariate" fMRI, of which RSA is just one flavour. For example, there is some work that is clear and replicable, demonstrating human amygdala involvement in fear conditioning using SVM-based analysis of high-resolution amygdala signals (one paper is currently cited in the discussion).

      Thank you for this important clarification. We have revised the manuscript to incorporate your suggestions. We now introduce our initial analyses as "mass-univariate" and contrast them with the "multivariate pattern analysis" (MVPA) approach of RSA.

      Proposed change in manuscript (Introduction): We revised the relevant paragraphs (lines 113-125): " While mass-univariate functional magnetic resonance imaging (fMRI) activation studies have been instrumental in identifying the brain regions involved in fear learning and extinction, they are insensitive to the patterns of neural activity that underlie the stimulus-specific representations of threat cues and contexts. Contrastingly, multivariate pattern analyses methods, such as representational similarity analysis (RSA; Kriegeskorte et al., 2008), have emerged as a powerful tool to investigate the content and structure of these representations (e.g., Hennings et al., 2022). This approach allows us to characterize the “representational geometry” of a set of items – i.e., the structure of similarities and dissimilarities between their associated neural activity patterns. This geometry reveals how the brain organizes information, for instance, by clustering items that are conceptually similar while separating those that are distinct.”
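
      The notion of representational geometry can be illustrated with a small sketch that builds a representational dissimilarity matrix from activity patterns. Correlation distance (1 minus Pearson r) is assumed as the dissimilarity measure, and the item names and patterns are invented for illustration.

```python
import math


def pearson(x, y):
    """Pearson correlation between two equal-length activity patterns."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    ss_x = sum((a - mean_x) ** 2 for a in x)
    ss_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(ss_x * ss_y)


def dissimilarity_matrix(patterns):
    """Map each pair of items to 1 - r between their activity patterns."""
    names = list(patterns)
    return {
        a: {b: 1.0 - pearson(patterns[a], patterns[b]) for b in names}
        for a in names
    }


# Hypothetical patterns: two similar items and one distinct item.
toy_patterns = {
    "kettle_a": [1.0, 2.0, 3.0, 4.0],
    "kettle_b": [2.0, 4.0, 6.0, 9.0],
    "lamp": [4.0, 1.0, 3.0, 2.0],
}
rdm = dissimilarity_matrix(toy_patterns)
```

      In this toy matrix the two similar items lie close together (small dissimilarity) while the third sits far from both, which is the kind of structure the term "geometry" refers to.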

      Line 177: unclear how incomplete data was dealt with. If there are 30 subjects and 9 incomplete data sets, then how do they end up with 24 in the final sample?

      We apologize for the unclear wording in our original manuscript. We have clarified the participant exclusion pipeline in the Methods section.

      Proposed change in manuscript (Methods, Section "Participants"): Original text: "The number of participants with usable fMRI data for each phase was as follows: N = 30 for the first phase of day one, N = 29 for the second phase of day one, N = 27 for the first phase of day two, and N = 26 for the second phase of day two. Of the 30 participants who completed the first session, four did not return for the second day and thus had incomplete data across the four experimental phases. An additional two participants were excluded from the analysis due to excessive head movement (>2.5 mm in any direction). This resulted in a final sample of 24 participants (8 males) between 18 and 32 years of age (mean: 24.69 years, standard deviation: 3.6) with complete, low-motion fMRI data for all analyses." 

      Revised text: "The number of participants with usable fMRI data for each phase was as follows: N = 30 for the first phase of day one, N = 29 for the second phase of day one, N = 27 for the first phase of day two, and N = 26 for the second phase of day two. An additional two participants were excluded from the analysis due to excessive head movement (>2.5 mm in any direction). This resulted in a final sample of 24 participants (8 males) between 18 and 32 years of age (mean: 24.69 years, standard deviation: 3.6) with complete, low-motion fMRI data for all analyses."

      Typo in line 201.  

      Thank you for your comment. We have re-examined line 201 (“interval (Figure 1A). A total of eight CSs were presented during each phase and”) and the surrounding text but were unable to identify a clear typographical error in the provided quote. However, in the process of revising the manuscript for clarity, we have rephrased this section.

      it would be good to see all details of the US calibration procedure, and the physical details of the electric shock (e.g. duration, ...).

      Thank you for your comment. We have expanded the Methods section to include these important details.

      Proposed change in manuscript (Methods, Section "General procedure and stimuli"): We inserted the following text (lines 225-230): "Electrical stimulation was delivered via two Ag/AgCl electrodes attached to the distal phalanx of the index and middle fingers of the non-dominant hand. The intensity of the electrical stimulation was calibrated individually for each participant prior to the experiment. Using a stepping procedure, the voltage was gradually increased until the participant rated the sensation as 'unpleasant but not painful'."

      "beta series modelling" is a jargon term used in some neuroimaging software but not others. In essence, the authors use trial-by-trial BOLD response amplitude estimates in their model. Also, I don't think this requires justification - using the raw BOLD signal would seem outdated for at least 15 years.

      Thank you for this helpful suggestion. We have simplified the relevant sentences for improved clarity.

      Proposed change in manuscript (Methods, Section "RSA"): Original text: "...an approach known as beta-series modeling (Rissman et al., 2004; Turner et al., 2012)." 

      Revised text (lines 391-393): "...an approach that allows for the estimation of trial-by-trial BOLD response amplitudes, often referred to as beta-series modeling (Rissman et al., 2004). Specifically, we used a Least Square Separate (LSS) approach..."

      I found the use of "Pavlovian trace" a bit confusing. The authors are coming from memory research where "memory trace" is often used; however, in associative learning the term "trace conditioning" means something else. Perhaps this can be explained upon first occurrence, and "memory trace" instead of "Pavlovian trace" might be more common.

      We are grateful for this comment, as it highlights a critical point of potential confusion, especially given that we now acknowledge our paradigm uses a trace conditioning design. To eliminate this ambiguity, we have replaced all instances of "Pavlovian trace" with "lingering fear memory trace" throughout the manuscript (lines 542 and 599).

      I would suggest removing evaluative statements from the results (repeated use of "interesting").

      Thank you for this valuable suggestion. We have reviewed the Results section and removed subjective evaluative words to maintain a more objective tone. 

      Line 882: one of these references refers to a multivariate BOLD analysis using SVM, not explicitly using temporal information in the signal (although they do show session-by-session information).

      Thank you for this correction. We have re-examined the cited paper (Bach et al., 2011) and removed its inclusion in the text accordingly.

    1. Author response:

      The following is the authors’ response to the original reviews.

      Reviewer #1 (Public review):

      Summary:

      The study explores the use of Transport-based morphometry (TBM) to predict hematoma expansion and growth 24 hours post-event, leveraging Non-Contrast Computed Tomography (NCCT) scans combined with clinical and location-based information. The research holds significant clinical potential, as it could enable early intervention for patients at high risk of hematoma expansion, thereby improving outcomes. The study is well-structured, with detailed methodological descriptions and a clear presentation of results. However, the practical utility of the predictive tool requires further validation, as the current findings are based on retrospective data. Additionally, the impact of this tool on clinical decision-making and patient outcomes needs to be further investigated.

      Strengths:

      (1) Clinical Relevance: The study addresses a critical need in clinical practice by providing a tool that could enhance diagnostic accuracy and guide early interventions, potentially improving patient outcomes.

      (2) Feature Visualization: The visualization and interpretation of features associated with hematoma expansion risk are highly valuable for clinicians, aiding in the understanding of model-derived insights and facilitating clinical application.

      (3) Methodological Rigor: The study provides a thorough description of methods, results, and discussions, ensuring transparency and reproducibility.

      Weaknesses:

      (1) The limited sample size in this study raises concerns about potential model overfitting. While the reported AUROC of 0.71 may be acceptable for clinical use, the robustness of the model could be further enhanced by employing techniques such as k-fold cross-validation. This approach, which aggregates predictive results across multiple folds, mimics the consensus of diagnoses from multiple clinicians and could improve the model's reliability for clinical application. Additionally, in clinical practice, the utility of the model may depend on specific conditions, such as achieving high specificity to identify patients at risk of hematoma expansion, thereby enabling timely interventions. Consequently, while AUC is a commonly used metric, it may not fully capture the model's clinical applicability. The authors should consider discussing alternative performance metrics, such as specificity and sensitivity, which are more aligned with clinical needs. Furthermore, evaluating the model's performance in real-world clinical scenarios would provide valuable insights into its practical utility and potential impact on patient outcomes.

      We thank the reviewer for these thoughtful comments. We agree that k-fold cross validation is a valid approach to reduce bias associated with overfitting and account for variability in the dataset composition. During the training and optimization process, this was employed within the VISTA dataset, where data were shuffled at random and separated into independent training (60%) and internal validation (40%) datasets. This process was repeated 1000 times to generate 1000 different training and internal validation splits. Statistical analyses and data visualization were performed independently on each of the 1000 cross-validation samples, and the mean results with corresponding 95% confidence intervals are presented. The p-values were combined using Fisher’s method. We have included this information in the methods section. [Page 22; Paragraph 1, Lines 8-10]. External validation was performed on the ERICH dataset and analyzed only once. We chose not to perform k-fold cross validation with the test dataset in an attempt to assess the model’s generalizability to unseen data from a different patient cohort. However, we agree that taking advantage of the full 1,066 ERICH cases for model validation would improve the strength of our conclusions regarding the model’s robustness. This has been included in the discussion. [Page 15; Paragraph 1; Lines 11-14].
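
      The resampling scheme described above can be sketched as repeated random 60/40 splitting, followed by a summary of a per-split metric and a Fisher combination of per-split p-values. This is an illustrative sketch only: the labels and the metric are placeholders, the percentile interval is one possible way to obtain a 95% interval, and Fisher's method is derived for independent tests, which the sketch does not check.

```python
import math
import random


def repeated_splits(n_items, train_frac=0.6, n_repeats=1000, seed=0):
    """Yield (train_idx, val_idx) pairs from repeated random shuffles."""
    rng = random.Random(seed)
    n_train = round(n_items * train_frac)
    for _ in range(n_repeats):
        idx = list(range(n_items))
        rng.shuffle(idx)
        yield idx[:n_train], idx[n_train:]


def mean_and_percentile_interval(values, level=0.95):
    """Mean plus a simple percentile interval over the resampled values."""
    ordered = sorted(values)
    n = len(ordered)
    tail = (1.0 - level) / 2.0
    lower = ordered[int(math.floor(tail * (n - 1)))]
    upper = ordered[int(math.ceil((1.0 - tail) * (n - 1)))]
    return sum(ordered) / n, lower, upper


def fisher_combined_p(p_values):
    """Fisher's method: X = -2 * sum(ln p) follows chi-square with 2k df."""
    k = len(p_values)
    half_stat = -sum(math.log(p) for p in p_values)  # equals X / 2
    if half_stat == 0.0:
        return 1.0
    # Chi-square survival function for even df, evaluated in log space.
    log_terms = [
        i * math.log(half_stat) - math.lgamma(i + 1) - half_stat for i in range(k)
    ]
    peak = max(log_terms)
    return min(1.0, math.exp(peak) * sum(math.exp(t - peak) for t in log_terms))


# Placeholder labels and a placeholder metric (validation-set prevalence).
toy_labels = [1] * 8 + [0] * 12
per_split_metric = [
    sum(toy_labels[i] for i in val_idx) / len(val_idx)
    for _, val_idx in repeated_splits(len(toy_labels))
]
summary = mean_and_percentile_interval(per_split_metric)
```

      In a real analysis, the placeholder metric would be replaced by the model's validation performance on each split, and the per-split p-values would be passed to `fisher_combined_p`.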

      We agree that the AUC alone will not effectively describe the clinical applicability of the intended model. We have added the sensitivity and specificity metrics for the TBM’s performance in the external dataset to the discussion. The design of the present study was primarily a pre-clinical methodological study. However, we have suggested that future external validation studies should seek to identify ideal sensitivity and specificity thresholds when evaluating the model’s translatability to a clinical setting. [Page 11; Paragraph 2; Line 22 and Page 12; Paragraph 1; Lines 2-4]. We agree that future validation studies should also assess the model’s performance in a real-world clinical setting and have emphasized this point in the discussion. [Page 13; Paragraph 2; Lines 22-23 and Page 14; Paragraph 1; Lines 1-4].
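
      For reference, sensitivity and specificity at a given decision threshold can be computed from predicted scores as in the sketch below. The scores, labels, and the 0.5 threshold are invented for illustration and are not results from the TBM model.

```python
def sensitivity_specificity(labels, scores, threshold):
    """Return (sensitivity, specificity) for binary labels at a score threshold."""
    predictions = [1 if s >= threshold else 0 for s in scores]
    tp = sum(1 for y, p in zip(labels, predictions) if y == 1 and p == 1)
    fn = sum(1 for y, p in zip(labels, predictions) if y == 1 and p == 0)
    tn = sum(1 for y, p in zip(labels, predictions) if y == 0 and p == 0)
    fp = sum(1 for y, p in zip(labels, predictions) if y == 0 and p == 1)
    return tp / (tp + fn), tn / (tn + fp)


# Hypothetical expansion labels (1 = expansion) and model scores.
toy_labels = [1, 1, 1, 0, 0, 0, 0, 1]
toy_scores = [0.9, 0.8, 0.4, 0.3, 0.6, 0.1, 0.2, 0.7]
sens, spec = sensitivity_specificity(toy_labels, toy_scores, threshold=0.5)
```

      Raising the threshold trades sensitivity for specificity, which is why the choice of operating point depends on the clinical use case rather than on the AUC alone.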

      (2) The authors compared the performance of TBM with clinical and location-based information, as well as other machine learning methods. While this comparison highlights the relative strengths of TBM, the study would benefit from providing concrete evidence on how this tool could enhance clinicians' ability to assess hematoma expansion in practice. For instance, it remains unclear whether integrating the model's output with a clinician's own assessment would lead to improved diagnostic accuracy or decision-making. Investigating this aspect, such as through studies evaluating the combined performance of clinician judgment and model predictions, could significantly enhance the tool's practical value.

      We thank the reviewer for this suggestion. The present study intended to suggest potential advantages of the TBM method with comparison to alternate clinician-based and machine learning methods. While we agree that the TBM method warrants further evaluation in a real-world clinical setting to determine its practical utility, we propose that further optimization of TBM is first needed to improve its predictive accuracy.

      In developing TBM, our eventual goal is to produce a prediction tool, which can identify patients at risk for hematoma expansion early in the disease course, who may benefit from intervention with surgical and/or medical therapies. Current clinician-based risk stratification methods are highly variable in accuracy, inefficient, and require subjective interpretation of the NCCT scan. Our eventual goal is to aid clinical decision making with an automated, accurate and efficient model. In follow up work, we will study how to combine information derived from imaging and TBM with other assessment tools and clinical data in order to best inform clinicians. This has been incorporated into the discussion. [Page 14; Paragraph 1; Lines 1-4].

      Reviewer #2 (Public review):

      Summary:

      The author presents a transport-based morphometry (TBM) approach for the discovery of non-contrast computed tomography (NCCT) markers of hematoma expansion risk in spontaneous intracerebral hemorrhage (ICH) patients. The findings demonstrate that TBM can quantify hematoma morphological features and outperforms existing clinical scoring systems in predicting 24-hour hematoma expansion. In addition, the inversion model can visualize features, which makes it interpretable. In conclusion, this research has clinical potential for ICH risk stratification, improving the precision of early interventions.

      Strengths:

      TBM quantifies hematoma morphological changes using the Wasserstein distance, which has a well-defined physical meaning. It identifies features that are difficult to detect through conventional visual inspection (such as peripheral density distribution and density heterogeneity), which provides evidence supporting the "avalanche effect" hypothesis in hematoma expansion pathophysiology.
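
      For intuition only: in one dimension, the Wasserstein-1 distance between two equally sized samples reduces to the mean absolute difference between their sorted values. The sketch below illustrates this special case on made-up intensity values; it is not the image-based optimal transport computation used in TBM, and the function name is hypothetical.

```python
def wasserstein_1d(xs, ys):
    """Wasserstein-1 distance between two equal-size 1D samples."""
    if len(xs) != len(ys) or not xs:
        raise ValueError("samples must be non-empty and of equal size")
    # Optimal 1D transport matches the i-th smallest x to the i-th smallest y.
    return sum(abs(a - b) for a, b in zip(sorted(xs), sorted(ys))) / len(xs)


# Made-up intensity samples: the second is the first shifted by 5 units.
sample_a = [0.0, 1.0, 3.0]
sample_b = [8.0, 5.0, 6.0]
distance = wasserstein_1d(sample_a, sample_b)
```

      The distance can be read as the average amount by which mass must be moved to turn one distribution into the other, which is the physical interpretation referred to above.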

      Weaknesses:

      (1) As a methodology-focused study, the description of the methods section somewhat lacks depth and focus, which may make it challenging for readers to fully grasp the overall structure and workflow of the approach. For instance, the manuscript lacks a systematic overview of the entire process, from NCCT image input to the final prediction output. A potential improvement would be to include a workflow figure at the beginning of the manuscript, summarizing the proposed method and subsequent analytical procedures. This would help readers better understand the mechanism of the model.

      We thank the reviewer for this suggestion. We have included a figure detailing the TBM workflow to improve reader understanding. [Figure 1, Page 5; Paragraph 2; Lines 19-20 and Page 30; Paragraph 1].

      (2) The description of the comparison algorithms could be more detailed. Since TBM directly utilizes NCCT images as input for prediction, while SVM and K-means are not inherently designed to process raw imaging data, it would be beneficial to clarify which specific features or input data were used for these comparison models. This would better highlight the effectiveness and advantages of the TBM method.

      We thank the reviewer for this suggestion. We have included additional details of the comparison with machine learning models in the methods section. While we used PCA on the extracted transport maps and raw image data for dimensionality reduction prior to classification, we agree that the machine learning methods described may not have been optimally tuned to examine the data in the format in which it was presented. Future studies should aim to compare TBM with optimized machine and deep learning methods to determine TBM’s potential as an automated clinical risk stratification tool. We have added this to the limitations section of the discussion. [Page 14; Paragraph 2; Lines 22-23 and Page 15; Paragraph 1; Lines 1-2].

      (3) The relatively small training and testing dataset may limit the model's performance and generalizability. Notably, while the study mentions that 1,066 patients from the ERICH dataset met the inclusion criteria, only 170 were randomly selected for the test set. Leveraging the full 1,066 ERICH cases for model training and internal validation might potentially enhance the model's robustness and performance.

      We thank the reviewer for this suggestion. As the reviewer highlights, the intention of the manuscript was to present a methodologically focused study, which led to our small validation cohort of 170 patients from the ERICH dataset. It is our intention to further optimize and validate the TBM method in a future larger study which is underway, taking full advantage of the ERICH dataset. This has been incorporated into the discussion section. [Page 15; Paragraph 1; Lines 11-14].

      (4) Some minor textual issues need to be checked and corrected, such as line 16 in the abstract "Incorporating these traits into a v achieved an AUROC of 0.71 ...".

      We thank the reviewer for this comment. The typographical error has been corrected. 

      (5) Some figures need to be reformatted (e.g., the x-axis in Figure 2 a is blocked).

      We thank the reviewer for this comment. This was intentional to demonstrate that the x-axes in Figures 2a and 2b are identical, thereby highlighting the image features corresponding to the regression line on the graph.

    1. Author response:

      The following is the authors’ response to the original reviews.

      Reviewer #1 (Public review):

      While the authors have largely ruled out zebrin II as the key protein underlying PC vulnerability or resistance to age-related loss, the molecular basis of this phenomenon remains unidentified. This reviewer acknowledges the complexity of this investigation and considers it a minor issue, as the manuscript thoughtfully discusses the gap and highlights it as a future direction.

      We appreciate the reviewer’s acknowledgement of the complexity of determining the molecular basis of differential Purkinje cell vulnerability. Moreover, we acknowledge that zebrin II expression/identity is not the only factor in determining vulnerability; rather, the compartmentalized map as a whole may dictate these differences. We are eager to shed light on this issue through future study.

      In cases where no PC loss is observed in aged mice (Figure 1F), it is unclear whether these PCs undergo morphological degeneration, such as thickened axons and shrunken dendrites. Further characterization of these resilient PCs would help understand why the aged mice without PC loss still exhibit motor deficits (Figure 7).

      Thank you for the excellent idea of examining Purkinje cell morphology in aged mice without Purkinje cell loss. Upon looking for hallmarks of neurodegeneration, such as shrunken dendrites and axonal swellings, in aged mice without Purkinje cell loss, we observed minimal axonal pathology and no shrinkage of the molecular layer. However, we note that while the features we examined are well-studied hallmarks of degeneration, they are specific rather than exhaustive, and subtle morphological characteristics that are beyond our methods’ detection may change. We have added these new results to Figure 2C and added these notes to the manuscript.

      The histologic analysis is based on mice with different genetic backgrounds. For example, the PC-specific reporter mice include two strains: Pcp2-Cre; Ai32 and Pcp2-Cre; Ai40D. These genetic variations may contribute to the heterogeneity of PC loss (Figure 1). To improve clarity, please add the genetic background details to Table 1.

      We have added the genetic backgrounds of all mice used in the study to Table 1.

      Please indicate from which lobule in the anterior or posterior human cerebellum the images in Figure 8 were taken.

      Unfortunately, because of the limitations of human postmortem tissue collection (in some cases, we are provided with a very small block that was collected after the pathologist completed their primary duty for that individual), we cannot with full certainty distinguish the lobules from which the images were taken. However, we are grateful that, upon our request, the pathologists were able to collect tissue mainly from the vermis, which is where we wished to begin, knowing that the vermis in rodents and non-human primates typically has the clearest and most well-studied pattern. That said, this is an important issue that we are addressing for future studies.

      Reviewer #2 (Public review):

      (1) Limited strain diversity: The study exclusively uses C57BL/6J mice despite known genetic and motor differences even in closely related strains like C57BL/6N.

      Thank you for pointing out this limitation of our study. We chose to limit this initial study to C57BL/6J mice based on their widespread use as a background strain on many currently maintained lines. That said, our study intentionally included several different crosses to provide genetic variability, even though C57BL/6J is still the predominant genetic background. In addition to the motor differences in genetic strains, we are also particularly interested in the differences in cerebellar morphology across strains (Inouye and Oda, 1980; Sillitoe and Joyner, 2007). Our use of mice maintained on the C57BL/6J background leaves open an exciting future direction: investigating age-related Purkinje cell loss in mice of different inbred and outbred strains. Given the importance of the topic, we have included new text in the discussion to alert the reader to this limitation of our study and to highlight interesting differences across strains that will be important to disentangle in our future work.

      (2) No correlation quantified between the degree of PC loss, aging, and motor performance.

      We sought to conduct a broad overview of motor problems that might be caused by age-related Purkinje cell loss, rather than a comprehensive investigation of how motor behavior changes with advancing Purkinje cell loss. Therefore, we agree with the reviewer’s comment, and we have added text to indicate that stronger correlations between these domains would be best tackled with deeper behavioral phenotyping conducted over time to match the potentially co-occurring progressive changes in cerebellar morphology, with a focus on Purkinje cell degeneration and eventual loss.

      (3) It has not been demonstrated whether the neurodegenerative changes are indeed observed in zebrin-negative PCs.

      We have added Supplementary Figure 4, which includes an example of reduced dendritic density and loss of Purkinje cell somata in zebrin II-negative stripes in lobules II and III. Please also see Figure 4B for an example of reduced dendritic density in zebrin II-negative Purkinje cells in lobules III and IV.

      (4) The mechanisms of why only a subset of mice show PC loss remain unexplored and not discussed.

      We agree that our manuscript would benefit from discussion of why some aged mice are resistant to age-related Purkinje cell loss. We have elaborated upon possible reasons for this differential vulnerability in the discussion.

      (5) Linkages with normal human aging and cerebellar function are not well supported. While motor behavioral assays captured phenotypes that mimic aged people, correlation with PC loss is demonstrated to be absent in mice. It remains unclear whether this PC loss phenomenon is universal or specific to a particular individual; and whether specific to a human PC subtype.

      In our study, we sought to show that patterned age-related Purkinje cell loss presents a promising area for future research in humans. We agree that further study is needed to solidify a link between age-related Purkinje cell loss in mice and humans and the implications for motor function. The reviewer raises a fair criticism that reflects the current state of knowledge: studies that link cerebellar aging to motor function and cognitive decline in humans are few, as are studies of the cellular-level morphological changes of cerebellar aging, and there is a pressing need for deeper study of human tissue. To address the issue raised by the reviewer, we have added new text to the discussion of our manuscript indicating these gaps in knowledge.

      (6) Analyses in the paraflocculus are currently not easy to understand. This lobule has heterogeneous PC subtypes, developmentally or molecularly. Zebrin-weak and Zebrin-intense PCs are known to be arranged in stripes, which resembles the pattern of developmentally defined PC subsets (Fujita et al., 2014, Plos one; Fujita et al., 2012, J Neurosci). In the data presented, it is hard to appreciate whether the viewing angle is consistent relative to the angle of the paraflocculus. This may be a limitation of the analysis of the paraflocculus in general, that the orientation of this lobule is so susceptible to fixation and dissection. Discrepancy between PC loss stripe and zebrin pattern may be an overstatement, because appropriate analyses on the paraflocculus would require a rigorously standardized analytic method.

      Thank you for your valuable insights on the complexity of analyzing the paraflocculus. We have altered our language to more accurately reflect the nuanced zebrin II expression pattern of this region. We also agree with and very much appreciate your advice that “analyses on the paraflocculus would require a rigorously standardized analytic method.” We have added these arguments to the revised manuscript text.

      Reviewer #3 (Public review):

      (1) In Figure 3, the authors use Pcp2-CRE mice to drive GFP expression in Purkinje cells in order to avoid the confounding variable of loss of calbindin expression in aging Purkinje cells. The authors go on to say, "we argue that calbindin expression alone is not a reliable, sufficient indicator of Purkinje cell loss". However, in Figure 4, the authors return to calbindin staining alone to assess the correlation of Purkinje cell loss with zebrin-II expression. Could the authors comment on why zebrin-II co-staining experiments were not performed in GFP reporter mice to avoid potential confounds of calbindin expression? Without this experiment, should readers accept the data presented in Figure 4 as a "reliable, sufficient indicator of Purkinje cell loss", given the author's prior claim?

      This is a very good point, thank you. We agree that the data presented in Figure 4 alone would not be a sufficient indicator of Purkinje cell loss. However, we prefaced our calbindin and zebrin II co-staining with calbindin and GFP co-staining (Figure 3), which showed that Purkinje cell-specific reporter expression revealed the same pattern of Purkinje cell loss as calbindin expression, and Neutral Red staining (Figure 2 and Supplementary Figure 3B), which confirmed the loss of Purkinje cells independent of immunofluorescence. For this reason, we feel confident that the data in Figure 4 is representative of the striped pattern of age-related Purkinje cell loss. Still, we see and agree with the reviewer’s comment, and therefore, to further show the correlation of Purkinje cell loss with zebrin II expression, we have added a new Supplementary Fig. 4, which shows co-staining of calbindin, GFP, and zebrin II.

      (2) Throughout the manuscript, there is a considerable reliance on the authors' interpretation of imaging data with no accompanying quantification (categorization of "striped" or "non-striped" PC loss, correlation of GFP/calbindin/zebrin-II staining, etc.). While this may be difficult to obtain, the results would be much stronger with a quantitative approach to support the stated categorizations/observations.

      Thank you for your suggestion. Quantifying stripe properties has been a challenging task for the field, given the regionalized features of stripe compartmentalization that make its complex architecture tricky to measure in its typical organization within the 3D anatomy of lobules and fissures and even harder to interpret when there are abnormalities. However, to quantitatively support our categorization of “striped” and “non-striped” Purkinje cell loss and the observed correlation between calbindin and GFP expression in aged mice, we have quantified the mediolateral pixel intensity across lobules II-IV, in which Purkinje cell loss reliably occurs in zebrin II-negative stripes. The results can be found in Supplementary Figure 1B and Supplementary Figure 3.
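
      As a rough illustration of what a mediolateral pixel intensity profile can look like computationally, the sketch below averages a small grayscale image down each column. This is an assumption-laden toy example, not the quantification pipeline used for the supplementary figures: it assumes image columns correspond to mediolateral position, and the function name and pixel values are hypothetical.

```python
def mediolateral_profile(image):
    """Mean pixel intensity at each mediolateral position.

    `image` is a list of rows; each row is a list of pixel intensities.
    Assumption for this sketch: columns run along the mediolateral axis,
    so averaging down each column yields one value per mediolateral position.
    """
    n_rows = len(image)
    return [sum(column) / n_rows for column in zip(*image)]


# Hypothetical 2 x 4 region of interest with alternating dim and bright columns
roi = [
    [10, 200, 10, 200],
    [20, 180, 30, 220],
]
profile = mediolateral_profile(roi)
```

      In such a profile, alternating low and high values across positions would be consistent with a striped pattern, whereas a flat profile would not. Any threshold separating the two categories would need to be chosen and validated on real data.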

      Reviewer #1 (Recommendations for the authors):

      (1) In Figure 1, both staining artifacts and PC degeneration appear in light color. Please clarify how these two were differentiated.

      Thank you for your comment, which raises an important point about distinguishing staining artifacts from Purkinje cell degeneration. Cerebellar patterning is symmetrical across the midline, so asymmetrical abnormalities are one clue that differentiates staining artifacts from the degenerative pattern. Another indicator of a staining artifact seen in wholemount preparations is the gradual fading of the stain (seen in some hemispheres in Figure 1), which is caused by continuous rubbing of the cerebellum against the tube during the staining process. In some cases, such as in Figure 1F, the cerebellum was damaged during the dissection of the meninges after staining, and in such cases the accidental removal of cerebellar tissue (molecular layer) reveals unstained tissue beneath the surface of the cerebellum. This type of staining artifact can be identified by a missing chunk of tissue surrounded by stained Purkinje cells, compared to the smooth, unmarred tissue where PCs have degenerated. We have added new text to the results (the legends) to clarify these critical points for the reader.

      (2) In Figure 7C, please consider changing "Aged without stripes" to "Aged without PC loss" to be consistent with the labeling used in other panels.

      Thank you for pointing out this discrepancy. We have made the suggested changes.

      Reviewer #3 (Recommendations for the authors):

      Could the authors comment on why zebrin-II co-staining experiments were not performed in GFP reporter mice to avoid potential confounds of calbindin expression? Without this experiment, should readers accept the data presented in Figure 4 as a "reliable, sufficient indicator of Purkinje cell loss", given the author's prior claim?

      Thank you for this recommendation; we appreciate this advice. As we described above, our response to this suggestion reads:

      This is a very good point, thank you. We agree that the data presented in Figure 4 alone would not be a sufficient indicator of Purkinje cell loss. However, we prefaced our calbindin and zebrin II co-staining with calbindin and GFP co-staining (Figure 3), which showed that Purkinje cell-specific reporter expression revealed the same pattern of Purkinje cell loss as calbindin expression, and Neutral Red staining (Figure 2 and Supplementary Figure 3B), which confirmed the loss of Purkinje cells independent of immunofluorescence. For this reason, we feel confident that the data in Figure 4 is representative of the striped pattern of age-related Purkinje cell loss. Still, we see and agree with the reviewer’s comment, and therefore to further show the correlation of Purkinje cell loss with zebrin II expression, we have added a new Supplementary Fig. 4, which shows co-staining of calbindin, GFP, and zebrin II.

    1. Author response:

      The following is the authors’ response to the original reviews.

      Reviewing Editor Comments:

      Recommendations for improvement:

      (1) Address data presentation, editing, and other issues of lack of clarity as pointed out by the reviewers.

      We have now addressed all comments from reviewers that identify editing errors and lack of clarity issues. Regarding data presentation we have made some changes, for example including a combined heatmap to show consistency between row names (Figure 2 - figure supplement 2), but also kept some stylistic features such as the balance between main and supplemental figures that we think fits more naturally with the story of the paper.

      (2) Inclusion of requested and critical details in the methodology section, an important component for broad applicability of a new methodology by other investigators.

      We have added the requested details to the methods section, specifically the RCA protocol.

      (3) More in-depth discussion of the limitations of the methodology and approach to capture important but more complex components of tissues of interest, for example, sexual dimorphism.

      We have now edited the ‘pitfalls of study’ section in the discussion to include further detail of the limitations of the number of genes that can be used to deeply profile transcriptomic types, including sexual dimorphism. Regarding its use in other tissues of interest, we have now included a reference in the discussion (Bintu et al., 2025) where a similar strategy has been used to profile cells in the olfactory epithelium and olfactory bulb. We have also used hamFISH in other brain areas (as commented in our public reviews responses) but as this is unpublished work we will refrain from mentioning it in the main text.

      Reviewer #1 (Recommendations for the authors):

      The manuscript by Edwards et al. would benefit from minor revisions. Here, we outline several points that could / should be addressed:

      (1) General balance of data presentation between main and supplementary figures

      (a) quantifications were often missing from main figures and only presented in the supplements

      Thank you for raising this point. We believe that the balance of panels between the main and supplemental figures matches our story and results section well with quantifications included in the main figures where appropriate.

      (b) more informative figure legends in supplements (e.g.: Supplementary Figure I - Figure 3)

      We have now revised the figure legends and added more description where appropriate.

      (c) missing subpanel in Figure 3; figure legend describes 3H, which is missing in the figure

      We thank the reviewer for pointing this out and have now amended the subpanel.

      stand-alone figure on inhibitory neuron cluster i3 cells

      We agree that this is an important characterisation of i3 cells but decided to place this figure in the supplement as it does not fall within the main storyline (defining transcriptomic characterisation of cell types in a multimodal fashion), but rather acts as accessory information for those specifically interested in these inhibitory cell types.

      statistical tests used (e.g.: Figure 1 C -, Supplementary Figure 3 - Figure 2)/ graphs shown (Supplementary Figure 1 - 1 D)

      The statistical tests used are described in the figure legends.

      t-SNE dimensionality reduction of positional parameters

      Explanations of the t-SNE dimensionality reduction of positional parameters can be found in the materials and methods.

      (d) heatmaps similarly informative and more convincing

      We have included an extra heatmap (Figure 2 - figure supplement 2) in response to Reviewer 3’s comment (see below) in order to more easily follow genes across all the different clusters. We hope this helps to make the heatmaps more convincing and informative.

      code availability

      Code availability is described in the methods section of the manuscript.

      page 6, 3rd paragraph wrong description of PMCo abbreviation

      We thank the reviewer for identifying the mistake and we have now amended it.

      Reviewer #2 (Recommendations for the authors):

      The pre-existing scRNA-seq dataset on which the manuscript is based is an older Drop-seq dataset for which minimal QC information is provided. The authors should include QC information (genes/cells and UMIs/cells) in the Methods. Moreover, the Seurat clustering of these cells and depiction of marker genes in feature plots are not shown.

      It is therefore difficult to determine how the authors selected their 31 genes for their hamFISH panel, or how selective they are to the original Drop-seq clusters.

      The QC information of this dataset can be found in the original publication (Chen et al., 2019) with our clustering methods described in the materials and methods section. We have not included individual gene names in our heatmap plots for presentation purposes (there are over 200 rows), but the data and cluster descriptions can be found in supplemental tables.

      Reviewer #3 (Recommendations for the authors):

      (1) The imaging modality is not entirely clear in the methods. The microscopy technique is referenced to prior work and involves taking z-stacks, but analysis appears to be done on maximum z-projections, which seems like it would introduce the risk of false attribution of gene expression to cells that are overlapping in "z".

      Thank you for pointing out the technical limitation of the microscopy. For imaging we used epifluorescence microscopy with 14 × 500 nm z-steps to collect our raw data and generate a maximum intensity projection for further analysis. Because of the thin sections (10 µm) used for the imaging, the overlap between cells in z is expected to be minimal. However, we cannot completely rule out the misattribution raised in the comment. The methods section contains this information.
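
      The projection step and the misattribution concern can be made concrete with a small sketch. It is illustrative only, not the image-processing code used in the study; the function name and the toy stack are hypothetical. A maximum intensity projection keeps, for each (y, x) pixel, the brightest value found across all z-planes, so signals from different depths end up in the same 2D image.

```python
def max_intensity_projection(stack):
    """Collapse a z-stack into one 2D image by taking the per-pixel maximum.

    `stack` is indexed as stack[z][y][x]; all planes must share the same shape.
    """
    return [
        [max(values_across_z) for values_across_z in zip(*rows_across_z)]
        for rows_across_z in zip(*stack)
    ]


# Hypothetical stack of two 2 x 2 planes: one spot in plane 0, another in plane 1
stack = [
    [[1, 0], [0, 0]],  # z = 0
    [[0, 5], [0, 0]],  # z = 1
]
projection = max_intensity_projection(stack)
```

      Both spots appear in the top row of the projection even though they sit at different depths, which is the source of the possible misattribution when cells overlap in z. With thin sections, few cells are stacked on top of each other, which limits the problem.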

      (2) Supplemental Figure 1 - Figure Supplement 2B: RCA looks significantly different when compared to v2 smFISH from the representative image, although it is written as comparable. Additionally, there is no information about RCA mentioned in the Materials and Methods section. Supplemental Figure 1 - Figure Supplement 2B: The figure label for RCA is missing.

      By comparable we are referring to the intensity rather than pattern as mentioned in the results section. We did not analyze the number of spots. It is true that the pattern of RCA signal is much sparser due to its inherent insensitivity compared with hamFISH. We thank the reviewer for identifying the lack of a methodological RCA description and have amended the manuscript to include this. We have also now amended the missing RCA label in the figure.

      (3) Figure 2C and associated supplement: The rows (each gene) are not consistent across the subpanels (i.e. they do not line up left-to-right), this makes it difficult for the reader to follow the patterns that distinguish the cell types in each subset.

      We have done this as we believe it makes for an easier interpretation of inhibitory vs excitatory clusters for the reader. However, we agree with the reviewer that one may wish to look at the dataset as a whole with a consistent gene order, and we have now provided this in the corresponding supplemental figure.  

      (4) "Consistent with previous work, most inhibitory classes are localized in the dorsal and ventral subdivisions of the MeA, whereas excitatory neurons occupy primarily the ventral MeA (Figure 2D, Figure 2 - Figure Supplement 2C, Figure 1D)". - The reference to Figure 1D seems to be an error.

      We thank the reviewer for identifying the mistake, and we have now amended it.

      (5) Supplemental Figure 2 - Figure Supplement 1, "published by Chen et al." - should have a proper reference number to be compatible with the rest of the manuscript. Also, the lack of gene info makes it difficult to understand Panel A. Finally, the text on Panel B refers to "hamMERFISH" which seems an error.

      We thank the reviewer for identifying the mistake on Panel B, it has now been amended. We have also changed the reference format. Regarding the lack of gene information in panel A, it is difficult to present all row names due to the large number of rows (>200), but this information can be found in supplemental table 2.

      (6) Supplemental Figure 2 - Figure Supplement 1: there are thin dividing lines drawn on each section, but these are not described or defined, making it difficult to understand what is being delineated.

      We thank the reviewer for identifying this omission and have now edited the figure legend to contain a description.

      (7) Page 4, "...we found 26 clusters in cells that are positive for Slc32a1 (inhibitory) or Slc17a6 (encoding Vglut2 and therefore excitatory) positive (Figure 2 - figure supplement 1A, Table S2)."

      This seems to be an error as Figure 2 - figure supplement 1A does not show this.

      We double-checked that this description describes the panel accurately.

      (8) "The clustering revealed that inhibitory and excitatory classes generally have different spatial properties (Figure 1E, left), although the salt-and-pepper, sparse nature of e10 (Nts+) cells is more similar to inhibitory cells than other excitatory classes".

      The reference to Figure 1E should be to Figure 2E.

      We thank the reviewer for identifying the mistake, and we have now amended it.

      (9) "Comparison of the proportion of all cells that are cluster X vs projection neurons labelled by CTB that are cluster X". Please explain cluster X in this context.

      We have now rephrased this sentence in the figure legend for clarity.

      (10) Figure 3 - figure supplement 3: There appears to be quite a bit of heterogeneity in the patterns of activity across clusters even within behavioral contexts (e.g. the bottom 2 animals paired with females). It might be worth commenting on (or quantifying) whether there were any evident differences in the social behaviors observed (e.g. mating or not?) in individuals demonstrating these patterns.

      We thank the reviewer for this observation. We unfortunately did not quantify the behaviors, but we agree that more work is needed to link the pattern of c-fos activity with incrementally measured behavioral variables. At least, we did not include animals that did not display the anticipated social behaviours (as described in the materials and methods) in the in situ transcriptomic profiling work.

    1. Author response:

      The following is the authors’ response to the original reviews.

      Reviewer #1 (Public Review):

      In the current article, Octavia Soegyono and colleagues study "The influence of nucleus accumbens shell D1 and D2 neurons on outcome-specific Pavlovian instrumental transfer", building on extensive findings from the same lab. While there is a consensus about the specific involvement of the Shell part of the Nucleus Accumbens (NAc) in specific stimulus-based actions in choice settings (and not in General Pavlovian instrumental transfer - gPIT, as opposed to the Core part of the NAc), mechanisms at the cellular and circuitry levels remain to be explored. In the present work, using sophisticated methods (rat Cre-transgenic lines from both sexes, optogenetics, and the well-established behavioral paradigm outcome-specific PIT-sPIT), Octavia Soegyono and colleagues decipher the differential contribution of dopamine receptors D1 and D2 expressing spiny projection neurons (SPNs).

      After validating the viral strategy and the specificity of the targeting (immunochemistry and electrophysiology), the authors demonstrate that while both NAc Shell D1- and D2-SPNs participate in mediating sPIT, NAc Shell D1-SPNs projections to the Ventral Pallidum (VP, previously demonstrated as crucial for sPIT), but not D2-SPNs, mediates sPIT. They also show that these effects were specific to stimulus-based actions, as value-based choices were left intact in all manipulations.

      This is a well-designed study, and the results are well supported by the experimental evidence. The paper is extremely pleasant to read and adds to the current literature.

      We thank the Reviewer for their positive assessment. 

      Reviewer #2 (Public Review):

      Summary: 

      This manuscript by Soegyono et al. describes a series of experiments designed to probe the involvement of dopamine D1 and D2 neurons within the nucleus accumbens shell in outcome-specific Pavlovian-instrumental transfer (osPIT), a well-controlled assay of cue-guided action selection based on congruent outcome associations. They used an optogenetic approach to phasically silence NAc shell D1 (D1-Cre mice) or D2 (A2a-Cre mice) neurons during a subset of osPIT trials. Both manipulations disrupted cue-guided action selection but had no effects on negative control measures/tasks (concomitant approach behavior, separate valued guided choice task), nor were any osPIT impairments found in reporter-only control groups. Separate experiments revealed that selective inhibition of NAc shell D1 but not D2 inputs to ventral pallidum was required for osPIT expression, thereby advancing understanding of the basal ganglia circuitry underpinning this important aspect of decision making.

      Strengths: 

      The combinatorial viral and optogenetic approaches used here were convincingly validated through anatomical tract-tracing and ex vivo electrophysiology. The behavioral assays are sophisticated and well-controlled to parse cue and value-guided action selection. The inclusion of reporter-only control groups is rigorous and rules out nonspecific effects of the light manipulation. The findings are novel and address a critical question in the literature. Prior work using less decisive methods had implicated NAc shell D1 neurons in osPIT but suggested that D2 neurons may not be involved. The optogenetic manipulations used in the current study provide a more direct test of their involvement and convincingly demonstrate that both populations play an important role. Prior work had also implicated NAc shell connections to ventral pallidum in osPIT, but the current study reveals the selective involvement of D1 but not D2 neurons in this circuit. The authors do a good job of discussing their findings, including their nuanced interpretation that NAc shell D2 neurons may contribute to osPIT through their local regulation of NAc shell microcircuitry.

      We thank the Reviewer for their positive assessment. 

      Weaknesses: 

      The current study exclusively used an optogenetic approach to probe the function of D1 and D2 NAc shell neurons. Providing a complementary assessment with chemogenetics or other appropriate methods would strengthen conclusions, particularly the novel demonstration of D2 NAc shell involvement. Likewise, the null result of optically inhibiting D2 inputs to the ventral pallidum leaves open the possibility that a more complete or sustained disruption of this pathway may have impaired osPIT.

      We acknowledge the reviewer's valuable suggestion that demonstrating NAc-S D1- and D2-SPNs engagement in outcome-specific PIT through another technique would strengthen our optogenetic findings. Several approaches could provide this validation. Chemogenetic manipulation, as the reviewer suggested, represents one compelling option. Alternatively, immunohistochemical assessment of phosphorylated histone H3 at serine 10 (P-H3) offers another promising avenue, given its established utility in reporting striatal SPNs plasticity in the dorsal striatum (Matamales et al., 2020). We hope to complete such an assessment in future work since it would address the limitations of previous work that relied solely on ERK1/2 phosphorylation measures in NAc-S SPNs (Laurent et al., 2014). The manuscript was modified to report these future avenues of research (page 12).

      Regarding the null result from optical silencing of D2 terminals in the ventral pallidum, we agree with the reviewer's assessment. While we acknowledge this limitation in the current manuscript (page 13), we aim to address this gap in future studies to provide a more complete mechanistic understanding of the circuit.

      Reviewer 3 (Public Review):

      Summary:

      The authors present data demonstrating that optogenetic inhibition of either D1- or D2-MSNs in the NAc Shell attenuates expression of sensory-specific PIT while largely sparing value-based decision on an instrumental task. They also provide evidence that SS-PIT depends on D1-MSN projections from the NAc-Shell to the VP, whereas projections from D2-MSNs to the VP do not contribute to SS-PIT.

      Strengths:

      This is clearly written. The evidence largely supports the authors' interpretations, and these effects are somewhat novel, so they help advance our understanding of PIT and NAc-Shell function.

      We thank the Reviewer for their positive assessment. 

      Weaknesses:

      I think the interpretation of some of the effects (specifically the claim that D1-MSNs do not contribute to value-based decision making) is not fully supported by the data presented.

      We appreciate the reviewer's comment regarding the marginal attenuation of value-based choice observed following NAc-S D1-SPN silencing. While this manipulation did produce a slight reduction in choice performance, the behavior remained largely intact. We are hesitant to interpret this marginal effect as evidence for a direct role of NAc-S D1-SPNs in value-based decision-making, particularly given the substantial literature demonstrating that NAc-S manipulations typically preserve such choice behavior (Corbit et al., 2001; Corbit & Balleine, 2011; Laurent et al., 2012). Furthermore, previous work has shown that NAc-S D1 receptor blockade impairs outcome-specific PIT while leaving value-based choice unaffected (Laurent et al., 2014). We favor an alternative explanation for our observed marginal reduction. As documented in Supplemental Figure 1, viral transduction extended slightly into the nucleus accumbens core (NAc-C), a region established as critical for value-based decision-making (Corbit et al., 2001; Corbit & Balleine, 2011; Laurent et al., 2012; Parkes et al., 2015). The marginal impairment may therefore reflect inadvertent silencing of a small number of NAc-C D1-SPNs rather than a functional contribution from NAc-S D1-SPNs. Future studies specifically targeting larger NAc-C D1-SPN populations would help clarify this possibility and provide definitive resolution of this question.

      Reviewer 1 (Recommendations for the Author):

      My main concerns and comments are listed below.

      (1) Could the authors provide the "raw" data of the PIT tests, such as PreSame vs Same vs PreDifferent vs Different? Could the authors clarify how the Net responding was calculated? Was it Same minus PreSame & Different minus PreDifferent, or was the average of PreSame and PreDifferent used in this calculation?

      The raw data for PIT testing across all experiments are now included in the Supplemental Figures (Supplemental Figures S1E, S2E, S3E, and S4E). Baseline responding was quantified as the average number of lever presses per minute for both actions during the two-minute period (i.e., average of PreSame and PreDifferent) preceding each stimulus presentation. This methodology has been clarified in the revised manuscript (page 7).
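The baseline correction described above can be sketched in a few lines. This is only an illustration, not the analysis code used in the study: the function name `net_responding` and the example rates are hypothetical, and the subtraction of the shared baseline from each stimulus-period rate is an assumption about how "net" responding follows from the stated baseline definition.

```python
def net_responding(pre_same, pre_different, same, different):
    """Return (net_same, net_different) in lever presses per minute.

    Baseline is the average of the two pre-stimulus rates, as stated in the
    response. Subtracting that baseline from each stimulus-period rate is an
    assumption made for this sketch.
    """
    baseline = (pre_same + pre_different) / 2.0
    return same - baseline, different - baseline


# Hypothetical rates (presses per minute), chosen only for illustration.
net_same, net_different = net_responding(
    pre_same=4.0, pre_different=6.0, same=12.0, different=7.0
)
print(net_same, net_different)
```

Using a single averaged baseline for both actions means that any difference between net Same and net Different responding equals the difference between the raw Same and Different rates.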

      (2) While both sexes are utilized in the current study, no statistical analysis is provided. Can the authors please comment on this point and provide these analyses (for both training and tests)?

      As noted in the original manuscript, the final sample sizes for female and male rats were insufficient to provide adequate statistical power for sex-based analyses (page 15). To address this limitation, we have now cited a previous study from our laboratory (Burton et al., 2014) that conducted such analyses with sufficient power in identical behavioural tasks. That study identified only marginal sex differences in performance, with female rats exhibiting slightly higher magazine entry rates during Pavlovian conditioning. Importantly, no differences were observed in outcome-specific PIT or value-based choice performance between sexes.

      (3) Regarding Figure 1 - Anterograde tracing in D1-Cre and A2a-Cre rats (from line 976), I have one major and one minor question:

      (3.1) I do not understand the rationale of showing anterograde tracing from the Dorsal Striatum (DS) as this region is not studied in the current work. Moreover, sagittal micrographs of D1-Cre and A2a-Cre would be relevant here. Could the authors please provide these micrographs and explain the rationale for doing tracing in DS?

      We included dorsal striatum (DS) tracing data as a reference because the projection patterns of D1 and D2 SPNs in this region are well-established and extensively characterized, in contrast to the more limited literature on these cell types in the NAc-S. Regarding the comment about sagittal micrographs, we are uncertain of the specific concern as these images are presented in Figure 1B.

      If the reviewer is requesting sagittal micrographs for NAc-S anterograde tracing, we did not employ this approach because: (1) the NAc-S and ventral pallidum are anatomically adjacent regions and (2) the medial-lateral coordinates of the ventral pallidum and lateral hypothalamus do not align optimally with those of the NAc-S, limiting the utility of sagittal analysis for these projections.

      (3.2) There is no description about how the quantifications were done: manually? Automatically? What script or plugin was used? If automated, what were the thresholding conditions? How many brain sections along the anteroposterior axis? What was the density of these subpopulations? Can the authors include a methodological section to address this point?

      We apologize for the omission of quantification methods used to assess viral transduction specificity. This methodological description has now been added to the revised manuscript (page 22). Briefly, we employed a manual procedure in two sections per rat, and cell counts were completed in a defined region of interest located around the viral infusion site.

      (4) Lex A & Hauber (2008) Dopamine D1 and D2 receptors in the nucleus accumbens core and shell mediate Pavlovian-instrumental transfer. Learning & memory 15:483- 491, should be cited and discussed. It also seems that the contribution of the main dopaminergic source of the brain, the ventral tegmental area, is not cited, while it has been investigated in PIT in at least 3 studies regarding sPIT only, notably the VP-VTA pathway (Leung & Balleine 2015, accurately cited already).

      We did not include the Lex & Hauber (2008) study because its experimental design (single lever and single outcome) prevents differentiation between the effects of Pavlovian stimuli on action performance (general PIT) versus action selection (outcome-specific PIT, as examined in the present study). Drawing connections between their findings and our results would require speculative interpretations regarding whether their observed effects reflect general or outcome-specific PIT mechanisms, which could distract from the core findings reported in the article.

      Several studies examining the role of the VTA in outcome-specific PIT were referenced in the manuscript's introduction. Following the reviewer's recommendation, these references have also been incorporated into the discussion section (page 13). 

      (5) While not directly the focus of this study, it would be interesting to highlight the accumbens dissociation between General vs Specific PIT, and how the dopaminergic system (differentially?) influences both forms of PIT.

      We agree with the reviewer that the double dissociation between nucleus accumbens core/shell function and general/specific PIT is an interesting topic. However, the present manuscript does not examine this dissociation, the nucleus accumbens core, or general PIT. Similarly, our study does not directly investigate the dopaminergic system per se. We believe that discussing these topics would distract from our core findings and substantially increase manuscript length without contributing novel data directly relevant to these areas. 

      (6) While authors indicate that conditioned response to auditory stimuli (magazine visits) are persevered in all groups, suggesting intact sensitivity to the general motivational properties of reward-predictive stimuli (lines 344, 360), authors can't conclude about the specificity of this behavior i.e. does the subject use a mental representation of O1 when experiencing S1, leading to a magazine visits to retrieve O1 (and same for S2-O2), or not? Two food ports would be needed to address this question; also, authors should comment on the fact that competition between instrumental & pavlovian responses does not explain the deficits observed.

      We agree with the Reviewer that magazine entry data cannot be used to draw conclusions about specificity, and we do not make such claims in our manuscript. We are therefore unclear about the specific concern being raised. Following the Reviewer’s recommendation, we have commented on the fact that response competition could not explain the results obtained (page 11, see also supplemental discussion). 

      The minor comments are listed below.

      (7) A high number of rats were excluded (> 32 total), and the number of rats excluded for NAc-S D1-SPNs-VP is not indicated.

      We apologize for omitting the number of rats excluded from the experiment examining NAc-S D1-SPN projections to the ventral pallidum. This information has been added to the revised manuscript (page 22).

      (7.1) Can authors please comment on the elevated number of exclusions?

      A total of 133 rats were used across the reported experiments, with 40 rats excluded based on post-mortem analyses. This represents an attrition rate of approximately 30%, which we consider reasonable given that most animals received two separate viral infusions and two separate fiber-optic cannula implantations, and that the inclusion of both female and male rats contributed to some variability in coordinates and thus in targeting.
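For readers who wish to check the stated attrition rate, the arithmetic is shown below as a minimal sketch. The two counts come from the response above; the variable names are illustrative only.

```python
# Counts reported in the response above.
total_rats = 133
excluded_rats = 40

retained_rats = total_rats - excluded_rats
attrition_rate = excluded_rats / total_rats

print(f"retained: {retained_rats}, attrition: {attrition_rate:.1%}")
```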

      (7.2) Can authors please present the performance of these animals during the tasks (OFF conditions, and for control ones, both ON & OFF conditions)?

      Rats were excluded after assessing the spread of viral infusions, placement of fibre-optic cannulas and potential damage due to the surgical procedures (page 21). The requested data are presented below and plotted in the same manner as in Figures 3-6. The pattern of performance in excluded animals was highly variable. 

      Author response image 1.

       

      (8) For tracing, only males were used, and for electrophysiology, only females were used.

      (8.1) Can authors please comment on not using both sexes in these experiments? 

      We agree that equal allocation of female and male rats in the experiments presented in Figures 1-2 would have been preferable. Animal availability was the sole factor determining these allocations. Importantly, both female and male D1-Cre and A2A-Cre rats were used for the NAc-S tracing studies, and no sex differences were observed in the projection patterns. The article describing the two transgenic lines of rats did not report any sex difference (Pettibone et al., 2019).

      (8.2) Is there evidence in the literature that the electrophysiological properties of female versus male SPNs could differ?

      The literature indicates that there is no sex difference in the electrophysiological properties of NAc-S SPNs (Cao et al., 2018; Willett et al., 2016).

      (8.3) It seems like there is a discrepancy between the number of animals used as presented in the Figure 2 legend versus what is described in the main text. In the Figure legend, I understand that 5 animals were used for D1-Cre/DIO-eNpHR3.0 validation, and 7 animals for A2a-Cre/DIO-eNpHR3.0; however, the main text indicates the use of a total of 8 animals instead of the 12 presented in the Figure legend. Can authors please address this mismatch or clarify?

      The number of rats reported in the main text and Figure 2 legend was correct. However, recordings sometimes involved multiple cells from the same animal, and this aspect of the data was incorrectly reported and generated confusion. We have clarified the numbers in both the main text and Figure 2 legend to distinguish between animal counts and cell counts. 

      (9) Overall, in the study, have the authors checked for outliers?

      Performance across all training and testing stages was inspected to identify potential behavioral outliers in each experiment. Abnormal performance during a single session within a multi-session stage was not considered sufficient grounds for outlier designation. Based on these criteria, no subjects remaining after post-mortem analyses exhibited performance patterns warranting exclusion through statistical outlier analysis. However, we have conducted the specific analyses requested by the Reviewer, as described below.

      (9.1) In Figure 3, it seems that one female in the eYFP group, in the OFF situation, for the different condition, has a higher level of responding than the others. Can authors please confirm or refute this visual observation with the appropriate statistical analysis?

      Statistical analysis (z-score) confirmed the reviewer's observation regarding responding of the different action in the OFF condition for this subject (|z| = 2.58). Similar extreme responding was observed in the ON condition (|z| = 2.03). Analyzing responding on the different action in isolation is not informative in the context of outcome-specific PIT. Additional analyses revealed |z| < 2 when examining the magnitude of choice discrimination in outcome-specific PIT (i.e., net same versus net different responding) in both ON and OFF conditions. Furthermore, this subject showed |z| < 2 across all other experimental stages. Based on these analyses, we conclude that the subject should be kept in all analyses.
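A z-score screen of the kind referred to above can be sketched as follows. This is a hedged illustration rather than the procedure actually run: the response does not say whether the sample or population standard deviation was used (the sample SD is assumed here), the |z| ≥ 2 cut-off is inferred from the "|z| < 2" criterion quoted above, and the function names and data are hypothetical.

```python
from statistics import mean, stdev


def abs_z_scores(values):
    """Absolute z-score of each value relative to its own group.

    Uses the sample standard deviation (n - 1 denominator); this choice is an
    assumption, as the response does not specify it.
    """
    group_mean = mean(values)
    group_sd = stdev(values)
    return [abs(v - group_mean) / group_sd for v in values]


def flag_outliers(values, threshold=2.0):
    """Indices of subjects whose |z| meets or exceeds the threshold."""
    return [i for i, z in enumerate(abs_z_scores(values)) if z >= threshold]


# Hypothetical response rates (presses per minute) for one group and condition.
rates = [3.0, 4.0, 5.0, 4.0, 3.5, 4.5, 12.0]
print(flag_outliers(rates))
```

As the response argues, a flag on a single measure in a single condition is only a starting point; the same screen would then be applied to the choice-discrimination measure and to the other experimental stages before any exclusion decision.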

      (9.2) In Figure 5, it seems that one male, in the ON situation, in the different condition, has a quite higher level of responding - is this subject an outlier? If so, how does it affect the statistical analysis after being removed? And who is this subject in the OFF condition?

      The reviewer has identified two different male rats infused with the eNpHR3.0 virus and has asked for closer examination of their performance.

      The first rat showed outlier-level responding on the different action in the ON condition (|z| = 2.89) but normal responding for all other measures across LED conditions (|z| < 2). Additional analyses revealed |z| = 2.55 when examining choice discrimination magnitude in outcome-specific PIT during the ON condition but not during the OFF condition (|z| = 0.62). This subject exhibited |z| < 2 across all other experimental stages.

      The second rat showed outlier-level responding on the same action in the OFF condition (|z| = 2.02) but normal responding for all other measures across LED conditions (|z| < 2). Additional analyses revealed |z| = 2.12 when examining choice discrimination magnitude in outcome-specific PIT during the OFF condition but not during the ON condition (|z| = 0.67). This subject also exhibited |z| < 2 across all other experimental stages.

      We excluded these two subjects and conducted the same analyses as described in the original manuscript. Baseline responding did not differ between groups (p = 0.14), allowing us to examine the net effect of the stimuli. Overall lever presses were greater in the eYFP rats (Group: F(1,16) = 6.08, p < 0.05; η<sup>2</sup> = 0.28) and were reduced by LED activation (LED: F(1,16) = 9.52, p < 0.01; η<sup>2</sup> = 0.44) and this reduction depended on the group considered (Group x LED: F(1,16) = 12.125, p < 0.001; η<sup>2</sup> = 0.43). Lever press rates were higher on the action earning the same outcome as the stimuli compared to the action earning the different outcome (Lever: F(1,16) = 49.32; η<sup>2</sup> = 0.76; p < 0.001), regardless of group (Group x Lever: p = 0.14). There was a Lever by LED light condition interaction (Lever x LED: F(1,16) = 5.25; η<sup>2</sup> = 0.24; p < 0.05) but no interaction between group, LED light condition, and Lever during the presentation of the predictive stimuli (p = 0.10). Given the significant Group x LED and Lever x LED interactions, additional analyses were conducted to determine the source of these interactions. In eYFP rats, LED activation had no effect (LED: p = 0.70) and lever presses were greater on the same action (Lever: F(1,9) = 23.94, p < 0.001; η<sup>2</sup> = 0.79) regardless of LED condition (LED x Lever: p = 0.72). By contrast, in eNpHR3.0 rats, lever presses were reduced by LED activation (LED: F(1,9) = 23.97, p < 0.001; η<sup>2</sup> = 0.73), were greater on the same action (Lever: F(1,9) = 16.920, p < 0.001; η<sup>2</sup> = 0.65) and the two factors interacted (LED x Lever: F(1,9) = 9.12, p < 0.01; η<sup>2</sup> = 0.50). These rats demonstrated outcome-specific PIT in the OFF condition (F(1,9) = 27.26, p < 0.001; η<sup>2</sup> = 0.75) but not in the ON condition (p = 0.08).

      Overall, excluding these two rats altered the statistical analyses, but both the original and revised analyses yielded the same outcome: silencing the NAc-S D1-SPN to VP pathway disrupted PIT. More importantly, we do not believe there are sufficient grounds to exclude the two rats identified by the reviewer. These animals did not display outlier-level responding across training stages or during the choice test. Their potential classification as outliers would be based on responding during only one LED condition and not the other, with notably opposite patterns between the two rats despite belonging to the same experimental group.

      (10) I think it would be appreciable if in the cartoons from Figure 5.A and 6.A, the SPNs neurons were color-coded as in the results (test plots) and the supplementary figures (histological color-coding), such as D1- in blue & D2-SPNs in red.

      Our current color-coding system uses blue for D1-SPNs transduced with eNpHR3.0 and red for D2-SPNs transduced with eNpHR3.0. The D1-SPNs and D2-SPNs shown in Figures 5A and 6A represent cells transduced with either eYFP (control) or eNpHR3.0 virus and therefore cannot be assigned the blue or red color, which is reserved for eNpHR3.0-transduced cells specifically. The micrographs in the Supplemental Figures maintain consistency with the color-coding established in the main figures.

      (11) As there are (relatively small) variations in the control performance in term of Net responding (from ~3 to ~7 responses per min), I wonder what would be the result of pooling eYFP groups from the two first experiments (Figures 3 & 4) and from the two last ones (Figures 5 & 6) - would the same statically results stand or vary (as eYFP vs D1-Cre vs A2a-Cre rats)? In particular for Figures 3 & 4, with and without the potential outlier, if it's indeed an outlier.

      We considered the Reviewer’s recommendation but do not believe the requested analysis is appropriate. The Reviewer is requesting the pooling of data from subjects of distinct transgenic strains (D1-Cre and A2A-Cre rats) that underwent surgical and behavioral procedures at different time points, sometimes months apart. Each experiment was designed with necessary controls to enable adequate statistical analyses for testing our specific hypotheses.

      (12) Presence of cameras in operant cages is mentioned in methods, but no data is presented regarding recordings, though authors mention that they allow for real-time observations of behavior. I suggest removing "to record" or adding a statement about the fact that no videos were recorded or used in the present study.

      We have removed “to record” from the manuscript (page 18). 

      (13) In all supplementary Figures, "F" is wrongly indicated as "E".

      We thank the Reviewer for reporting these errors, which have been corrected. 

      (14) While the authors acknowledge that the efficacy of optogenetic inhibition of terminals is questionable, I think that more details are required to address this point in the discussion (existing literature?). Maybe, the combination of an anterograde tracer from SPNs to VP, to label VP neurons (to facilitate patching these neurons), and the Cre-dependent inhibitory opsin in the NAc Shell, with optogenetic illumination at the level of the VP, along with electrophysiological recordings of VP neurons, could help address this question but may, reasonably, seem challenging technically.

      Our manuscript does not state that optogenetic inhibition of terminals is questionable. It acknowledges that we do not provide any evidence about the efficacy of the approach. Regardless, we have provided additional details and suggestions to address this lack of evidence (page 13).

      (15) A nice addition could be an illustration of the proposed model (from line 374), but it may be unnecessary.

      We have carefully considered the reviewer's recommendation. The proposed model is detailed in three published articles, including one that is freely accessible, which we have cited when presenting the model in our manuscript (page 14). This reference should provide interested readers with easy access to a comprehensive illustration of the model.

      Reviewer 2 (Recommendations for the Author):

      As noted in my public comments, this is a truly excellent and compelling study. I have only a few minor comments.

      (1) I could not find the coordinates/parameters for the dorsal striatal AAV injections for that component of the tract tracing experiment.

      We apologize for this omission, which has now been corrected (page 16). 

      (2) Please add the final group sizes to the figure captions.

      We followed the Reviewer’s recommendation and added group sizes in the main figure captions. 

      (3) The discussion of group exclusions (p 21 line 637) seems to accidentally omit (n = X) the number of NAc-S D1-SPNs-VP mice excluded.

      We apologize for this omission, which has now been corrected (page 22). 

      (4) There were some labeling issues in the supplementary figures (perhaps elsewhere, too). Specifically, panel E was listed twice (once for F) in captions.

      We apologize for this error, which has now been corrected.  

      (5) Inspection of the magazine entry data from PIT tests suggests that the optogenetic manipulations may have had some effects on this behavior and would encourage the authors to probe further. There was a significant group difference for D1-SPN inhibition and a marginal group effect for D2-SPNs. The fact that these effects were in opposite directions is intriguing, although not easily interpreted based on the canonical D1/D2 model. Of course, the effects are not specific to the light-on trials, but this could be due to carryover into light-off trials. An analysis of trial-order effects seems crucial for interpreting these effects. One might also consider normalizing for pre-test baseline performance. Response rates during Pavlovian conditioning seem to suggest that D2-eNpHR mice showed slightly higher conditioned responding during training, which contrasts with their low entry rates at test. I don't see any of this as problematic -- but more should be done to interpret these findings.

      We thank the reviewer for raising this interesting point regarding magazine entry rates. Since these data are presented in the Supplemental Figures, we have added a section in the Supplemental Material file that elaborates on these findings. This section does not address trial order effects, as trial order was fully counterbalanced in our experiments and the relevant statistical analyses would lack adequate power. Baseline normalization was not conducted because the reviewer's suggestion was based on their assumption that eNpHR3.0 rats in the D2-SPNs experiment showed slightly higher magazine entries during Pavlovian training. However, this was not the case. In fact, like the eNpHR3.0 rats in the D1-SPNs experiment, they tended to display lower magazine entries during training. The added section therefore focuses on the potential role of response competition during outcome-specific PIT tests. Although we concluded that response competition cannot explain our findings, we believe it may complicate interpretation of magazine entry behavior. Thus, we recommend that future studies examine the role of NAc-S SPNs using purely Pavlovian tasks. It is worth noting that we have recently completed experiments (unpublished) examining NAc-S D1- and D2-SPN silencing during stimulus presentation in a Pavlovian task identical to the one used here. Silencing of either SPN population had no effect on magazine entry behavior.

      Reviewer 3 (Recommendations for the Author):

      Broad comments:

      Throughout the manuscript, the authors draw parallels between the effect established via pharmacological manipulations and those shown here with optogenetic manipulation. I understand using the pharmacological data to launch this investigation, but these two procedures address very different physiological questions. In the case of a pharmacological manipulation, the targets are receptors, wherever they are expressed, and in the case of D2 receptors, this means altering function in both pre-synaptically expressed autoreceptors and post-synaptically expressed D2 MSN receptors. In the case of an optogenetic approach, the target is a specific cell population with a high degree of temporal control. So I would just caution against comparing results from these types of studies too closely.

      Related to this point is the consideration of the physiological relevance of the manipulation. Under normal conditions, dopamine acts at D1-like receptors to increase the probability of cell firing via Ga signaling. In contrast, dopamine binding of D2-like receptors decreases the cell's firing probability (signaling via Gi/o). Thus, shunting D1-MSN activation provides a clear impression of the role of these cells and, putatively, the role of dopamine acting on these cells. However, inhibiting D2-MSNs more closely mimics these cells' response to dopamine (though optogenetic manipulations are likely far more impactful than Gi signaling). All this is to say that when we consider the results presented here in Experiment 2, it might suggest that during PIT testing, normal performance may require a halting of DA release onto D2-MSNs. This is highly speculative, of course, just a thought worth considering.

      We agree with the comments made by the Reviewer, and the original manuscript included statements acknowledging that pharmacological approaches are limited in the capacity to inform about the function of NAc-S SPNs (pages 4 and 9). As noted by the Reviewer, these limitations are especially salient when considering NAc-S D2-SPNs. Based on the Reviewer’s comment, we have modified our discussion to further underscore these limitations (page 12). Finally, we agree with the suggestion that PIT may require a halting of DA release onto D2-SPNs. This is consistent with the model presented, whereby D2-SPNs function is required to trigger enkephalin release (page 13).     

      Section-Specific Comments and Questions:

      Results:

      Anterograde tracing and ex vivo cell recordings in D1 Cre and A2a Cre rats: Why are there no statistics reported for the e-phys data in this section? Was this merely a qualitative demonstration? I realize that the A2a-Cre condition only shows 3 recordings, so I appreciate the limitations in analyzing the data presented.

      The reviewer is correct that we initially intended to provide a qualitative demonstration. However, we have now included statistical analyses for the ex vivo recordings. It is important to note that there were at least 5 recordings per condition, though overlapping data points may give the impression of fewer recordings in certain conditions. We have provided the exact number of recordings in both the main text (page 5) and figure legend. 

      What does trial by trial analysis look like, because in addition to the effects of extinction, do you know if the responsiveness of the opsin to light stimulation is altered after repeated exposures, or whether the cells themselves become compromised in any way with repeated light-inhibition, particularly given the relatively long 2m duration of the trial.

      The Reviewer raises an interesting point, and we provide complete trial-by-trial data for each experiment below. As identified by the Reviewer, there is some evidence for extinction, although it remained modest. Importantly, the data suggest that light stimulation did not affect the physiology of the targeted cells. In eNpHR3.0 rats, performance across OFF trials remained stable (both for Same and Different) even though they were preceded by ON trials, indicating no carryover effects from optical stimulation.

      Author response image 2.

       

      The statistics for the choice test are not reported for eNpHR-D1-Cre rats, but do show a weakening of the instrumental devaluation effect "Group x Lever x LED: F1,18 = 10.04, p < 0.01, η<sup>2</sup> = 0.36". The post hoc comparisons showed that all groups showed devaluation, but it is evident that there is a weakening of this effect when the LED was on (η<sup>2</sup> = 0.41) vs off (η<sup>2</sup> = 0.78), so I think the authors should soften the claim that NAcS-D1s are not involved in value-based decision-making. (Also, there is a typo in the legend in Figure S1, where the caption for panel "F" is listed as "E".) I also think that this could be potentially interesting in light of the fact that with circuit manipulation, this same weakening of the instrumental devaluation effect was not observed. To me, this suggests that D1-NAcS that project to a different region (not VP) contribute to value-based decision making.

      This comment overlaps with one made in the Public Review, for which we have already provided a response. Given its importance, we have added a section addressing this point in the supplemental discussion of the Supplementary Material file, which aligns with the location of the relevant data. The caption labelling error has been corrected.

      Materials and Methods:

      Subjects:

      Were these heterozygous or homozygous rats? If hetero, what rats were used for crossbreeding (sex, strain, and vendor)? Was genotyping done by the lab or outsourced to commercial services? If genotyping was done within the lab, please provide a brief description of the protocol used. How was food restriction established and maintained (i.e., how many days to bring weights down, and was maintenance achieved by rationing or by limiting ad lib access to food for some period in the day)?

      The information requested by the Reviewer has been added to the subjects section (pages 15-16).

      Were rats pair/group housed after implantation of optic fibers?

      We have clarified that rats were group housed throughout (see subjects section; pages 15-16).

      Behavioral Procedures:

      How long did each 0.2ml sucrose infusion take? For pellets, for each US delivery, was it a single pellet or two in quick succession?

      We have modified the method section to indicate that the sucrose was delivered across 2 seconds and that a single pellet was provided (page 17). 

      The CS to ITI duration ratio is quite low. Is there a reason such a short ratio was used in training?

      These parameters are those used in all our previous experiments on outcome-specific PIT. There is no specific reason for using such a ratio, except that it shortens the length of the training session. 

      Relative to the end of training, when were the optical implantation surgeries conducted, and how much recovery time was given before initiating reminder training and testing?

      Fibre-optic implantation was conducted 3-4 days after training and another 3-4 days were given for recovery. This has been clarified in the Materials and methods section (pages 15-16).

      I think a diagram or schematic showing the timeline for surgeries, training, and testing would be helpful to the audience.

      We opted for a text-based experimental timeline rather than a diagram due to slight temporal variations across experiments (page 15).

      On trials, when the LED was on, was light delivered continuously or pulsed? Do these opto-receptors 'bleach' within such a long window?

      We apologize for the lack of clarity; the light was delivered continuously. We have modified the manuscript (pages 6 and 19) and figure legend accordingly. The postmortem analysis did not provide evidence for photobleaching (Supplemental Figures) and as noted above, the behavioural results do not indicate any negative physiological impact on cell function.  

      Immunofluorescence: The blocking solution used during IHC is described as "NHS"; is this normal horse serum?

      The Reviewer is correct; NHS stands for normal horse serum. This has been added (page 21). 

      Microscopy and imaging:

      For the description of rats excluded due to placement or viral spread problems, an n=X is listed for the NAc S D1 SPNs --> VP silencing group. Is this a typo, or was that meant to read as n=0? Also, was there a major sex difference in the attrition rate? If so, I think reporting the sex of the lost subjects might be beneficial to the scientific community, as it might reflect a need for better guidance on sex-specific coordinates for targeting small nuclei.

      We apologize for the error regarding the number of excluded animals. This error has been corrected (page 23). There were no major sex differences in the attrition rate. The manuscript has been updated to provide information about the sex of excluded animals (page 23).

      References

      Cao, J., Willett, J. A., Dorris, D. M., & Meitzen, J. (2018). Sex Differences in Medium Spiny Neuron Excitability and Glutamatergic Synaptic Input: Heterogeneity Across Striatal Regions and Evidence for Estradiol-Dependent Sexual Differentiation. Front Endocrinol (Lausanne), 9, 173. https://doi.org/10.3389/fendo.2018.00173

      Corbit, L. H., Muir, J. L., & Balleine, B. W. (2001). The role of the nucleus accumbens in instrumental conditioning: Evidence of a functional dissociation between accumbens core and shell. J Neurosci, 21(9), 3251-3260. http://eutils.ncbi.nlm.nih.gov/entrez/eutils/elink.fcgi?dbfrom=pubmed&id=11312310&retmode=ref&cmd=prlinks

      Corbit, L. H., & Balleine, B. W. (2011). The general and outcome-specific forms of Pavlovian-instrumental transfer are differentially mediated by the nucleus accumbens core and shell. J Neurosci, 31(33), 11786-11794. https://doi.org/10.1523/JNEUROSCI.2711-11.2011

      Laurent, V., Bertran-Gonzalez, J., Chieng, B. C., & Balleine, B. W. (2014). δ-Opioid and Dopaminergic Processes in Accumbens Shell Modulate the Cholinergic Control of Predictive Learning and Choice. J Neurosci, 34(4), 1358-1369. https://doi.org/10.1523/JNEUROSCI.4592-13.2014

      Laurent, V., Leung, B., Maidment, N., & Balleine, B. W. (2012). μ- and δ-opioid-related processes in the accumbens core and shell differentially mediate the influence of reward-guided and stimulus-guided decisions on choice. J Neurosci, 32(5), 1875-1883. https://doi.org/10.1523/JNEUROSCI.4688-11.2012

      Matamales, M., McGovern, A. E., Mi, J. D., Mazzone, S. B., Balleine, B. W., & Bertran-Gonzalez, J. (2020). Local D2- to D1-neuron transmodulation updates goal-directed learning in the striatum. Science, 367(6477), 549-555. https://doi.org/10.1126/science.aaz5751

      Parkes, S. L., Bradfield, L. A., & Balleine, B. W. (2015). Interaction of insular cortex and ventral striatum mediates the effect of incentive memory on choice between goal-directed actions. J Neurosci, 35(16), 6464-6471. https://doi.org/10.1523/JNEUROSCI.4153-14.2015

      Pettibone, J. R., Yu, J. Y., Derman, R. C., Faust, T. W., Hughes, E. D., Filipiak, W. E., Saunders, T. L., Ferrario, C. R., & Berke, J. D. (2019). Knock-In Rat Lines with Cre Recombinase at the Dopamine D1 and Adenosine 2a Receptor Loci. eNeuro, 6(5). https://doi.org/10.1523/ENEURO.0163-19.2019

      Willett, J. A., Will, T., Hauser, C. A., Dorris, D. M., Cao, J., & Meitzen, J. (2016). No Evidence for Sex Differences in the Electrophysiological Properties and Excitatory Synaptic Input onto Nucleus Accumbens Shell Medium Spiny Neurons. eNeuro, 3(1), ENEURO.0147-15.2016. https://doi.org/10.1523/ENEURO.0147-15.2016

    1. Author response:

      Reviewer #1 (Public review):

      In the current article, Octavia Soegyono and colleagues study "The influence of nucleus accumbens shell D1 and D2 neurons on outcome-specific Pavlovian instrumental transfer", building on extensive findings from the same lab. While there is a consensus about the specific involvement of the Shell part of the Nucleus Accumbens (NAc) in specific stimulus-based actions in choice settings (and not in General Pavlovian instrumental transfer - gPIT, as opposed to the Core part of the NAc), mechanisms at the cellular and circuitry levels remain to be explored. In the present work, using sophisticated methods (rat Cre-transgenic lines from both sexes, optogenetics, and the well-established behavioral paradigm outcome-specific PIT-sPIT), Octavia Soegyono and colleagues decipher the differential contribution of dopamine receptors D1 and D2 expressing spiny projection neurons (SPNs).

      After validating the viral strategy and the specificity of the targeting (immunochemistry and electrophysiology), the authors demonstrate that while both NAc Shell D1- and D2-SPNs participate in mediating sPIT, NAc Shell D1-SPN projections to the Ventral Pallidum (VP, previously demonstrated as crucial for sPIT), but not D2-SPN projections, mediate sPIT. They also show that these effects were specific to stimulus-based actions, as value-based choices were left intact in all manipulations.

      This is a well-designed study, and the results are well supported by the experimental evidence. The paper is extremely pleasant to read and adds to the current literature.

      We thank the Reviewer for their positive assessment.

      Reviewer #2 (Public review):

      Summary:

      This manuscript by Soegyono et al. describes a series of experiments designed to probe the involvement of dopamine D1 and D2 neurons within the nucleus accumbens shell in outcome-specific Pavlovian-instrumental transfer (osPIT), a well-controlled assay of cue-guided action selection based on congruent outcome associations. They used an optogenetic approach to phasically silence NAc shell D1 (D1-Cre mice) or D2 (A2a-Cre mice) neurons during a subset of osPIT trials. Both manipulations disrupted cue-guided action selection but had no effects on negative control measures/tasks (concomitant approach behavior, separate valued guided choice task), nor were any osPIT impairments found in reporter-only control groups. Separate experiments revealed that selective inhibition of NAc shell D1 but not D2 inputs to ventral pallidum was required for osPIT expression, thereby advancing understanding of the basal ganglia circuitry underpinning this important aspect of decision making.

      Strengths:

      The combinatorial viral and optogenetic approaches used here were convincingly validated through anatomical tract-tracing and ex vivo electrophysiology. The behavioral assays are sophisticated and well-controlled to parse cue and value-guided action selection. The inclusion of reporter-only control groups is rigorous and rules out nonspecific effects of the light manipulation. The findings are novel and address a critical question in the literature. Prior work using less decisive methods had implicated NAc shell D1 neurons in osPIT but suggested that D2 neurons may not be involved. The optogenetic manipulations used in the current study provide a more direct test of their involvement and convincingly demonstrate that both populations play an important role. Prior work had also implicated NAc shell connections to ventral pallidum in osPIT, but the current study reveals the selective involvement of D1 but not D2 neurons in this circuit. The authors do a good job of discussing their findings, including their nuanced interpretation that NAc shell D2 neurons may contribute to osPIT through their local regulation of NAc shell microcircuitry.

      We thank the Reviewer for their positive assessment.

      Weaknesses:

      The current study exclusively used an optogenetic approach to probe the function of D1 and D2 NAc shell neurons. Providing a complementary assessment with chemogenetics or other appropriate methods would strengthen conclusions, particularly the novel demonstration of D2 NAc shell involvement. Likewise, the null result of optically inhibiting D2 inputs to the ventral pallidum leaves open the possibility that a more complete or sustained disruption of this pathway may have impaired osPIT.

      We acknowledge the reviewer's valuable suggestion that demonstrating NAc-S D1- and D2-SPN engagement in outcome-specific PIT through another technique would strengthen our optogenetic findings. Several approaches could provide this validation. Chemogenetic manipulation, as the reviewer suggested, represents one compelling option. Alternatively, immunohistochemical assessment of phosphorylated histone H3 at serine 10 (P-H3) offers another promising avenue, given its established utility in reporting striatal SPN plasticity in the dorsal striatum (Matamales et al., 2020). We hope to complete such an assessment in future work since it would address the limitations of previous work that relied solely on ERK1/2 phosphorylation measures in NAc-S SPNs (Laurent et al., 2014).

      Regarding the null result from optical silencing of D2 terminals in the ventral pallidum, we agree with the reviewer's assessment. While we acknowledge this limitation in the current manuscript (see discussion), we aim to address this gap in future studies to provide a more complete mechanistic understanding of the circuit.

      Reviewer #3 (Public review):

      Summary:

      The authors present data demonstrating that optogenetic inhibition of either D1- or D2-MSNs in the NAc Shell attenuates expression of sensory-specific PIT while largely sparing value-based decision on an instrumental task. They also provide evidence that SS-PIT depends on D1-MSN projections from the NAc-Shell to the VP, whereas projections from D2-MSNs to the VP do not contribute to SS-PIT.

      Strengths:

      This is clearly written. The evidence largely supports the authors' interpretations, and these effects are somewhat novel, so they help advance our understanding of PIT and NAc-Shell function.

      We thank the Reviewer for their positive assessment.

      Weaknesses:

      I think the interpretation of some of the effects (specifically the claim that D1-MSNs do not contribute to value-based decision making) is not fully supported by the data presented.

      We appreciate the reviewer's comment regarding the marginal attenuation of value-based choice observed following NAc-S D1-SPN silencing. While this manipulation did produce a slight reduction in choice performance, the behavior remained largely intact. We are hesitant to interpret this marginal effect as evidence for a direct role of NAc-S D1-SPNs in value-based decision-making, particularly given the substantial literature demonstrating that NAc-S manipulations typically preserve such choice behavior (Corbit & Balleine, 2011; Corbit et al., 2001; Laurent et al., 2012). Notably, previous work has shown that NAc-S D1 receptor blockade impairs outcome-specific PIT while leaving value-based choice unaffected (Laurent et al., 2014). We favor an alternative explanation for our observed marginal reduction. As documented in Supplemental Figure 1, viral transduction extended slightly into the nucleus accumbens core (NAc-C), a region established as critical for value-based decision-making (Corbit & Balleine, 2011; Corbit et al., 2001; Laurent et al., 2012). The marginal impairment may therefore reflect inadvertent silencing of a small NAc-C D1-SPN population rather than a functional contribution from NAc-S D1-SPNs. Future studies specifically targeting larger NAc-C D1-SPN populations would help clarify this possibility and provide definitive resolution of this question.

    1. Author response:

      The following is the authors’ response to the original reviews.

      Reviewer #1 (Public review):

      Summary:

      The authors used high-density probe recordings in the medial prefrontal cortex (PFC) and hippocampus during a rodent spatial memory task to examine functional sub-populations of PFC neurons that are modulated vs. unmodulated by hippocampal sharp-wave ripples (SWRs), an important physiological biomarker that is thought to have a role in mediating information transfer across hippocampal-cortical networks for memory processes. SWRs are associated with the reactivation of representations of previous experiences, and associated reactivation in hippocampal and cortical regions has been proposed to have a role in memory formation, retrieval, planning, and memory-guided behavior. This study focuses on awake SWRs that are prevalent during immobility periods during pauses in behavior. Previous studies have reported strong modulation of a subset of prefrontal neurons during hippocampal SWRs, with some studies reporting prefrontal reactivation during SWRs that have a role in spatial memory processes. The study seeks to extend these findings by examining the activity of SWR-modulated vs. unmodulated neurons across PFC sub-regions, and whether there is a functional distinction between these two kinds of neuronal populations with respect to representing spatial information and supporting memory-guided decision-making.

      Strengths:

      The major strength of the study is the use of Neuropixels 1.0 probes to monitor activity throughout the dorsal-ventral extent of the rodent medial prefrontal cortex, permitting an investigation of functional distinction in neuronal populations across PFC sub-regions. They are able to show that SWR-unmodulated neurons, in addition to having stronger spatial tuning than SWR-modulated neurons as previously reported, also show stronger directional selectivity and theta-cycle skipping properties.

      Weaknesses:

      (1) While the study is able to extend previous findings that SWR-modulated PFC neurons have significantly lower spatial tuning than SWR-unmodulated neurons, the evidence presented does not support the main conclusion of the paper that only the unmodulated neurons are involved in spatial tuning and signaling upcoming choice, implying that SWR-modulated neurons are not involved in predicting upcoming choice, as stated in the abstract. This conclusion makes a categorical distinction between two neuronal populations, that SWR-unmodulated neurons are involved and SWR-modulated neurons are not involved in predicting upcoming choice, which requires evidence that clearly shows this absolute distinction. However, in the analyses showing non-local population decoding in PFC for predicting upcoming choice, the results show that SWR-unmodulated neurons have higher firing rates than SWR-modulated neurons, which is not a categorical distinction. Higher firing rates do not imply that only SWR-unmodulated neurons are contributing to the non-local decoding. They may contribute more than SWR-modulated neurons, but there are no follow-up analyses to assess the contribution of the two sub-populations to non-local decoding.

      We agree with the reviewer that this is indeed not a categorical distinction, and do not wish to claim that the SWR-modulated neurons have absolutely no role in non-local decoding and signaling upcoming choice. We have adjusted this in the title, abstract and text to clarify this for the reader. Furthermore, we have performed additional analyses to elucidate the role of SWR-modulated neurons in non-local decoding by creating separate decoding models for SWR-modulated and unmodulated PFC neurons respectively. These analyses show that the SWR-unmodulated neurons are indeed encoding representations of the upcoming choice more often than the alternative choice, whereas the SWR-modulated neurons do not reliably differentiate the upcoming and alternative choices in non-local decoding at the choice point (see new Fig 4d).

      (2) Further, the results show that during non-local representations of the hippocampus of the upcoming options, SWR-excited PFC neurons were more active during hippocampal representations of the upcoming choice, and SWR-inhibited PFC neurons were less active during hippocampal representations of the alternative choice. This clearly suggests that SWR-modulated neurons are involved in signaling upcoming choice, at least during hippocampal non-local representations, which contradicts the main conclusion of the paper.

      This does not contradict the main conclusion of the paper, but in fact strengthens the hypothesis we are putting forward: that the SWR-modulated neurons are more linked to the hippocampal non-local representations, whereas the SWR-unmodulated neurons seem to have their own encoding of upcoming choice which is not linked to the signatures in the hippocampus (almost no time overlap with hippocampal representations, no phase locking to hippocampal theta, no time locking to hippocampal SWRs, no increased firing during hippocampal representations of upcoming choice).

      (3) Similarly, one of the analyses shows that PFC nonlocal representations show no preference for hippocampal SWRs or hippocampal theta phase. However, the examples shown for non-local representations clearly show that these decodes occur prior to the start of the trajectory, or during running on the central zone or start arm. The time period of immobility prior to the start arm running will have a higher prevalence of SWRs and that during running will have a higher prevalence of theta oscillations and theta sequences, so non-local decoded representations have to be sub-divided according to these known local-field potential phenomena for this analysis, which is not followed.

      These analyses are in fact separated based on proximity to SWRs (only segments that occurred within 2 seconds of SWR onset were included, see Methods) and theta periods respectively (selected based on a running speed of more than 5 cm/s and the absence of SWRs in the hippocampus, see Methods). We have clarified this in the main text.
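
      As a minimal illustration of the selection criteria described above (not the authors' analysis code), the sketch below splits decoded segments into an SWR-proximal set (within 2 s of an SWR onset) and a theta set (running speed above 5 cm/s with no nearby SWR). The function names, the input layout, and the `swr_free_margin` parameter used to operationalise "absence of SWRs" are assumptions made for this example.

```python
from bisect import bisect_left


def lag_to_nearest_event(t, event_times):
    """Signed lag in seconds from time t to the nearest event (t - event).

    event_times must be sorted in ascending order. Returns None if it is empty.
    """
    if not event_times:
        return None
    i = bisect_left(event_times, t)
    # Only the events immediately before and after t can be the nearest one.
    neighbours = event_times[max(i - 1, 0):i + 1]
    return min((t - e for e in neighbours), key=abs)


def split_segments(segment_times, speeds, swr_onsets,
                   swr_window=2.0, min_speed=5.0, swr_free_margin=0.5):
    """Return (near_swr, theta) lists of segment times.

    swr_window (s) and min_speed (cm/s) follow the values quoted in the
    response; swr_free_margin (s) is a hypothetical choice for this sketch.
    """
    near_swr, theta = [], []
    for t, speed in zip(segment_times, speeds):
        lag = lag_to_nearest_event(t, swr_onsets)
        if lag is not None and abs(lag) <= swr_window:
            near_swr.append(t)
        swr_free = lag is None or abs(lag) > swr_free_margin
        if speed > min_speed and swr_free:
            theta.append(t)
    return near_swr, theta
```

      With the defaults above, the two sets are mutually exclusive whenever `swr_free_margin` is set equal to `swr_window`; a smaller margin, as in the default, lets a running segment near an SWR count towards both analyses.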

      (4) The primary phenomenon that the manuscript relies on is the modulation of PFC neurons by hippocampal SWRs, so it is necessary to perform the PFC population decoding analyses during SWRs (or examine non-local decoding that occurs specifically during SWRs), as reported in previous studies of PFC reactivation during SWRs, to see if there is any distinction between modulated and unmodulated neurons in this reactivation. Even in the case of independent PFC reactivation as reported by one study, this PFC reactivation was still reported to occur during hippocampal SWRs, therefore decoding during SWRs has to be examined. Similarly, the phenomenon of theta cycle skipping is related to theta sequence representations, so decoding during PFC and hippocampal theta sequences has to be examined before coming to any conclusions.

      The histograms shown in Figure 5a (see updated Fig 5a where we look at the closest SWR in time and compare the occurrence with shuffled data) show that there is no increased prevalence of decoding upcoming and alternative choices in the PFC during hippocampal SWRs. The lack of overlap of non-local decoding between the hippocampus and PFC further shows that these non-local representations occur at different timepoints in the PFC and hippocampus (see updated Fig 4e where we added a statistical test to show the likeliness of the overlap between the decoded segments in the PFC and hippocampus). Based on the reviewer's suggestion, we have additionally decoded the information in the PFC during hippocampal SWRs exclusively, and found that the direction on the maze could not be predicted based on the decoding of SWR time points in the PFC. See figure below. Similarly, we can see from the histograms in Figure 5c that there is no phase locking to the hippocampal theta phase for non-local representations in the PFC, and in contrast there is phase locking of the hippocampal encoding of upcoming choice to the rising phase of the theta cycle (Fig 6c), further highlighting the separation between these two regions in the non-local decoding.

      Reviewer #2 (Public review):

      Summary:

      This work by den Bakker and Kloosterman contributes to the vast body of research exploring the dynamics governing the communication between the hippocampus (HPC) and the medial prefrontal cortex (mPFC) during spatial learning and navigation. Previous research showed that population activity of mPFC neurons is replayed during HPC sharp-wave ripple events (SWRs), which may therefore correspond to privileged windows for the transfer of learned navigation information from the HPC, where initial learning occurs, to the mPFC, which is thought to store this information long term. Indeed, it was also previously shown that the activity of mPFC neurons contains task-related information that can inform about the location of an animal in a maze, which can predict the animals' navigational choices. Here, the authors aim to show that the mPFC neurons that are modulated by HPC activity (SWRs and theta rhythms) are distinct from those "encoding" spatial information. This result could suggest that the integration of spatial information originating from the HPC within the mPFC may require the cooperation of separate sets of neurons.

      This observation may be useful to further extend our understanding of the dynamics regulating the exchange of information between the HPC and mPFC during learning. However, my understanding is that this finding is mainly based upon a negative result, which cannot be statistically proven by the failure to reject the null hypothesis. Moreover, in my reading, the rest of the paper mainly replicates phenomena that have already been described, with the original reports not correctly cited. My opinion is that the novel elements should be precisely identified and discussed, while the current phrasing in the manuscript, in most cases, leads readers to think that these results are new. Detailed comments are provided below.

      Major concerns:

      (1) The main claim of the manuscript is that the neurons involved in predicting upcoming choices are not the neurons modulated by the HPC. This is based upon the evidence provided in Figure 5, which is a negative result that the authors employ to claim that predictive non-local representations in the mPFC are not linked to hippocampal SWRs and theta phase. However, it is important to remember that in a statistical test, the failure to reject the null hypothesis does not prove that the null hypothesis is true. Since this claim is so central in this work, the authors should use appropriate statistics to demonstrate that the null hypothesis is true. This can be accomplished by showing that there is no effect above some size that is so small that it would make the effect meaningless (see https://doi.org/10.1177/070674370304801108).
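
      For readers unfamiliar with the equivalence-testing idea raised in this comment, one common form is the two one-sided tests (TOST) procedure: the effect is declared negligible only if it is significantly above the lower margin and significantly below the upper margin. The sketch below is an illustration under stated assumptions, not an analysis from the manuscript or the review: it uses a large-sample normal approximation, and the equivalence `margin` is a hypothetical value that would have to be justified on scientific grounds.

```python
from math import sqrt
from statistics import NormalDist, mean, stdev


def tost_two_sample(a, b, margin):
    """Two one-sided tests for equivalence of two sample means.

    Returns (difference in means, TOST p-value). A small p-value supports
    |mean(a) - mean(b)| < margin. Uses a normal approximation, so it is
    only appropriate for reasonably large samples.
    """
    diff = mean(a) - mean(b)
    se = sqrt(stdev(a) ** 2 / len(a) + stdev(b) ** 2 / len(b))
    z = NormalDist()
    p_lower = 1.0 - z.cdf((diff + margin) / se)  # H0: diff <= -margin
    p_upper = z.cdf((diff - margin) / se)        # H0: diff >= +margin
    return diff, max(p_lower, p_upper)
```

      The procedure reports the larger of the two one-sided p-values, so equivalence is claimed only when both bounds are rejected.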

      We would like to highlight a few important points here. (1) We indeed do not intend to claim that the SWR-modulated neurons are not at all involved in predicting upcoming choice, just that the SWR-unmodulated neurons may play a larger role. We have rephrased the title and abstract to make this clearer. (2) The hypothesis that we put forward is based not only on a negative effect, but on the findings that: the SWR-unmodulated neurons show higher spatial tuning (Fig 3b), more directional selectivity (Fig 3d), more frequent encoding of the upcoming choice at the choice point (new analysis, added in Fig 4d), and higher spike rates during the representations of the upcoming choice (Fig 5b). This is further highlighted by the fact that the representations of upcoming choice in the PFC are not time-locked to SWRs (whereas the hippocampal representations of upcoming choice are; see Fig 5a and Fig 6a), and not time-locked to hippocampal theta phase (whereas the hippocampal representations are; see Fig 5c and Fig 6c). Finally, the representations of upcoming and alternative choices in the PFC do not show a large overlap in time with the representations in the hippocampus (see updated Fig 4e where we added a statistical test to show the likelihood of the overlap of decoded timepoints). All these results together lead us to hypothesize that SWR-modulation is not the driving factor behind non-local decoding in the PFC. (3) Based on the reviewer's suggestion, we have added a statistical test to compare the locking of the non-local decoding to hippocampal SWRs and theta phase against that obtained with shuffled posterior probabilities. Instead of looking at all SWRs in a -2 to 2 second window, we have now only selected the closest SWR in time within that window, and did the statistical comparison in the bin of 0-20 ms from SWR onset. With this new analysis we are looking more directly at the time-locking of the decoded segments to SWR onset (see updated Fig 5a and 6a).
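
      The closest-SWR comparison described in point (3) can be illustrated with a small permutation test. This is a hedged sketch rather than the authors' implementation: it circularly shifts the decoded segment times as a simple stand-in for the shuffled posterior probabilities mentioned in the response, and the function names, the `duration` argument, and the number of shuffles are assumptions made for this example.

```python
import random
from bisect import bisect_left


def lag_to_nearest(t, events):
    """Signed lag (s) from the nearest event to time t; events sorted, non-empty."""
    i = bisect_left(events, t)
    return min((t - e for e in events[max(i - 1, 0):i + 1]), key=abs)


def fraction_in_bin(times, events, window=2.0, bin_start=0.0, bin_end=0.020):
    """Fraction of segments whose closest SWR lies within +/- window seconds
    and whose lag from that SWR onset falls in [bin_start, bin_end)."""
    lags = [lag_to_nearest(t, events) for t in times]
    lags = [lag for lag in lags if abs(lag) <= window]
    if not lags:
        return 0.0
    return sum(bin_start <= lag < bin_end for lag in lags) / len(lags)


def shuffle_test(times, events, duration, n_shuffles=1000, seed=0):
    """One-sided permutation p-value for time-locking in the 0-20 ms bin."""
    rng = random.Random(seed)
    observed = fraction_in_bin(times, events)
    exceed = 0
    for _ in range(n_shuffles):
        shift = rng.uniform(0.0, duration)
        shifted = [(t + shift) % duration for t in times]  # circular shift
        if fraction_in_bin(shifted, events) >= observed:
            exceed += 1
    # Add-one correction keeps the p-value strictly above zero.
    return observed, (exceed + 1) / (n_shuffles + 1)
```

      A circular shift preserves the spacing between decoded segments while breaking their alignment to SWR onsets, which is why it is used here as the null model; other shuffling schemes would test slightly different null hypotheses.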

      (2) The main claim of the work is also based on Figure 3, where the authors show that SWRs-unmodulated mPFC neurons have higher spatial tuning, and higher directional selectivity scores, and a higher percentage of these neurons show theta skipping. This is used to support the claim that SWRs-unmodulated cells encode spatial information. However, it must be noted that in this kind of task, it is not possible to disentangle space and specific task variables involving separate cognitive processes from processing spatial information such as decision-making, attention, motor control, etc., which always happen at specific locations of the maze. Therefore, the results shown in Figure 3 may relate to other specific processes rather than encoding of space and it cannot be unequivocally claimed that mPFC neurons "encode spatial information". This limitation is presented by Mashoori et al (2018), an article that appears to be a major inspiration for this work. Can the authors provide a control analysis/experiment that supports their claim? Otherwise, this claim should be tempered. Also, the authors say that Jadhav et al. (2016) showed that mPFC neurons unmodulated by SWRs are less tuned to space. How do they reconcile it with their results?

      The reviewer is right to urge caution when talking about claims such as spatial tuning where other factors may also be involved. Although we agree that there may be some other factors influencing what we are seeing as spatial tuning, it is very important to note that the behavioral task is executed on a symmetrical 4-armed maze, where two of the arms are always used for the start of the trajectory, and the other two arms (North and South) function as the goal (reward) arms. Therefore, if the PFC is encoding cognitive processes such as task phases related to decision-making and reward, we would not be able to differentiate between the two start arms and the two goal arms, as these represent the same task phases. Note also that the North and South arms are illuminated in a pseudo-random order between trials and during cue-based rule learning this is a direct indication of where the reward will be found. Even in this phase of the task, the PFC encodes where the animal will turn on a trial-to-trial basis (meaning the North and South arms are still differentiated correctly on each trial even though the illumination and associated reward are changing).

      Secondly, importantly, the reviewer mentions that we claimed that Jadhav et al. (2016) showed that mPFC neurons unmodulated by SWRs are less tuned to space, but this is incorrect. Jadhav et al. (2016) showed that SWR-unmodulated neurons had lower spatial coverage, meaning that they are more spatially selective (congruent with our results). We have rephrased this in the text to be clearer.

      (3) My reading is that the rest of the paper mainly consists of replications or incremental observations of already known phenomena with some not necessarily surprising new observations:

      (a) Figure 2 shows that a subset of mPFC neurons is modulated by HPC SWRs and theta (already known), that vmPFC neurons are more strongly modulated by SWRs (not surprising given anatomy), and that theta phase preference is different between vmPFC and dmPFC (not surprising given the fact that theta is a travelling wave).

      The finding that vmPFC neurons are more strongly modulated by SWRs than dmPFC indeed matches what we know from anatomy, but that does not make it a trivial finding. A lot remains unknown about the mPFC subregions and their interactions with the hippocampus, and not every finding will be directly linked to the anatomy. Therefore, in our view this is a significant finding which has not been studied before due to the technical complexity of large-scale recordings along the dorsal-ventral axis of the mPFC.

      Similarly, theta being a traveling wave (which in itself is still under debate), does not mean we should assume that the dorsal and ventral mPFC should follow this signature and be modulated by different phases of the theta cycle. Again, in our view this is not at all trivial, but an important finding which brings us closer to understanding the intricate interactions between the hippocampus and PFC in spatial learning and decision-making.

      (b) Figure 4 shows that non-local representations in mPFC are predictive of the animal's choice. This is mostly an increment to the work of Mashoori et al (2018). My understanding is that in addition to what had already been shown by Mashoori et al here it is shown how the upcoming choice can be predicted. The author may want to emphasize this novel aspect.

      In our view our manuscript focuses on a completely different aspect of learning and memory than the paper the reviewer is referring to (Mashoori et al. 2018). Importantly, the Mashoori et al. paper looked at choice evaluation at reward sites and shows that disappointing reinforcements are associated with reactivations in the ACC of the unselected target. This points to the role of the ACC in error detection and evaluation. Although this is an interesting result, it is in essence unrelated to what we are focusing on here, which is decision making and prediction of upcoming choices. The fact that the turning direction of the animal can be predicted on a trial-to-trial basis, and even precedes the behavioral change over the course of learning, sheds light on the role of the PFC in these important predictive cognitive processes (as opposed to post-choice reflective processes).

      (c) Figure 6 shows that prospective activity in the HPC is linked to SWRs and theta oscillations. This has been described in various forms since at least the works of Johnson and Redish in 2007, Pastalkova et al 2008, and Dragoi and Tonegawa (2011 and 2013), as well as in earlier literature on splitter cells. These foundational papers on this topic are not even cited in the current manuscript.

      We have added these citations to the introduction (line 37).

      Although some previous work is cited, the current narrative of the results section may lead the reader to think that these results are new, which I think is unfair. Previous evidence of the same phenomena should be cited all along the results and what is new and/or different from previous results should be clearly stated and discussed. Pure replications of previous works may actually just be supplementary figures. It is not fair that the titles of paragraphs and main figures correspond to notions that are well established in the literature (e.g., Figure 2, 2nd paragraph of results, etc.).

      We have changed the title of paragraph 2 and Figure 2 to highlight more clearly the novel result (the difference between the dorsal and ventral mPFC), and have improved clarity of the text throughout to highlight the novelty of our results better.

      (d) My opinion is that, overall, the paper gives the impression of being somewhat rushed and lacking attention to detail. Many figure panels are difficult to understand due to incomplete legends and visualizations with tiny, indistinguishable details. Moreover, some previous works are not correctly cited. I tried to make a list of everything I spotted below.

      We have addressed all the comments in the Recommendations for Authors.

      Reviewer #1 (Recommendations for the authors):

      (1) Expanding on the points above, one of the strengths of the study is expanding the previous result that SWR-unmodulated neurons are more spatially selective (Jadhav et al., 2016), across prefrontal sub-regions, and showing that these neurons are more directionally selective and show more theta cycle skipping. Theta cycle skipping is related to theta sequence representations and previous studies have established PFC theta sequences in parallel to hippocampal theta sequences (Tang et al., 2021; Hasz and Redish, 2020; Wang et al., 2024), and the theta cycle skipping result suggests that SWR-unmodulated neurons should show stronger participation than SWR-modulated neurons in PFC theta sequences that decode to upcoming or alternative location, which can be tested in this high-density PFC physiology data. This is still unlikely to make a categorical distinction that only SWR-unmodulated neurons participate in theta sequence decoding, but will be useful to examine.

      We thank the reviewer for their suggestion and have now included results based on separate decoding models that only use SWR-modulated or SWR-unmodulated mPFC neurons. From this analysis we see that indeed SWR-unmodulated neurons are not the only group contributing to theta sequence decoding, but they do distinguish more strongly between the upcoming and alternative arms at the choice point (see new Fig 4d).

      (2) Non-local decoding in 50ms windows on a theta timescale is a valid analysis, but ignoring potential variability in the internal state during running vs. immobility, and as indicated by LFPs by the presence of SWRs or theta oscillations, is incorrect especially when conclusions are being made about decoding during SWRs and theta oscillation phase, and in light of previous evidence that these are distinct states during behavior. There are multiple papers on PFC theta sequences (Tang et al., 2021; Hasz and Redish, 2020; Wang et al., 2024), and on PFC reactivation during SWRs (Shin et al., 2019; Kaefer et al., 2020; Jarovi et al., 2023), and this dataset of high-density prefrontal recordings using Neuropixels 1.0 provides an opportunity to investigate these phenomena in detail. Here, it should be noted that although Kaefer et al. reported independent prefrontal reactivation from hippocampal reactivation, these PFC reactivation events still occurred during hippocampal SWRs in their data, and were linked to memory performance.

      From our data we see that the time segments that represent upcoming or alternative choice in the prefrontal cortex are in fact not time-locked to hippocampal SWRs (updated Fig 5a where we look only at the closest SWR in time and compare this to shuffled data). In addition, these segments do not overlap much with the decoded segments in the hippocampus (see updated Fig 4e where we added a shuffling procedure to assess the likelihood of the overlap with hippocampal decoded segments). Importantly, we are not ignoring the variability during running and immobility, as theta segments were selected based on a running speed of more than 5 cm/s and the absence of SWRs in the hippocampus (see Methods), ensuring that the theta and SWR analyses were done on the two different behavioral states respectively. We have clarified this in the main text.
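
      To illustrate the kind of shuffling procedure referred to above, a minimal sketch follows. It is not the analysis code used in the study: the binned representation of decoded segments, the circular-shift null distribution, and the names `overlap_count`, `circular_shift` and `shuffle_p_value` are all assumptions made for this example.

```python
import random


def overlap_count(a, b):
    """Number of time bins in which both regions have a decoded segment."""
    return sum(1 for x, y in zip(a, b) if x and y)


def circular_shift(seq, k):
    """Rotate a list of bins by k positions, wrapping around the session end."""
    k %= len(seq)
    return seq[-k:] + seq[:-k] if k else list(seq)


def shuffle_p_value(a, b, n_shuffles=999, seed=0):
    """One-sided shuffle test: is the observed overlap larger than chance?

    `a` and `b` are equal-length lists of booleans (hypothetical format),
    one entry per time bin. The null distribution is built by circularly
    shifting `a` by a random non-zero offset, which keeps the duration and
    spacing of its segments but breaks their alignment with `b`.
    """
    if len(a) != len(b) or len(a) < 2:
        raise ValueError("inputs must have the same length (at least 2 bins)")
    rng = random.Random(seed)
    observed = overlap_count(a, b)
    n_at_least = 0
    for _ in range(n_shuffles):
        shift = rng.randrange(1, len(a))
        if overlap_count(circular_shift(a, shift), b) >= observed:
            n_at_least += 1
    # Add-one correction so the estimated p-value is never exactly zero.
    return observed, (n_at_least + 1) / (n_shuffles + 1)
```

      A large p-value from such a test would indicate that the overlap between prefrontal and hippocampal segments is no greater than expected from their overall rates alone.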

      (3) The majority of rodent studies make the distinction between ACC, PrL, and IL, although as the authors noted, there are arguments that rodent mPFC is a continuum (Howland et al., 2022), or even that rodent mPFC is a unitary cingulate cortical region (van Heukelum et al., 2020). The authors choose to present the results as dorsal (ACC + dorsal PrL) vs. ventral mPFC (ventral PrL + IL), however, in my opinion, it will be more useful to the field to see results separately for ACC, PrL, and IL, given the vast literature on connectivity and functional differences in these regions.

      We appreciate the reviewer’s suggestion. Initially, we did perform all analyses separately for the ACC, PLC and ILC subregions. However, we observed that the differences between subregions (strength of SWR-modulation and the phase locking to theta) varied uniformly along the dorsal-ventral axis, i.e., the PLC showed a profile of SWR-modulation and theta phase locking that fell in between that of the ACC and the ILC. This is also highlighted in paragraph 3 of the introduction (lines 52-56). For that reason, and for the sake of reducing the number of variables, increasing statistical power, and improving readability, we focused on the dorsal-ventral distinction instead, as this is where the main differences were seen.

      (4) I suggest that the authors refrain from making categorical distinctions as in their title and abstract, such as "neurons that are involved in predicting upcoming choice are not the neurons that are modulated by hippocampal sharp-wave ripples" when the evidence presented can only support gradation of participation of the two neuronal sub-populations, not an absolute distinction. The division of SWR-modulated and SWR-unmodulated neurons itself is determined by the statistic chosen to divide the neurons into one or two sub-classes and will vary with the statistical threshold employed. Further, previous studies have suggested that SWR-excited and SWR-inhibited neurons comprise distinct functional sub-populations based on their activity properties (Jadhav et al., 2016; Tang et al., 2017), but it is not clear to what degree is SWR-modulated neurons a distinct and singular functional sub-population. In the absence of connectivity information and cross-correlation measures within and across sub-populations, it is prudent to be conservative about this interpretation of SWR-unmodulated neurons.

      We agree with the reviewer that the distinction is not categorical and have changed the wording in the title and abstract. We also do not intend to claim that the SWR-modulated neurons are a distinct and singular functional sub-population, and for that reason the firing rates from the SWR-excited and SWR-inhibited groups are reported separately throughout the paper.

      Reviewer #2 (Recommendations for the authors):

      Minor detailed remarks:

      (1) The authors should provide a statistical test, perhaps against shuffled data, for Figures 5a,c and 6a,c.

      We thank the reviewer for their suggestion and have added statistical tests in Figures 5a, 5c, 6a and 6c.

      (2) The behavioral task is explained only in the legend of Figure 1c, and the explanation is quite vague. In this type of article format, readers need to have a clear understanding of the task without having to refer to the methods section. A clear understanding of the task is crucial for interpreting all subsequent analyses. In my opinion, the word 'trial' in the figure is misleading, as these are sessions composed of many trials.

      We have added a more thorough description of the behavioral task, both in the main text and the Figure legend.

      (3) Figure 1d, legend of markers missing.

      We have added a legend for the markers.

      (4) When there are multiple bars and a single p-value is presented, it is unclear which group comparisons the p-value pertains to. For instance, Figures 2c-f and 3b, d, f (right parts), and 5b...

      For all p-values we have added lines to the figures that indicate the groups that were compared and have added descriptions of the statistical test to the figure legends to indicate what each p-value represents.

      (5) In Figure 3c, the legend does not explain what the colored lines represent, and the lines themselves are very small and almost indistinguishable.

      We have changed the colored lines to quadrants on the maze to clarify what each direction represents.

      (6) Figure 4a is too small, and the elements are so tiny that it is impossible to distinguish them and their respective colors. The term 'segment' has not been unequivocally explained in the text. All the different elements of the panel should be explicitly explained in the legend to make it easily understandable. What do the pictograms of the maze on the left represent? What does the dashed vertical line indicate?

      We have added the definition of a segment in the text (lines 283-286) and have improved the clarity and readability of Figure 4a.

      (7) In Figure 5, what do the red dots on the right part relate to? The legend should explicitly explain what is shown in the left and right parts, respectively. What comparisons do the p-values relate to?

      We have adjusted the legend to explain the left and right parts of the figure and we have added the statistical test that was used to get to the p-value (in addition to the text which already explained this).

      (8) Panels b of Figures 5 and 6 should have the same y-axis scale for comparison. The position of the p-values should also be consistent. With the current arrangement in Figure 6, it is unclear what the p-values relate to.

      We have adjusted the y-scale to be the same for Figures 5 and 6, and we have added a description of the statistical test to the legend.

      (9) Multiple studies have previously shown that mPFC activity contains spatial information (e.g., refs 24-27). It is important that, throughout the paper, the authors frame their results in relation to previous findings, highlighting what is novel in this work.

      We thank the reviewer for this valuable suggestion. In the revised manuscript, we have indicated more clearly which results replicate previous findings and highlighted novel results.

      (10) Please note that Peyrache et al. (2009) do not show trajectory replay, nor do they decode location. I am not familiar with all the cited literature, but this makes me think that the authors may want to double-check their citations to ensure they assign the correct claims to each past work.

      We have adjusted the reference to the work to exclude the word ‘trajectory’ and double-checked our other citations.

      (11) The authors perform theta-skipping analysis, first described by Kay et al., but do not cite the original paper until the discussion.

      Thank you for pointing out this oversight. We have now included this citation earlier in the paper (line 231).

      (12) Additionally, some parts of the text are difficult to grasp, and there are English vocabulary and syntax errors. I am happy to provide comments on the next version of the text, but please include page and line numbers in the PDF. The authors may also consider using AI to correct English mistakes and improve the fluency and readability of their text.

      We have carefully gone through the text to correct any errors. We have now also included page and line numbers and we will be happy to address any specific issues the reviewer may spot in the revised manuscript.

    1. Author response:

      The following is the authors’ response to the original reviews

      Public Reviews: 

      Reviewer #1 (Public review): 

      This study presents evidence that remote memory in the APP/PS1 mouse model of Alzheimer's disease (AD) is associated with PV interneuron hyperexcitability and increased inhibition of cortical engram cells. Its strength lies in the fact that it explores a neglected aspect of memory research - remote memory impairments related to AD (for which the primary research focus is usually on recent memory impairments) - which has received minimal attention to date. While the findings are intriguing, the weakness of the paper hovers around purely correlational types of evidence and superficial data analyses, which require substantial revisions as outlined below. 

      We thank the reviewer for their feedback, and we appreciate the recognition of the study’s novelty in addressing remote memory impairments in AD. We acknowledge the reviewer’s concerns and have implemented revisions to strengthen the manuscript.

      Major concerns: 

      (1) In light of previous work, including that by the authors themselves, the data in Figure 1 should be implemented by measurements of recent memory recall in order to assess whether remote memories are exclusively impaired or whether remote memory recall merely represents a continuation of recent memory impairments.

      We agree with the reviewer that this is an important point. In line with their suggestion in minor comment 1, we have now omitted the statement on recent memory in the results (previously on lines 109-111 and 117). Nonetheless, previous independent experiments from our group have repeatedly shown recent memory deficits in APP/PS1 mice at 12 weeks of age, including a recent article published in 2023. We refer the reviewer to figure 2c in Végh et al. (2014) and figure 2i in Kater et al. (2023). We have added a reference to the latter paper to our discussion section (line 458-459). Therefore, we are confident that the recent memory deficit at 12 weeks of age is a stable phenotype in our APP/PS1 mice.

      With these data in mind, we argue that the remote memory recall impairment is not a continuation of recent memory impairments. Recent memory deficits emerge already at 12 weeks of age, and when remote memory is assessed at 16 weeks (4 weeks after training at 12 weeks of age), APP/PS1 mice are still capable of forming and retrieving a remote memory. This suggests that remote memory retrieval can occur even when recent memory is compromised, arguing against the idea that the remote memory deficit observed at 20 weeks is a continuation of earlier recent memory impairments. We have clarified this point in the revised manuscript by adding the following sentence to the discussion section (line 462-465): 

      ‘This suggests that a remote memory can be formed even when recent memory expression is already compromised, indicating that the remote memory deficit in 20-week-old APP/PS1 mice is not a continuation of earlier recent memory impairments.’

      (2) Figure 2 shows electrophysiological properties of PV cells in the mPFC that correlate with the behavior shown in Figure 1. However, the mice used in Figure 2 are different than the mice used in Figure 1. Thus, the data are correlative at best, and the authors need to confirm that behavioral impairments in the APP/PS1 mice crossed to PV-Cre (and SST-Cre mice) used in Figure 2 are similar to those of the APP/PS1 mice used in Figure 1. Without that, no conclusions between behavioral impairments and electrophysiological as well as engram reactivation properties can be made, and the central claims of the paper cannot be upheld. 

      We thank the reviewer for raising this concern. Indeed, the remote memory impairment and PV hyperexcitability are correlative data, and therefore we do not make causal claims based on these data. However, please note that most of our key findings, including behavioural impairments, characterization of the engram ensemble and reactivation thereof, as well as inhibitory input measurements, were acquired using the same mouse line (APP/PS1), strengthening the coherence of our conclusions. Also, our electrophysiological findings in APP/PS1 (enhanced sIPSC frequency) and APP/PS1-PV-Cre-tdTomato (enhanced PV cell excitability) mice align well. Direct comparisons between the transgenic mouse lines APP/PS1 and APP/PS1 Parv-Cre were performed in our previous studies, confirming that these lines are similar in terms of behaviour and pathology. Specifically, we demonstrated that APP/PS1 mice display spatial memory impairments at 16 weeks of age, Fig 4a-d, consistent with the deficits observed in APP/PS1 Parv-Cre mice at 16 weeks of age, Fig 5a-c (Hijazi et al., 2020a). Additionally, Hijazi et al. (2020a) showed that soluble and insoluble Aβ levels do not differ between APP/PS1 Parv-Cre and APP/PS1 mice (sFig. 1), indicating comparable levels of pathology between these lines. While we do not have a similar characterization of the APP/PS1 SST-Cre line, we should mention that we also did not observe excitability differences in SST cells. We now acknowledge the limitation in the revised discussion section (line 480-487), and stress that our electrophysiology and behavioural findings are correlative in nature:

      ‘Although the excitability measurements were performed in APP/PS1-PV-Cre-tdTomato mice, and not in the APP/PS1 parental line, we previously found that these transgenic mouse lines exhibit comparable amyloid pathology (both soluble and insoluble amyloid beta levels) as well as similar spatial memory deficits (Hijazi et al., 2020a; Kater et al., 2023). Thus, our observations indicate that the APP/PS1 PV-Cre-tdTomato and APP/PS1 lines are similar in terms of pathology and behaviour. Nonetheless, further work is needed to identify a causal link between PV cell hyperexcitability and remote memory impairment.’ 

      (3) The reactivation data starting in Figure 3 should be analysed in much more depth: 

      a) The authors restrict their analysis to intra-animal comparisons, but additional ones should be performed, such as inter-animal (WT vs APP/PS1) as well as inter-age (12-16w vs 16-20w). In doing so, reactivation data should be normalized to chance levels per animal, to account for differences in labelling efficiency - this is standard in the field (see original Tonegawa papers and for a reference). This could highlight differences in total reactivation that are already apparent, such as for instance in WT vs APP/PS1 at 20w (Figure 3o) and highlight a decrease in reactivation in AD mice at this age, contrary to what is stated in lines 213-214. 

      We would like to thank the reviewer for the valuable input on the reactivation data in Figure 3. 

      We agree with the reviewer and now depict the data as normalized to chance levels (Figure 3). The original figures are now supplemental (sFig. 5). The reactivation data normalized to chance are similar to the original results, i.e. no difference was observed in the reactivation of the mPFC engram ensemble between genotypes. The reviewer may have overlooked that we did perform inter-animal (WT vs. APP/PS1) comparisons, however these were not significantly different. We have made this clearer in the main text, lines 277, 288-289, 294-295 and 303-304. Moreover, the reviewer recommended including inter-age group comparisons, which have now been added to the supplemental figures (sFig. 6). No genotype-dependent differences were observed. While a main effect of age group did emerge, indicating that there is a potential increased overlap between Fos+ and mCherry+ in animals aged 16-20 weeks, we caution against overinterpreting this finding. These experimental groups were processed in separate cohorts, with viral injection and 4TM-induced tagging performed at different moments in time, which may have contributed to the observed differences in overlap. We have addressed this point in the revised discussion (line 612-617):

      ‘Furthermore, we also observed an increase in the amount of overlap between Fos+ and mCherry+ engram cells when comparing the 12-16w and 16-20w age groups. This finding should be interpreted with caution, as the experimental groups were processed in separate cohorts, with viral injections and 4TM-induced tagging performed at different moments in time. This may have contributed to the observed differences between ages.’
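
      As a worked illustration of normalizing reactivation to chance, one common convention divides the observed fraction of double-positive cells by the product of the two single-label fractions. The sketch below assumes that convention; the function name and the cell counts are hypothetical and are not taken from the study.

```python
def normalized_reactivation(n_double, n_tagged, n_fos, n_total):
    """Observed overlap divided by the overlap expected by chance.

    Chance level assumes tagging (mCherry+) and recall-induced activity
    (Fos+) are independent:
        chance   = (n_tagged / n_total) * (n_fos / n_total)
        observed = n_double / n_total
    The ratio simplifies to n_double * n_total / (n_tagged * n_fos).
    A value of 1.0 means overlap at chance; values above 1.0 mean tagged
    cells are reactivated more often than expected by chance.
    """
    if n_tagged == 0 or n_fos == 0:
        raise ValueError("chance level is undefined without tagged and Fos+ cells")
    return (n_double * n_total) / (n_tagged * n_fos)


# Hypothetical animal: 1000 counted neurons, 100 mCherry+, 200 Fos+, 40 both.
example = normalized_reactivation(n_double=40, n_tagged=100, n_fos=200, n_total=1000)
```

      Because the ratio is computed per animal, differences in labelling efficiency between animals or cohorts change the chance level along with the observed overlap, which is the motivation for this normalization.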

      b) Comparing the proportion of mcherry+ cells in PV- and PV+ is problematic, considering that the PV- population is not "pure" like the PV+, but rather likely to represent a mix of different pyramidal neurons (probably from several layers), other inhibitory neurons like SST and maybe even glial cells. Considering this, the statement on line 218 is misleading in saying that PVs are overrepresented. If anything, the same populations should be compared across ages or groups.  

      We thank the reviewer for their insightful comment and agree that the PV- population of cells is likely more heterogeneous than the PV+ population. However, we would like to clarify that all quantified cells were selected based on Nissl immunoreactivity, and to exclude non-neuronal cells, stringent thresholding was applied in the script that was used to identify Nissl+ cells. The threshold information has now been added to the methods section (line 758-760). Thus, although heterogeneous, the analysed PV- population reflects a neuronal subset. In response to the reviewer’s suggestion, we have now included overlap measurements relative to chance levels (Figure 3). These analyses did not reveal differences with the original analyses, i.e., there are no genotype-specific differences. We have also incorporated the suggested inter-age group comparisons (sFig. 6) and found no differences between age groups. In light of the raised concerns, we have removed the statement that PV cells were overrepresented in the engram ensemble.

      c) A similar concern applies to the mcherry- population in Figure 4, which could represent different types of neurons that were never active, compared to the relatively homogeneous engram mcherry+ population. This could be elegantly fixed by restricting the comparison to mCherry+Fos+ vs mCherry+Fos- ensembles and could indicate engram reactivation-specific differences in perisomatic inhibition by PV cells. 

      The comparison the reviewer suggests, comparing mCherry+Fos+ to mCherry+Fos-, is indeed conceptually interesting and could provide more insight into engram reactivation and PV input. However, there are practical limitations to performing this analysis, as neurons in close proximity need to be compared in a pairwise manner to account for local variability in staining intensity. As shown in Figure 3c+k and Figure 4a+b, d+e, PV immunostaining intensity varies to a certain extent within a given image. While pairwise comparisons of neighbouring neurons were feasible when analysing mCherry+ and mCherry- cells, they are unfortunately not feasible for the mCherry+Fos+ vs. mCherry+Fos- comparison. The occurrence of spatially adjacent mCherry+Fos+ and mCherry+Fos- neurons is too sparse for a pairwise comparison. This analysis would therefore result in substantial under-sampling and limit the reliability of the analysis. Nonetheless, we agree with the reviewer that the mCherry- population may be more heterogeneous than the mCherry+ population, despite the fact that PV+ neurons and non-neuronal cells were excluded from both populations in the analyses. We therefore added a statement to the discussion to acknowledge this limitation (line 536-539): 

      ‘Although PV+ cells were not included in this analysis and we excluded non-neuronal cells based on the area of the Nissl stain, the mCherry- population was potentially more heterogeneous than the mCherry+ population, which may have contributed to the differences we observed.’

      (4) At several instances, there are some doubts about the statistical measures having been employed: 

      a) In Figure 4f, it is unclear why a repeated measurement ANOVA was used as opposed to a regular ANOVA. 

      b) In Supplementary Figure 2b, a Mann-Whitney test was used, supposedly because the data were not normally distributed. However, when looking at the individual data points, the data does seem to be normally distributed. Thus, the authors need to provide the test details as to how they measured the normalcy of distribution. 

      a) Based on the pairwise comparison of neighbouring neurons within animals, the data in Figure 4f was analysed with a repeated measure ANOVA. 

      b) We thank the reviewer for their comment on Supplementary Figure 2b. The data is indeed normally distributed, and we have analysed it using a D’Agostino & Pearson test. We have corrected this in the supplemental figure. 

      Minor concerns: 

      (1) Line 117: The authors cite a recent memory impairment here, as shown by another paper. However, given the notorious difficulty in replicating behavioral findings, in particular in APP/PS1 mice (number of backcrossings, housing conditions, etc., might differ between laboratories), such a statement cannot be made. The authors should either show in their own hands that recent memory is indeed affected at 12 weeks of age, or they should omit this statement. 

      We thank the reviewer for this thoughtful comment. As noted in our response to major concern (1), we have addressed this concern by providing additional information and clarification in the discussion (line 462-465) regarding the possibility that remote memory impairments are a continuation of recent memory impairments. As mentioned in our response, we have added a reference to a more recent study from our lab (Kater et al., 2023). These findings are consistent with the earlier report from our lab (Végh et al., 2014), underscoring the reproducibility of this phenotype across independent cohorts and time. Notably, the experiments in the 2023 and present study were performed using the same housing and experimental conditions. Nevertheless, in light of the reviewer’s suggestion, and to avoid overstatement or speculation, we have now omitted the sentence referring to recent memory impairments at 12 weeks of age from the results section.

      (2) Pertaining to Figure 3, low-resolution images of the mPFC should be provided to assess the spread of injection and the overall degree of double-positive cells.  

      We agree with the reviewer and have added images of the mPFC as a supplemental figure (sFig. 3) that show the spread of the injection. Unfortunately, it is not possible to visualize the overall degree of double-positive cells at a lower magnification (or low-resolution). Representative examples of colocalization are presented in Figure 3.

      Reviewer #2 (Public review): 

      This study presents a comprehensive investigation of remote memory deficits in the APP/PS1 mouse model of Alzheimer's disease. The authors convincingly show that these deficits emerge progressively and are paralleled by selective hyperexcitability of PV interneurons in the mPFC. Using viral-TRAP labeling and patch-clamp electrophysiology, they demonstrate that inhibitory input onto labeled engram cells is selectively increased in APP/PS1 mice, despite unaltered engram size or reactivation. These findings support the idea that alterations in inhibitory microcircuits may contribute to cognitive decline in AD. 

      However, several aspects of the study merit further clarification. Most critically, the central paradox, i.e., increased inhibitory input without an apparent change in engram reactivation, remains unresolved. The authors propose possible mechanisms involving altered synchrony or impaired output of engram cells, but these hypotheses require further empirical support. Additionally, the study employs multiple crossed transgenic lines without reporting the progression of amyloid pathology in the mPFC, which is important for interpreting the relationship between circuit dysfunction and disease stage. Finally, the potential contribution of broader network dysfunction, such as spontaneous epileptiform activity reported in APP/PS1 mice, is also not addressed. 

      We thank the reviewer for their evaluation and appreciate the positive assessment of our study’s contribution to understanding remote memory deficits and the dysfunction of inhibitory microcircuits in AD. We also acknowledge the relevant points raised and have revised the manuscript to clarify our interpretations. 

      Recommendations for the authors:

      Reviewer #1 (Recommendations for the authors): 

      (1) Line 68: What are "APP23xPS45" mice? This is most likely a typo.

      This line is a previously reported double transgenic amyloid beta mouse model that was obtained by crossing APP23 (overexpressing human amyloid precursor protein with the Swedish double mutation at position 670/671) with PS45 (carrying a transgene for mutant Presenilin 1, G384A mutation) (Busche et al., 2008; Grienberger et al., 2012). 

      (2) Line 148: The authors should also briefly describe in the main text that APP/PS1 x SST-Cre mice were generated and used here.  

      We thank the reviewer for their comment and have added their suggestion to the main text (line 166-168):

      ‘To do this, APP/PS1 mice were crossed with SST-Cre mice to generate APP/PS1 SST-Cre mice. Following microinjection of AAV-hSyn::DIO-mCherry into the mPFC, recordings were obtained from SST neurons.’

      (3) The discussion should be condensed because of redundancies on several occasions. For example, memory allocation is discussed starting on line 371, then again on line 392. This should be combined. Likewise, how the correlative nature of the findings about PV interneurons could be further functionally addressed is discussed on lines 413 and 454, and should be condensed into one paragraph. 

      We thank the reviewer for this suggestion and have revised the discussion to remove the redundancies as proposed.  

      Reviewer #2 (Recommendations for the authors): 

      To strengthen the manuscript, the following points should be addressed: 

      (1) Quantify amyloid pathology: It is essential to assess amyloid-β levels (soluble and insoluble) in the mPFC of APP/PS1-PV-Cre-tdTomato mice at the studied ages. This would help determine whether the observed circuit-level changes track with disease progression as seen in canonical APP/PS1 models. 

      We thank the reviewer for this valuable suggestion and agree that assessing Aβ levels in the mPFC is important to determine whether the observed circuit level alterations in APP/PS1 mice coincide with the progression of amyloid pathology. Therefore, we assessed the amyloid plaque load in the mPFC of APP/PS1 mice at 16 and 20 weeks of age (new supplemental figure sFig. 1) and observed no difference in plaque load between these two time points. This suggests that the increased excitability in the mPFC cannot be attributed to differences in plaque load (insoluble amyloid beta).

      In line with this, we previously studied both soluble and insoluble Aβ levels in the CA1 and reported that there are no differences between 12 and 16 weeks of age (Kater et al., 2023), while PV cell hyperexcitability is present at 16 weeks of age (Hijazi et al., 2020a). From 24 weeks onwards, the level of amyloid beta increases. Similarly, Végh et al. (2014) showed using immunoblotting that monomeric and low molecular weight oligomeric forms of soluble Aβ are already present as early as 6 weeks of age and become more prominent at 24 weeks of age. Although the soluble Aβ measurements were performed in the hippocampus, we think these findings can be extrapolated to cortical regions, as the APP and PS1 mutations in APP/PS1 mice are driven by a prion promoter, which should induce consistent expression across brain regions. Data from other research groups support this hypothesis (Kim et al., 2015; Zhang et al., 2011). Thus, large regional differences in soluble Aβ are not expected. The temporal progression suggests that increasing levels of soluble amyloid beta might contribute to the emergence of PV cell hyperexcitability. We have added this point to the manuscript (line 585-591):

      ‘Since amyloid beta plaque load in the mPFC remains comparable between 16- and 20-week-old APP/PS1 mice, the observed increased excitability is unlikely the result of changes in insoluble amyloid beta levels. Previous data from our lab show that soluble amyloid beta is already present as early as 6 weeks of age and becomes more prominent at 24 weeks of age (Kater et al., 2023; Végh et al., 2014). The progressive increase in soluble amyloid beta levels may contribute to the emergence of PV cell hyperexcitability.’

      Finally, we previously compared soluble and insoluble amyloid beta levels in APP/PS1 and APP/PS1 Parv Cre mice and show that these are similar (Hijazi et al., 2020a). While our current study shows the progression of amyloid beta accumulation in APP/PS1 mice, these mice also exhibit altered microcircuitry (enhanced sIPSC frequency on engram cells) at 20 weeks of age, the same age at which we observed PV cell hyperexcitability in APP/PS1 Parv Cre tdTomato mice. This further supports the generalizability of our findings across genotypes, between APP/PS1 and APP/PS1 Parv Cre tdTomato mice. 

      (2) Examine later disease stages: Since the current effects are modest, assessing memory performance, PV cell excitability, and engram inhibition at more advanced stages could clarify whether these alterations become more pronounced with disease progression. 

      We thank the reviewer for this thoughtful suggestion. Investigating advanced disease stages could indeed provide valuable insights into whether the observed alterations in memory performance, PV cell hyperexcitability and engram inhibition become more pronounced over time. Our previous work has shown that changes in pyramidal cell excitability emerge at a later stage than in PV cells, supporting the idea of progressive circuit dysfunction (Hijazi et al., 2020a). However, at these more advanced stages, additional pathological processes, such as increased gliosis (Janota, Brites, Lemere, & Brito, 2015; Kater et al., 2023) and synaptic loss (Alonso-Nanclares, Merino-Serrais, Gonzalez, & DeFelipe, 2013; Bittner et al., 2012), will likely contribute to both electrophysiological and behavioural measurements. Furthermore, we would like to point out that the current changes observed in memory performance, PV hyperexcitability and increased inhibitory input on engram cells at 16-20 weeks of age are not modest, but already quite substantial. Our focus on these early time points in APP/PS1 mice was intentional, as it helps us understand the initial changes in Alzheimer’s disease at a circuit level and identify therapeutic targets for early intervention. What happens at later stages is certainly of interest, but beyond the scope of this study and should therefore be addressed in future studies. We have incorporated a discussion related to this point into the revised manuscript (line 602-606):

      ‘Moreover, it is relevant to investigate whether changes in PV and PYR cell excitability, as well as input onto engram cells in the mPFC, become more pronounced at later disease stages. Nonetheless, by focussing on early disease timepoints in the present study, we aimed to understand the initial circuit-level changes in AD and identify targets for early therapeutic intervention.’

      (3) Address network hyperexcitability: Spontaneous epileptiform activity has been reported in APP/PS1 mice from 4 months of age (Reyes-Marin & Nuñez, 2017). Including EEG data or discussing this point in relation to your findings would help contextualize the observed inhibitory remodeling within broader network dysfunction. 

      We thank the reviewer for this valuable input and for highlighting the study by Reyes-Marin and Nuñez (2017). In line with this, we recently reported longitudinal local field potential (LFP) recordings in freely behaving APP/PS1 Parv-Cre mice and wild type control animals between the ages of 3 and 12 months (van Heusden et al., 2023). Weekly recordings were performed in the home cage under awake mobile conditions. These data showed no indications of epileptiform activity during wakefulness, consistent with previous findings that epileptic discharges in APP/PS1 mice predominantly occur during sleep (Gureviciene et al., 2019). Recordings were obtained from the prefrontal cortex (PFC), parietal cortex and the hippocampus. In contrast, the study by Reyes-Marin and Nuñez (2017) recorded from the somatosensory cortex in anesthetized animals. Here, during spontaneous recordings, no differences were observed in delta, theta or alpha frequency bands between APP/PS1 and WT mice. Interestingly, we observed an early increase in absolute power, particularly in the hippocampus and parietal cortex from 12 to 24 weeks of age in APP/PS1 mice. In the PFC we found a shift in relative power from lower to higher frequencies and a reduction in theta power. Connectivity analyses revealed a progressive, age-dependent decline in theta/alpha coherence between the PFC and both the parietal cortex and hippocampus. Given the well-established role of PV interneurons in network synchrony and in coordinating theta and gamma oscillations critical for cognitive function (Sohal, Zhang, Yizhar, & Deisseroth, 2009; Xia et al., 2017), these findings support the idea of early circuit dysfunction in APP/PS1 mice. Our findings, i.e. hyperexcitability of PV cells, align with these LFP-based network-level observations. These data suggest an early shift in the E/I balance, contributing to altered oscillatory dynamics and impaired inter-regional connectivity, possibly leading to alterations in memory. However, whether the observed PV hyperexcitability in our study directly contributes to alterations in power and synchrony remains to be elucidated. Furthermore, it would be interesting to determine the individual contribution of PV cell hyperexcitability in the hippocampus versus the mPFC to network changes and concurrent memory deficits. We have added a statement on network hyperexcitability to the discussion (line 561-565).

      ‘Interestingly, we recently found a progressive disruption of oscillatory network synchrony between the mPFC and hippocampus in APP/PS1 Parv-Cre mice (van Heusden et al., 2023). However, whether the observed PV cell hyperexcitability directly contributes to changes in inter-regional synchrony, and whether this leads to alterations at a network level, i.e. increased inhibitory input on engram cells, and consequently to memory deficits, remains to be elucidated in future studies.’ 

      (4) Mechanisms responsible for PV hyperexcitability: Related to the previous point, a discussion of the possible underlying mechanisms, e.g., direct effects of amyloid-β, inflammatory processes, or compensatory mechanisms, would strengthen the discussion. 

      We agree with the reviewer that this will strengthen the discussion. We have now added a comprehensive discussion in the revised manuscript to address potential mechanisms responsible for PV cell hyperexcitability (line 579-594):

      ‘Prior studies have shown that neurons in the vicinity of amyloid beta plaques show increased excitability (Busche et al., 2008). We demonstrated that PV neurons in the CA1 are hyperexcitable and that treatment with a BACE1 inhibitor, i.e. reducing amyloid beta levels, rescues PV excitability (Hijazi et al., 2020a). In line with this, we also reported that addition of amyloid beta to hippocampal slices increases PV excitability, without altering pyramidal cell excitability (Hijazi et al., 2020a). Finally, applying amyloid beta to an induced mouse model of PV hyperexcitability further impairs PV function (Hijazi et al., 2020b). Since amyloid beta plaque load in the mPFC remains comparable between 16- and 20-week-old APP/PS1 mice, the observed increased excitability is unlikely the result of changes in insoluble amyloid beta levels. Previous data from our lab show that soluble amyloid beta is already present as early as 6 weeks of age and becomes more prominent at 24 weeks of age (Kater et al., 2023; Végh et al., 2014). The progressive increase in soluble amyloid beta levels may contribute to the emergence of PV cell hyperexcitability. We hypothesize that the hyperexcitability induced by amyloid beta may result from disrupted ion channel function, as PV neuron dysfunction can result from altered potassium (Olah et al., 2022) and sodium channel activity (Verret et al., 2012).’

      (5) Excitatory-inhibitory balance: While the main focus is on increased inhibition onto engram cells, the reported increase in sEPSC frequency (Figure 5g) across genotypes suggests the presence of excitatory remodelling as well. A brief discussion of how this may interact with increased inhibition would be valuable.  

      We thank the reviewer for this comment regarding the interaction between excitatory and inhibitory remodelling. We have now incorporated this discussion point into the revised manuscript (line 528-534):

      ‘Interestingly, both WT and APP/PS1 mice showed an increase in sEPSC frequency onto engram cells, suggesting that increased excitatory input is a consequence of memory retrieval and not affected by genotype. However, only in APP/PS1 mice, the augmented excitatory input coincided with an elevation of inhibitory input onto engram cells. The resulting imbalance between excitation and inhibition could therefore potentially disrupt the precise control of engram reactivation and contribute to the observed remote memory impairment.’

      References

      Alonso-Nanclares, L., Merino-Serrais, P., Gonzalez, S., & DeFelipe, J. (2013). Synaptic changes in the dentate gyrus of APP/PS1 transgenic mice revealed by electron microscopy. J Neuropathol Exp Neurol, 72(5), 386-395. doi:10.1097/NEN.0b013e31828d41ec

      Bittner, T., Burgold, S., Dorostkar, M. M., Fuhrmann, M., Wegenast-Braun, B. M., Schmidt, B., . . . Herms, J. (2012). Amyloid plaque formation precedes dendritic spine loss. Acta Neuropathologica, 124(6), 797-807. doi:10.1007/s00401-012-1047-8

      Busche, M. A., Eichhoff, G., Adelsberger, H., Abramowski, D., Wiederhold, K. H., Haass, C., . . . Garaschuk, O. (2008). Clusters of hyperactive neurons near amyloid plaques in a mouse model of Alzheimer's disease. Science, 321(5896), 1686-1689. doi:10.1126/science.1162844

      Grienberger, C., Rochefort, N. L., Adelsberger, H., Henning, H. A., Hill, D. N., Reichwald, J., . . . Konnerth, A. (2012). Staged decline of neuronal function in vivo in an animal model of Alzheimer's disease. Nat Commun, 3, 774. doi:10.1038/ncomms1783

      Gureviciene, I., Ishchenko, I., Ziyatdinova, S., Jin, N., Lipponen, A., Gurevicius, K., & Tanila, H. (2019). Characterization of Epileptic Spiking Associated With Brain Amyloidosis in APP/PS1 Mice. Front Neurol, 10, 1151. doi:10.3389/fneur.2019.01151

      Hijazi, S., Heistek, T. S., Scheltens, P., Neumann, U., Shimshek, D. R., Mansvelder, H. D., . . . van Kesteren, R. E. (2020a). Early restoration of parvalbumin interneuron activity prevents memory loss and network hyperexcitability in a mouse model of Alzheimer's disease. Mol Psychiatry, 25(12), 3380-3398. doi:10.1038/s41380-019-0483-4

      Hijazi, S., Heistek, T. S., van der Loo, R., Mansvelder, H. D., Smit, A. B., & van Kesteren, R. E. (2020b). Hyperexcitable Parvalbumin Interneurons Render Hippocampal Circuitry Vulnerable to Amyloid Beta. iScience, 23(7), 101271. doi:10.1016/j.isci.2020.101271

      Janota, C. S., Brites, D., Lemere, C. A., & Brito, M. A. (2015). Glio-vascular changes during ageing in wild-type and Alzheimer's disease-like APP/PS1 mice. Brain Res, 1620, 153-168. doi:10.1016/j.brainres.2015.04.056

      Kater, M. S. J., Huffels, C. F. M., Oshima, T., Renckens, N. S., Middeldorp, J., Boddeke, E., . . . Verheijen, M. H. G. (2023). Prevention of microgliosis halts early memory loss in a mouse model of Alzheimer's disease. Brain Behav Immun, 107, 225-241. doi:10.1016/j.bbi.2022.10.009

      Kim, H. Y., Kim, H. V., Jo, S., Lee, C. J., Choi, S. Y., Kim, D. J., & Kim, Y. (2015). EPPS rescues hippocampus-dependent cognitive deficits in APP/PS1 mice by disaggregation of amyloid-β oligomers and plaques. Nature Communications, 6(1), 8997. doi:10.1038/ncomms9997

      Olah, V. J., Goettemoeller, A. M., Rayaprolu, S., Dammer, E. B., Seyfried, N. T., Rangaraju, S., . . . Rowan, M. J. M. (2022). Biophysical Kv3 channel alterations dampen excitability of cortical PV interneurons and contribute to network hyperexcitability in early Alzheimer’s. Elife, 11, e75316. doi:10.7554/eLife.75316

      Reyes-Marin, K. E., & Nuñez, A. (2017). Seizure susceptibility in the APP/PS1 mouse model of Alzheimer's disease and relationship with amyloid β plaques. Brain Res, 1677, 93-100. doi:10.1016/j.brainres.2017.09.026

      Sohal, V. S., Zhang, F., Yizhar, O., & Deisseroth, K. (2009). Parvalbumin neurons and gamma rhythms enhance cortical circuit performance. Nature, 459(7247), 698-702. doi:10.1038/nature07991

      van Heusden, F. C., van Nifterick, A. M., Souza, B. C., França, A. S. C., Nauta, I. M., Stam, C. J., . . . van Kesteren, R. E. (2023). Neurophysiological alterations in mice and humans carrying mutations in APP and PSEN1 genes. Alzheimers Res Ther, 15(1), 142. doi:10.1186/s13195-023-01287-6

      Végh, M. J., Heldring, C. M., Kamphuis, W., Hijazi, S., Timmerman, A. J., Li, K. W., . . . van Kesteren, R. E. (2014). Reducing hippocampal extracellular matrix reverses early memory deficits in a mouse model of Alzheimer's disease. Acta Neuropathol Commun, 2, 76. doi:10.1186/s40478-014-0076-z

      Verret, L., Mann, E. O., Hang, G. B., Barth, A. M., Cobos, I., Ho, K., . . . Palop, J. J. (2012). Inhibitory interneuron deficit links altered network activity and cognitive dysfunction in Alzheimer model. Cell, 149(3), 708-721. doi:10.1016/j.cell.2012.02.046

      Xia, F., Richards, B. A., Tran, M. M., Josselyn, S. A., Takehara-Nishiuchi, K., & Frankland, P. W. (2017). Parvalbumin-positive interneurons mediate neocortical-hippocampal interactions that are necessary for memory consolidation. Elife, 6. doi:10.7554/eLife.27868

      Zhang, W., Hao, J., Liu, R., Zhang, Z., Lei, G., Su, C., . . . Li, Z. (2011). Soluble Aβ levels correlate with cognitive deficits in the 12-month-old APPswe/PS1dE9 mouse model of Alzheimer's disease. Behavioural Brain Research, 222(2), 342-350. doi:10.1016/j.bbr.2011.03.072

    1. Author response:

      The following is the authors’ response to the original reviews.

      Reviewer #1 (Public review):

      In this manuscript, Chang et al. investigated the cell type-specific role of the integrin activator Shv in activity-dependent synaptic remodeling. Using the Drosophila larval neuromuscular junction as a model, they show that glial-secreted Shv modulates synaptic plasticity by maintaining the extracellular balance of neuronal Shv proteins and regulating ambient extracellular glutamate concentrations, which in turn affects postsynaptic glutamate receptor abundance. Furthermore, they report that genetic perturbation of glial morphogenesis phenocopies the defects observed with the loss of glial Shv. Altogether, their findings propose a role for glia in activity-induced synaptic remodeling through Shv secretion. While the conclusions are intriguing, several issues related to experimental design and data interpretation merit further discussion.

      We appreciate the insightful and constructive comments. We have added new data and modified the text to address your concerns.  In doing so, the manuscript has been substantially strengthened.  Please see our detailed point-by-point response below. 

      Reviewer #2 (Public review):

      In this paper Chang et al. follow up on their lab's previous findings about the secreted protein Shv and its role in activity-induced synaptic remodeling at the fly NMJ. Previously they reported that shv mutants have impaired synaptic plasticity. Normally a high stimulation paradigm should increase bouton size and GluR expression at synapses but this does not happen in shv mutants. The phenotypes relating to activity-dependent plasticity were completely recapitulated when Shv was knocked down only in neurons and could be completely rescued by incubation in exogenously applied Shv protein. The authors also showed that Shv activation of integrin signaling on both the pre- and post-synapse was the molecular mechanism underlying its function. Here they extend their study to consider the role of Shv derived from glia in modulating synaptic features at baseline and remodeling conditions. This study is important to understand if and how glia contribute to these processes. Cell-type-specific knockdown of Shv only in glia causes abnormally high baseline GluR expression and prevents activity-dependent increases in bouton size or GluR expression post-stimulation. This does not appear to be a developmental defect as the authors show that knocking down Shv in glia after basic development has the same effects as lifelong knockdown, so Shv is acting in real time. Restoring Shv in ONLY glia in mutant animals is sufficient to completely rescue the plasticity phenotypes and baseline GluR expression, but glial-Shv does not appear to activate integrin signaling which was shown to be the mechanism for neuronally derived Shv to control plasticity. This led the authors to hypothesize that glial Shv works by controlling the levels of neuronal Shv and extracellular glutamate. They provide evidence that in the absence of glial Shv, synaptic levels of Shv go up overall, presumably indicating that neurons secrete more Shv, which in this context could then work via integrin signaling as described to control plasticity. They use a glutamate sensor and observe decreased signal (extracellular glutamate) from the sensor in glial Shv KD animals; however, this background has extremely high GluR levels at the synapse which may account for some or all of the decreases in sensor signal in this background. Additional controls to test if increased GluR density alone affects sensor readouts and/or independently modulating GluR levels in the glial KD background would help strengthen this data. In fact, glial-specific shv KD animals have baseline levels of GluR that are potentially high enough to have hit a ceiling of expression or detection that accounts for the inability for these levels to modulate any higher after strong stimulation, and such a ceiling effect should be considered when interpreting the data and conclusions of this paper. Several outstanding questions remain: why can't glial-derived Shv activate integrin pathways but exogenously applied recombinant Shv protein can? The effects of neuronal specific rescue of shv in a shv mutant are not provided vis-à-vis GluR levels and bouton size to compare to the glial only rescue. Inclusion of this data might provide more insight to outstanding questions of how and why the source of Shv seems to matter for some aspects of the phenotypes but not others despite the fact that exogenous Shv can rescue in some experimental paradigms but not others.

      We appreciate your insightful comments. We have added new data and modified the text to address your concerns.  In doing so, the manuscript has been substantially strengthened.  Please also see the enclosed point-by-point response.

      To address the question of whether altered GluR density alone affects sensor readouts, we expressed GluR using a mhc promoter-driven GluRIIA fusion line, which increases total GluRIIA expression in muscle independently of the Gal4/UAS system. As shown in Figure 6 – figure supplement 1, mhc-GluRIIA animals exhibited elevated levels of not only GluRIIA but also the obligatory GluRIIC subunit. Despite this increase in GluR expression, we did not observe any change in extracellular glutamate levels, as measured by live imaging using the neuronal iGluSnFR sensor (updated Figure 6A). These results suggest that elevated GluR density alone does not alter iGluSnFR sensor dynamics and further support our conclusions.

      Regarding the question of a ceiling effect, we do not think that the lack of GluR enhancement in repo>shv-RNAi is due to a saturated postsynaptic state. This is based on the results in Figure 6, which show that GluR levels can increase up to fourfold upon stimulation in the presence of glutamate, whereas repo>shv-RNAi results in only a ~2-fold increase in baseline GluR concentration. These results suggest that the synapse retains the capacity for further upregulation.

      To address the question of why exogenously applied Shv activates integrin while glial-derived Shv does not, we tested whether glia and neurons could differentially modify Shv. Based on Western blot analyses of adult heads and larval brains showing that Shv is present as a single band (Fig. 1A and Figure 2 – figure supplement 1B), the functional differences between neuronal and glial Shv are not likely due to the presence of different isoforms. Consistent with this, FlyBase also suggests that shv encodes a single isoform. However, while we did not detect obvious post-translational modifications when Shv protein was expressed in neurons or glia (Figure 5 – figure supplement 1A), we cannot exclude the possibility that different cell types process Shv differently through post-transcriptional or post-translational mechanisms. Notably, shv is predicted to undergo A-to-I RNA editing, including an editing site in the coding region, which will result in a single amino acid change (St Laurent et al., 2013). Given that ADAR, the editing enzyme, is enriched in neurons and absent from glia (Jepson et al., 2011), such cell-specific editing could contribute to functional differences. It will be interesting to investigate this in the future. We have now included this in the Discussion section.

      Additionally, we have now included new data on neuronal Shv rescue of shv<sup>1</sup> mutants as suggested in the updated Figure 4. Consistent with previous findings that neuronal Shv rescues integrin signaling and electrophysiological phenotypes (Lee et al., 2017), we found that it also restores bouton size, GluR levels, and activity-induced synaptic remodeling. These results support the functional contribution of neuronal Shv. 

      Reviewer #3 (Public review):

      Summary:

      The manuscript by Chang and colleagues provides compelling evidence that glia-derived Shriveled (Shv) modulates activity-dependent synaptic plasticity at the Drosophila neuromuscular junction (NMJ). This mechanism differs from the previously reported function of neuronally released Shv, which activates integrin signaling. They further show that this requirement of Shv is acute and that glial Shv supports synaptic plasticity by modulating neuronal Shv release and the ambient glutamate levels. However, there are a number of conceptual and technical issues that need to be addressed.

      We appreciate the insightful and constructive comments. We have added new data and modified the text to address your concerns.  In doing so, the manuscript has been substantially strengthened.  Please see our detailed point-by-point response below.

      Major comments:

      (1) From the images provided for Fig 2B +RU486, the bouton size appears to be bigger in shv RNAi + stimulation, especially judging from the outline of GluR clusters.

      Thank you for pointing this out. We have selected another image to better represent the data.

      (2) The shv result needs to be replicated with a separate RNAi.

      We have used another independent RNAi line targeting shv to confirm our findings (BDSC 37507). This shv-RNAi<sup>37507</sup> line also showed the same phenotype, including increased GluR levels and impaired activity-induced synaptic remodeling (new Figure 2 – figure supplement 1A).

      (3) The phenotype of shv mutant resembles that of neuronal shv RNAi - no increased GluR baseline. Any insights why that is the case?

      This is an interesting question. We speculate that neuronal Shv normally has a dominant role in maintaining GluR levels during development, mainly through its ability to activate integrin signaling. Consistent with this, we have shown that mutations in integrin lead to a drastic reduction in GluR levels at the NMJ (Lee et al., 2017). While we have shown that neuronal knockdown of shv elevates Shv from glia (Fig. 5E), glial Shv cannot activate integrin signaling (Fig. 5B, 5C). Additionally, high levels of glial Shv will elevate ambient glutamate concentrations (Figure 6A), which will likely reduce GluR abundance and impair synaptic remodeling (Augustin et al., 2007; Chen et al., 2009; and Figure 6B). Therefore, neuronal knockdown of Shv resulted in the same phenotype as the shv<sup>1</sup> mutant.

      (4) In Fig 3B, SPG shv RNAi has elevated GluR baseline, while PG shv RNAi has a lower baseline. In both cases, there is no activity induced GluR increase. What could explain the different phenotypes?

      SPG is the middle glial cell layer in the fly peripheral nervous system and may also influence the PG layer through signaling mechanisms (Lavery et al., 2007), therefore having a stronger effect. We have now mentioned this in the text. 

      (5) In Fig 4C, the rescue of PTP is only partial. Does that suggest neuronal shv is also needed to fully rescue the deficit of PTP in shv mutants?

      This is indeed a possibility. We have shown that neuronal and glial Shv each contribute to activity-induced synaptic remodeling through different mechanisms. It will be interesting to test this in the future.

      (6) The observation in Fig 5D is interesting. While there is a reduction in Shv release from glia after stimulation, it is unclear what the mechanism could be. Is there a change in glial shv transcription, translation or the releasing machinery? It will be helpful to look at the full shv pool vs the released ones. 

      Thank you for the suggestion. To address this, we monitored the levels of intracellular Shv using a permeabilized preparation (we found that the addition of detergent to permeabilize the sample strips away extracellular Shv). Combined with the extracellular staining results, we can get an idea about the total amount of Shv. As shown in the updated Figure 5D, intracellular Shv levels (permeabilized) remained unchanged following stimulation, indicating that there is no intracellular accumulation and that the observed decrease in extracellular Shv is unlikely due to impaired release machinery.

      (7) In Fig 5E, what will happen after stimulation? Will the elevated glial Shv after neuronal shv RNAi be retained in the glia? 

      Thank you for the interesting question. We agree that examining Shv distribution following neuronal activity would be highly informative. While we plan to perform time-lapse experiments in future studies to address this, we feel that such analyses are beyond the scope of the current manuscript.

      (8) It would be interesting to see if the localization of shv differs based on if it is released by neuron or glia, which might be able to explain the difference in GluR baseline. For example, by using glia-Gal4>UAS-shv-HA and neuronal-QF>QUAS-shv-FLAG. It seems important to determine if they mix together after release? It is unclear if the two shv pools are processed differently.

      We agree that investigating whether neuronal and glial shv pools colocalize or are differentially processed is an important future direction. We hope to examine how each pool responds to stimulation in the shv<sup>1</sup> mutant background using LexA and Gal4 systems in the future.

      (9) Alternatively, do neurons and glia express and release different Shv isoforms, which would bind different receptors?

      Thank you for the question. We have now addressed this in the Discussion and have also included the text below:

      Based on Western blot analyses of adult heads and larval brains showing that Shv is present as a single band (Fig. 1A and Figure 2 – figure supplement 1B), the functional differences between neuronal and glial Shv are not likely due to the presence of different isoforms. Consistent with this, FlyBase also suggests that shv encodes a single isoform (Ozturk-Colak et al., 2024). However, while we did not detect obvious post-translational modifications when Shv protein was expressed in neurons or glia (Figure 5 – figure supplement 1A), we cannot exclude the possibility that different cell types process Shv differently through post-transcriptional or post-translational mechanisms. Notably, shv is predicted to undergo A-to-I RNA editing, including an editing site in the coding region, which could result in a single amino acid change (St Laurent et al., 2013). Given that ADAR, the editing enzyme, is enriched in neurons and absent from glia (Jepson et al., 2011), such cell-specific editing could contribute to functional differences. It will be interesting to investigate this in the future.

      (10) It is claimed that Sup Fig 2 shows no observable change in gross glial morphology, further bolstering support that glial Shv does not activate integrin. This seems quite an overinterpretation. There is only one image for each condition without quantification. It is hard to judge if glia, which is labeled by GFP (presumably by UAS-eGFP?), is altered or not.

      Thank you for raising this concern. To strengthen our claim, we now include additional images (Figure 5, figure supplement 2). No obvious change in overall glial morphology was observed, with glia continuing to wrap the segmental nerves and extend processes that closely associate with proximal synaptic boutons (Figure 5, figure supplement 2). These observations suggest that glial Shv is not essential for maintaining normal glial structure or survival, and are consistent with the idea that glial Shv does not activate integrin, as integrin signaling is required to maintain the integrity of peripheral glial layers.

      (11) The hypothesis that glutamate regulates GluR level as a homeostatic mechanism makes sense. What is the explanation of the increased bouton size in the control after glutamate application in Fig 6?

      We speculate that it could be due to a retrograde signaling mechanism activated by elevated extracellular glutamate, allowing neurons to modulate bouton morphology in response to synaptic demand. It will be interesting to investigate this possibility in the future.  

      (12) What could be a mechanism that prevents elevated glial released Shv to activate integrin signaling after neuronal shv RNAi, as seen in Fig 5E?

      One potential mechanism is post-translational or post-transcriptional processing of Shv. Although our Western blots did not reveal differences in the molecular weight of glial vs. neuronal Shv, we cannot exclude the possibility that modifications not readily detectable by this method are responsible. Additionally, as mentioned in the Discussion section, post-transcriptional processing such as A-to-I RNA editing could introduce changes in the Shv protein, potentially altering its ability to interact with or activate integrin. 

      (13) Any speculation on how the released Shv pool is sensed?

      The same RNA editing modification mentioned earlier or post-translational modifications in Shv may also influence how it is sensed by target cells. 

      Reviewer #1 (Recommendations for the authors):

      Issues Regarding Cell Type-Specific Secretion and the Role of Shv:

      Extracellular Secretion of Shv:

      (1) The data in Figure 1 suggest that Shv is not secreted under resting conditions, challenging the proposed extracellular role of Shv. It remains unclear whether Shv secretion can be confirmed using Shv-eGFP (knock-in) following high K+ stimulation.

      We apologize for not being clear. In Figure 1, the Shv signals shown are from a permeabilized preparation, which preferentially labels intracellular Shv. We do observe secreted Shv-eGFP following stimulation (Figure 5E), consistent with our hypothesis. However, the endogenous extracellular Shv-eGFP signal is very weak, and was therefore detected using the GFP antibody and amplified with a fluorescent secondary antibody. We have now also included additional controls in Figure 5E to demonstrate the specificity of the staining.

      (2) In Figure 5D, total Shv staining should be included to evaluate potential presynaptic accumulation of intracellular Shv, which may lead to extracellular secretion upon stimulation. Additionally, the representative images of glial rescue do not seem to align with the quantification data; more extracellular Shv signals were observed after stimulation.

      Thank you for the comments. We monitored the levels of intracellular Shv using a permeabilized preparation (detergent treatment strips away the extracellular Shv signal). When combined with non-permeabilized extracellular staining, this approach provides insights into total Shv levels. We found no intracellular accumulation of Shv, and the intracellular levels remained unchanged following stimulation (updated Figure 5D), suggesting that reduced extracellular Shv is not likely due to impaired release. Additionally, we have selected another image for glial rescue that avoids the trachea region and better represents the quantification data.

      (3) In Figure 5E, "extracellular" Shv staining in repo>shv-RNAi samples appears localized within synaptic boutons. This raises concerns about the staining protocol potentially labeling intracellular proteins. Control experiments using presynaptic cytosolic markers are needed to confirm staining specificity.

      Thank you for the thoughtful suggestion. To validate that our staining protocol is selective for extracellular proteins, we also stained for cysteine string protein (CSP), an intracellular synaptic vesicle protein predominantly located in the presynaptic terminals (Zinsmaier et al., 1990; Umbach et al., 1994), under the same conditions. CSP was detected only in the permeabilized condition (updated Figure 5E), suggesting that the non-permeabilizing protocol is selective for extracellular proteins. 

      (4) The study does not clarify why Shv knockdown in either perineurial glia or subperineurial glia abolishes stimulus-dependent synaptic remodeling. Does Shv secretion occur from PG, SPG, or both toward the synaptic bouton?

      Thank you for raising this point. SPG is the middle glial cell layer in the fly peripheral nervous system and may also influence the PG layer through signaling mechanisms (Lavery et al., 2007). Consistent with this, we observed a stronger effect on GluR levels when SPG was disrupted compared to PG. It will be interesting to distinguish whether Shv is released by PG or SPG in the future.

      (5) The possibility of an inter-glial role for Shv via integrin signaling in regulating glial morphogenesis is underexplored. The rough morphological characterization in Supplemental Figure 2 requires more detailed quantification and the use of sub-glial type-specific GAL4 drivers.

      We now include additional images (Figure 5, figure supplement 2) to examine the overall glial morphology. There was no obvious change in gross glial morphology, with glia continuing to wrap the segmental nerves and extend processes that closely associate with proximal synaptic boutons when shv is knocked down in glia (Figure 5, figure supplement 2). These observations suggest that glial Shv is not essential for maintaining normal glial structure or survival, and is consistent with the idea that glial Shv does not activate integrin, as integrin signaling is required to maintain the integrity of peripheral glial layers (Xie and Auld, 2011; Hunter et al., 2020).

      (6) While repo>shv rescues stimulus-dependent bouton size and GluR increases in the shv mutant (Figure 5), the interaction between neuronal and glial Shv remains unclear. Does neuronal Shv influence the expression or distribution of glial Shv?

      We agree that investigating whether neuronal and glial shv pools influence each other’s expression or distribution is an important future direction. We hope to investigate this in more detail in the future using LexA-LexOp and GAL4/UAS dual expression systems.

      Issues Regarding the Regulation of GluR and Perisynaptic Glutamate by Glial Shv:

      (7) The methodology for iGluSnFR measurement (Figure 6A) is inadequately described. If anti-HRP staining was used to normalize signals, it suggests the experiment may have involved fixed tissue. However, iGluSnFR typically measures glutamate levels in live cells, raising concerns about the validity of this approach in fixed samples.

      We apologize for not being clear about the method used to measure iGluSnFR. The original figure was generated from imaging iGluSnFR signals immediately following fixation. To address the reviewer’s concern and validate these results, we have now performed live imaging experiments using a water dipping objective to measure iGluSnFR intensity in unfixed preparations (new Figure 6A). To label synaptic boutons, we co-expressed mtdTomato using the neuronal driver, nSybGAL4. The results from the live imaging experiments confirmed our original observation that glial Shv is required to control ambient extracellular glutamate levels (see updated Fig. 6A and text). Additionally, to ascertain that the decrease in iGluSnFR signal reflects a decrease in ambient extracellular glutamate levels rather than glutamate depletion caused by high levels of GluR, we upregulated GluR levels using mhc-GluRIIA, which drives GluRIIA expression in muscles (Petersen et al., 1997). We found that mhc-GluRIIA animals exhibited elevated levels of not only GluRIIA but also the obligatory GluRIIC subunit. However, iGluSnFR signals at the synapse remained unchanged (Figure 6A), suggesting that elevated GluR density alone does not reduce signals. Taken together, these results suggest that glial Shv plays a critical role in controlling ambient extracellular glutamate levels.

      (8) As shown in Figure 2, repo>shv-RNAi increases GluR levels before high K+ stimulation, potentially saturating postsynaptic GluR expression and precluding further increases upon stimulation.

      Our data in Figure 6 show that GluR levels can increase up to four-fold upon stimulation in the presence of glutamate, whereas repo>shv-RNAi results in only a ~2-fold increase in baseline GluR concentration. These results suggest that the synapse retains the capacity for further upregulation. Thus, we do not think that the lack of GluR enhancement in repo>shv-RNAi is due to a saturated postsynaptic state, but rather reflects a requirement for glial Shv in activity-dependent modulation.

      (9) Despite glial shv knockdown lowering extracellular glutamate levels, GluR levels unexpectedly increase (Figure 6B). This contradicts the known requirement for high ambient glutamate concentrations to promote GluR clustering and membrane expression (Chen et al., 2009). Furthermore, adding 2 mM glutamate reverses these increases, suggesting additional complexity in the regulation of Shv synaptic remodeling.

      Thank you for the comment and the opportunity to clarify this point. While it may seem counterintuitive at first glance, our observations are in line with previous reports showing that low ambient glutamate levels significantly elevate GluR intensity at the Drosophila NMJ (Chen et al., 2009), and that this increase can be reversed by glutamate supplementation (Augustin et al., 2007; Chen et al., 2009). We have revised the text to more clearly reflect this connection.

      (10) If glial Shv promotes GluR expression, why does the increased extracellular Shv from neuronal shv knockdown (elav>shv-RNAi, Figure 5E) fail to elicit stimulus-dependent GluR elevation?

      We speculate that this is because glial Shv does not activate integrin signaling (Figure 5B, C), and elevated glial Shv increases ambient glutamate concentration (Figure 6A), thereby reducing GluR expression (Augustin et al., 2007; Chen et al., 2009). This is indeed what we observed when shv is knocked down in neurons. 

      Additional Issues:

      (11) The type of bouton used for quantification (e.g., Ib or Is boutons) is not specified, which is critical for interpreting the results.

      We apologize for not being clear. We analyzed type Ib boutons as done previously (Lee et al., 2017 and Chang et al., 2024), and have now included this information in the Methods section.  

      (12) The extent of Shv protein depletion in the repo-GeneSwitch system needs validation to confirm the efficacy of the knockdown.

      Thank you for the suggestion. We confirmed the efficiency of acute shv knockdown by the repo-GeneSwitch system by performing Western blot analysis of dissected larval brains (Figure 2 – figure supplement 1B). Acute glial knockdown using the repo-GeneSwitch driver resulted in a 30% reduction in Shv levels, similar to the decrease observed with the repo-GAL4 driver, suggesting that the GeneSwitch driver is functional. Furthermore, knockdown of shv by the ubiquitous tubulin-GAL4 driver completely eliminated Shv protein, indicating that the RNAi construct is effective.  

      Reviewer #2 (Recommendations for the authors):

      (1) General comment on statistics/data presentation: The authors employ an unusual method of using both one-way ANOVA and multiple t-test stats for the same data. Would a 2-way ANOVA be the more appropriate solution to this problem (to analyze across genotype and stimulation condition)? Also a chart in the supplementals showing all comparisons rather than just the fraction explicitly reported in the graphs would be helpful (it is not clear if no indication on significance indicates no difference or just not reported between some of the baseline levels, especially since everything is presented as ratios and in some cases this could help with data interpretation of which baseline levels are different and how they compare to other baselines and other post-stim levels). Further, there are no sample sizes given for any experiment, nor are any values of means, SD, etc ever explicitly given.

      We appreciate the thoughtful suggestion. While a two-way ANOVA could be used to examine interaction effects between genotype and stimulation condition, our analysis was designed to address a specific biological question: whether each genotype, independent of baseline levels, is capable of undergoing activity-dependent synaptic remodeling. To this end, we used t-tests to directly compare unstimulated vs. stimulated conditions within each genotype, allowing us to determine whether stimulation produces a significant effect in an all-or-none manner. In parallel, we applied one-way ANOVA with post hoc tests to analyze differences among baseline (unstimulated) conditions across genotypes. This approach is justified by the fact that stimulation was applied acutely and separately, and therefore the baseline values should not be influenced by the stimulated condition. Because we were not aiming to compare the extent of synaptic remodeling between genotypes, we did not use a two-way ANOVA to analyze interaction effects across all conditions.
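
      The two-part design described above can be sketched as follows. This is only an illustration under stated assumptions: the response does not say which t-test variant was used, so Welch's unequal-variance form is assumed here; post hoc tests and p-values are omitted; and the genotype labels and all numbers are invented for the example, not taken from the study.

```python
from statistics import mean, variance


def welch_t(unstim, stim):
    """Welch's t statistic (unequal variances) for stimulated vs. unstimulated samples."""
    se_squared = variance(unstim) / len(unstim) + variance(stim) / len(stim)
    return (mean(stim) - mean(unstim)) / se_squared ** 0.5


def one_way_anova_f(groups):
    """F statistic of a one-way ANOVA across a list of sample groups."""
    all_values = [x for g in groups for x in g]
    grand_mean = mean(all_values)
    k, n = len(groups), len(all_values)
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    ss_within = sum((x - mean(g)) ** 2 for g in groups for x in g)
    return (ss_between / (k - 1)) / (ss_within / (n - k))


# Hypothetical fold-change values, invented purely for illustration.
data = {
    "control": {"unstim": [0.9, 1.0, 1.1, 1.0], "stim": [1.5, 1.7, 1.6, 1.8]},
    "glial_knockdown": {"unstim": [1.9, 2.1, 2.0, 2.2], "stim": [2.0, 2.1, 1.9, 2.2]},
}

# Question 1: does stimulation change the readout within each genotype?
within_genotype_t = {g: welch_t(d["unstim"], d["stim"]) for g, d in data.items()}

# Question 2: do unstimulated baselines differ across genotypes?
baseline_f = one_way_anova_f([d["unstim"] for d in data.values()])
```

      Keeping the two questions in separate tests mirrors the stated goal of detecting whether remodeling occurs at all in each genotype, rather than comparing its magnitude between genotypes.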

      In response to the reviewer’s suggestion, we have now added the sample number in the graphs. Additionally, in the Methods section, we include information that each sample represents biological repeats, and that data are presented as fold-change relative to unstimulated controls from the same experimental batch. This normalization is necessary, as absolute GluR intensities can vary depending on microscope settings and staining conditions.
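
      A minimal sketch of this kind of batch-wise normalization is given below. It assumes that the reference is the mean intensity of the unstimulated control genotype within each batch; the record layout, the `fold_change_by_batch` helper, and the intensity values are hypothetical and serve only to show why dividing by a same-batch baseline removes differences caused by imaging settings.

```python
from statistics import mean


def fold_change_by_batch(records, control_genotype="control"):
    """Divide each intensity by the mean unstimulated-control intensity of its own batch."""
    baselines = {}
    for batch in {r["batch"] for r in records}:
        reference = [
            r["intensity"]
            for r in records
            if r["batch"] == batch
            and r["genotype"] == control_genotype
            and not r["stimulated"]
        ]
        baselines[batch] = mean(reference)  # raises if a batch has no control sample
    return [{**r, "fold_change": r["intensity"] / baselines[r["batch"]]} for r in records]


# Hypothetical raw intensities (arbitrary units); batch B was imaged at half the gain.
example = [
    {"batch": "A", "genotype": "control", "stimulated": False, "intensity": 100},
    {"batch": "A", "genotype": "control", "stimulated": True, "intensity": 160},
    {"batch": "A", "genotype": "knockdown", "stimulated": False, "intensity": 200},
    {"batch": "B", "genotype": "control", "stimulated": False, "intensity": 50},
    {"batch": "B", "genotype": "control", "stimulated": True, "intensity": 80},
    {"batch": "B", "genotype": "knockdown", "stimulated": False, "intensity": 100},
]

normalized = fold_change_by_batch(example)
```

      In this toy data set the raw intensities differ twofold between batches, but the normalized values agree, which is the property the normalization is meant to provide.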

      (2) To clarify distinct roles of Shv coming from neurons vs glia it would help if the authors could include more data on the rescue of shv mutants with UAS-Shv in neurons alone. This data is never shown in the manuscript and data on what effect this rescue has on the pertinent phenotypes in this paper (bouton size and GluR staining) is not reported in the referred to 2017 paper. What this does and does not do for these phenotypes has important implications for how to interpret the glia-only rescue findings.

      Thank you for the suggestion. We have now included new data on neuronal Shv rescue in shv<sup>1</sup> mutants as suggested (updated Figure 4A). Consistent with previous findings that neuronal Shv rescues integrin signaling and electrophysiological phenotypes (Lee et al., 2017), we found that it also restores bouton size, GluR levels, and activity-induced synaptic remodeling. These results support the functional contribution of neuronal Shv. 

      (3) Figure 1C: Where are the images in the periphery taken? The morphology of the glia is odd in that "blobs" of glial membrane seemingly unattached to anything else are floating about? Perhaps these are a thin stack projection and so the connection to the main glia "stalks" are just cut off? Could a specific individual synapse be shown? Also consider HRP shown on its own so that where the actual boutons are could be more clear. It seems like both the Tomato and HRP channels are really overexposed making visualizing the morphology quite confusing. Also why not use the antibody against Shv to directly visualize expression which is more direct than a knock-in tagged version?

      Figure 1C shows a single optical slice of the NMJ at muscle segment 2, selected to clearly highlight Shv-eGFP localization at a branch in close contact with the glial membrane. The glial stalk is not visible in this image because it lies in a different focal plane from the branch of interest. We have now specified this information in the figure legend. In the original figure, the HRP signal (405 channel) was oversaturated, which interfered with visual clarity. In the updated Figure 1C, we reduced the intensity of overexposed channels to better reveal the weak Shv-eGFP signal and fine glial processes. While we have generated an antibody against Shv, the amount is extremely limited, and hence the Shv-eGFP fusion serves as a valuable tool for visualizing subcellular localization.

      (4) Do glutamate levels really rise in glia Shv KD? Although the iGluSnFR signal changes, could it be the high level of GluR at the synapse acting as sponges to sequester glutamate so that it can't stimulate the sensor as well? One way to test this would be to overexpress or KD GluRs in muscle in wildtype (or in the repo>Shv RNAi background) to see if that alone can modulate iGluSnFR signals?

      Thank you for suggesting this important control. To address the question of whether high GluR density alone could influence neuronal iGluSnFR sensor readouts, we expressed GluR using a mhc promoter-driven GluRIIA fusion line, which increases total GluRIIA expression in muscle independently of the Gal4/UAS system. As shown in Figure 6 – figure supplement 1, mhc-GluRIIA animals exhibited elevated levels of not only GluRIIA but also the obligatory GluRIIC subunit. Despite this increase in GluR expression, we did not observe any change in extracellular glutamate levels, as measured by live imaging using the neuronal iGluSnFR sensor (updated Figure 6A). These results suggest that elevated GluR density alone does not alter iGluSnFR sensor dynamics and further support our conclusions.

      (5) The authors have some Shv constructs that can't be secreted or can't bind to integrins. Performing cell type specific rescues with these constructs might also help distinguish how source matters for each proposed sub-function of Shv though this may be outside the scope of this study. 

      Thank you for noticing the Shv constructs we have. We hope to further test subfunctions of Shv in the future.

      (6) At one point the authors discuss experiments that measure how much Shv is released by glia during neuronal stimulation. Then state that "These data indicate that glial Shv does not directly inhibit integrin signaling." But how this experiment relates to integrin signaling is not explained and unclear.

      We apologize for the confusion. We have now updated the text to better explain our logic: “This activity-induced decrease in glial Shv levels, along with reduced integrin activation (Fig. 5B), suggest that glial Shv does not act by directly inhibiting integrin signaling.”

      Reviewer #3 (Recommendations for the authors):

      Minor comments

      (1) Readers are left wondering what causes the increased baseline of GluR after glial shv RNAi at Fig 1, which is addressed much later. It would be helpful to preemptively mention this.

      Thank you for the suggestion. To maintain a logical flow, we chose to first present the phenotypic data in Figures 1 and 2 and then return to the mechanistic explanation once we introduced ambient glutamate measurements. 

      (2) Be consistent with eGFP vs EGFP.

      Thank you, we have corrected the inconsistencies.  

      (3) Scale bar for Fig 1B is missing in the low-magnification panel.

      Thank you for pointing this out. We have added the scale bar to Figure 1B.

      (4) Fig 1C, it would be helpful to elaborate on the anatomy. For example, what NMJ/abdominal segment is this? Why only some axons are surrounded by glia?

      Figure 1C presents a single optical slice of the NMJ at muscle segment 2, chosen to highlight Shv-eGFP localization at a branch closely juxtaposed to the glial membrane. The glial stalk is not shown in this image because it resides in a different focal plane than the branch being visualized. We have now included this information in the figure legend.

      (5) For Fig 3B, while it is stated that "we observed normal synaptic remodeling using alrmGAL4," the effect size is smaller. There seems to be a decrease in the amount of synaptic remodeling occurring?

      Thank you for pointing this out. Our primary goal was to determine whether each genotype, regardless of baseline GluR levels, is capable of undergoing activity-dependent synaptic remodeling in response to stimulation. For this reason, we focused on detecting the presence or absence of remodeling rather than comparing the extent of remodeling across genotypes. While a smaller effect on activity-induced bouton size was observed with alrm-GAL4, the change was still statistically significant, indicating that remodeling does occur in this genotype. Currently, we do not have a clear biological interpretation for differences in the magnitude of remodeling, and therefore chose not to emphasize cross-genotype comparisons.

    1. Author response:

      Reviewer #1 (Public review):

      Major Concerns:

      (1) Lack of Direct Evidence for RadD-NKp46 Interaction

      The central claim that RadD interacts with NKp46 is not formally demonstrated. A direct binding assay (e.g., Biacore, ELISA, or pull-down with purified proteins) is essential to support this assertion. The absence of this fundamental experiment weakens the mechanistic conclusions of the study.

      The reviewer is correct. Direct binding assays are currently not feasible because RadD is a very large protein that would take years to purify. Instead, we performed immunoprecipitation assays using NKp46-Ig (Author response images 1 and 2). Fusobacteria were lysed using RIPA buffer, and the lysates were centrifuged twice to separate the supernatant from the pellet (which contains the bacterial membranes). The resulting lysates were incubated overnight with 2.5 µg of purified NKp46 and protein G beads. After thorough washing, the bound proteins were placed in sample buffer and heated at 95 °C for 8 minutes. The eluates were run on a 10% acrylamide gel and visualized by Coomassie blue staining. As can be seen, NKp46-Ig precipitated a protein band of around 350 kDa in both F. polymorphum ATCC10953 (Author response image 1) and F. nucleatum ATCC23726 (Author response image 2).

      Author response image 1. NKp46 immunoprecipitation with Fusobacterium polymorphum (ATCC 10953) lysates. The resulting lysates of supernatant and pellet of Fusobacterium were immunoprecipitated (IP) with 2.5 μg of control fusion protein (RBD-Ig) or with NKp46-Ig. Purified fusion proteins (2.5 μg) were also run on the gel.

      Author response image 2. NKp46 immunoprecipitation with Fusobacterium nucleatum (ATCC 23726) lysates. The resulting lysates of supernatant and pellet of Fusobacterium were immunoprecipitated (IP) with 2.5 μg of control fusion protein (RBD-Ig) or with NKp46-Ig. Purified fusion proteins (2.5 μg) were also run on the gel.

      (2) Figure 2: Binding Specificity and Bacterial Strains

      A CEACAM1-Ig control should be included in all binding experiments to distinguish between specific and non-specific Ig interactions. There is differential Ig binding between strains ATCC 23726 and 10953. The authors should quantify RadD expression in each strain to determine if the difference in binding is due to variation in RadD levels.

      No significant difference in mCEACAM-1-Ig binding was observed across multiple independent experiments. Author response image 3 shows a representative histogram showing mCEACAM-1-Ig binding to F. nucleatum ATCC 23726 and F. polymorphum ATCC 10953. Comparable binding levels were detected in both bacterial species (upper histogram). Similarly, NKp46-Ig and Ncr1-Ig fusion proteins exhibited comparable binding patterns (lower histogram). It is currently not possible to quantify RadD expression directly, as no anti-RadD antibody is available.

      Author response image 3. CEACAM-1 Ig binding to Fusobacterium ATCC 23726 and ATCC 10953. Upper histograms show staining with secondary antibody alone (gray) compared to CEACAM-1 Ig (black line). Lower histograms show binding of NKp46 and Ncr1 fusion proteins to the two Fusobacterium strains. Gray represents secondary antibody controls.

      (3) Figure 3: Flow Cytometry Inconsistencies and Missing Controls

      What do the FITC-negative, Ig-negative events represent? The authors should clarify whether these are background signals, bacterial aggregates, or debris.

      We now present the gating strategy used in these experiments (Author response image 4). The Ig-negative (fusion protein-negative) samples are bacteria stained only with the secondary antibody (anti-human AF647, detected in the APC channel). The FITC-negative events represent unlabeled bacteria.

      Author response image 4. Gating strategy for FITC-labeled Fusobacterium stained with fusion proteins. Bacteria were first gated as shown in the left panel. The gated population was then further analyzed in the right plot: the lower-left quadrant represents bacterial debris, the upper-left quadrant corresponds to FITC-stained bacteria only, and the upper-right quadrant shows bacteria double-positive for FITC and APC, indicating binding of the fusion proteins.

      Panel B, CEACAM1-Ig binding appears markedly increased compared to WT bacteria. The reason for this enhancement should be discussed-does it reflect upregulation of the bacterial ligand or an artifact of overexpression? Fluorescence compensation should be carefully reviewed for the NKp46/NCR1-Ig binding assays to ensure that the signals are not due to spectral overlap or nonspecific binding. Importantly, binding experiments using the FadI/RadD double knockout strain are missing and should be included. This control is essential.

      We do not know why CEACAM1-Ig binding is increased. We agree that the FadI/RadD double knockout strain would be a valuable control, but we do not currently have this strain.

      In Panel E, the basis for calculating fold-change in MFI is unclear. Please indicate the reference condition to which the change is normalized.

      The mean fluorescence intensity (MFI) fold change was calculated by dividing the MFI obtained from staining with the fusion proteins by the MFI of the corresponding secondary antibody control (bacteria incubated without fusion proteins).
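
      As a small worked illustration of this calculation, the sketch below divides the MFI of a fusion-protein stain by the MFI of the matching secondary-antibody-only control. The function name and the numerical values are hypothetical and are not taken from the experiments.

```python
def mfi_fold_change(mfi_fusion_protein, mfi_secondary_only):
    """Return the MFI of the fusion-protein stain relative to the secondary-only control."""
    if mfi_secondary_only <= 0:
        raise ValueError("secondary-only control MFI must be positive")
    return mfi_fusion_protein / mfi_secondary_only


# Hypothetical values: a stain with MFI 1200 over a secondary-only background of 300.
example_fold_change = mfi_fold_change(1200.0, 300.0)
```

      A value of 1 therefore means that the fusion protein adds no signal above the secondary antibody background.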

      (4) Figure 4: Binding Inhibition and Receptor Sensitivity

      Panel A lacks representative FACS plots and is currently difficult to interpret.

      Fusobacteria binding to CEACAM-1, NKp46, and NCR1 fusion proteins was tested in the presence of 5 and 10 mM L-arginine (Author response image 5). L-arginine inhibited the binding of NKp46-Ig and NCR1-Ig, whereas no effect was observed on CEACAM-1-Ig binding.

      Author response image 5. Fusobacterium binding inhibition by L-Arginine. The figure shows the binding of CEACAM1-Ig (left panel), NKp46-Ig (middle panel), and Ncr1-Ig (right panel) in the presence of 0 mM (black), 5 mM (red), and 10 mM (blue) L-arginine.

      Differences in the sensitivity of human vs. mouse NKp46 to arginine inhibition should be discussed, given species differences in receptor-ligand interactions.

      Ncr1, the murine orthologue of human NKp46, shares approximately 58% sequence identity with its human counterpart (1). The observed differences in arginine-mediated inhibition of bacterial binding between mouse and human NKp46 might stem from structural differences or distinct posttranslational modifications, such as glycosylation. Indeed, prediction algorithms combined with high-performance liquid chromatography analysis revealed that Ncr1 possesses two putative novel O-glycosylation sites, of which only one is conserved in humans (2).

      References

      (1) Biassoni R., Pessino A., Bottino C., Pende D., Moretta L., Moretta A. The murine homologue of the human NKp46, a triggering receptor involved in the induction of natural cytotoxicity. Eur J Immunol. 1999 Mar; 29(3).

      (2) Glasner A., Roth Z., Varvak A., Miletic A., Isaacson B., Bar-On Y., Jonjić S., Khalaila I., Mandelboim O. Identification of putative novel O-glycosylations in the NK killer receptor Ncr1 essential for its activity. Cell Discov. 2015 Dec 22; 1:15036.

      What are the inhibition results using F. nucleatum strains deficient in FadI?

      The inhibition pattern observed in the F. nucleatum ΔFadI mutant was comparable to that of the wild-type strain (Author response image 6). When cultured under identical conditions and exposed to increasing concentrations of arginine (0, 5, and 10 mM), the F. nucleatum ΔFadI strain also demonstrated a dose-dependent reduction in binding to NKp46 and Ncr1.

      Author response image 6. Arginine inhibition of NKp46-Ig and Ncr1-Ig binding in F. nucleatum ΔFadI. Histograms show NKp46-Ig (A, C) and Ncr1-Ig (B, D) binding to F. nucleatum ATCC10953 ΔFadI (A and B) and to F. nucleatum ATCC23726 ΔFadI (C and D) following exposure to 5 mM and 10 mM L-Arginine. Panels (E) and (F) display the mean fluorescence intensity (MFI) quantification corresponding to (A and B) and (C and D), respectively.

      In Panel B, CEACAM1-Ig and RadD-deficient bacteria must be included as negative controls for binding specificity upon anti-NKp46 blocking.

      We appreciate the request to include CEACAM1-Ig and RadD-deficient bacteria as negative controls for specificity under anti-NKp46 blocking. We do not think this is necessary, for several reasons. The 02 antibody is specific for NKp46, and we also used other anti-NKp46 antibodies that did not block the interaction, as well as an irrelevant antibody. In addition, we showed that arginine produced a dose-dependent reduction in NKp46/Ncr1 binding, consistent with an arginine-inhibitable RadD interaction already shown in our manuscript (Fig. 4A). The ΔRadD strains we used already demonstrate loss of NKp46/Ncr1 binding and loss of NK-boosting activity (Figs. 3, 5). Collectively, these data establish that NKp46/Ncr1 recognition of a high-molecular-weight ligand consistent with RadD is specific and functionally relevant.

      Figure 5: Functional NK Activation and Tumor Killing

      In Panels B and C, the key control condition (NK cells + anti-NKp46, without bacteria) is missing. This is needed to evaluate if NKp46 recognition is involved in tumor killing. The authors should explicitly test whether pre-incubation of NK cells with bacteria enhances their anti-tumor activity.

      No significant difference in NK cell cytotoxicity was observed between untreated NK cells and NK cells incubated with anti-NKp46 antibody in the absence of bacteria. Therefore, the NK + anti-NKp46 (O2) group was included as an additional control alongside the other experimental conditions shown in Figures 5b and 5c, and is presented in Author response image 7 below.

      Author response image 7. NK cytotoxicity against breast cancer cell lines. NK cell cytotoxicity against T47D (left) and MCF7 (right) breast cancer cell lines. This experiment follows the format of Figure 5b and 5c, with the addition of the NK cells + O2 antibody group. No significant differences were observed when values were normalized to NK cells alone.

      Could bacteria induce stress signals in tumor cells that sensitize them to NK killing? This distinction is critical.

      It remains unclear whether the bacteria induce stress-related signals in tumor cells that render them more susceptible to NK cell–mediated cytotoxicity.

      (6) Figure 5D: Mechanism of Peripheral Activation

      It is suggested that contact between bacteria and NK cells in the periphery leads to their activation. Can the authors confirm whether this pre-activation leads to enhanced killing of tumor targets, or if bacteria-tumor co-localization is required? The literature indicates that F. nucleatum localizes intracellularly within tumor cells. If so, how is RadD accessible to NKp46 on infiltrating NK cells?

      We do not expect that pre-activation of NK cells with bacteria would enhance their tumor-killing capacity. In fact, when NK cells were co-incubated with bacteria, we occasionally observed NK cell death. Although F. nucleatum can reside intracellularly, bacterial entry requires prior adhesion to tumor cells. At this stage—before internalization—the bacteria are accessible for recognition and binding by NK cells.

      (8) Figure 5E and In Vivo Relevance

      Surprisingly, F. nucleatum infection is associated with increased tumor burden. Does this reflect an immunosuppressive effect? Are NK cells inhibited or exhausted in infected mice (TIGIT, SIGLEC7...)? If NK cell activation leads to reduced tumor control in the infected context, the role of RadD-induced activation needs further explanation. RadD-deficient bacteria, which do not activate NK cells, result in even poorer tumor control. This paradox needs to be addressed: how can NK activation impair tumor control while its absence also reduces tumor control?

      Siglec-7 lacks a direct orthologue in mice, and neither mouse TIGIT nor CEACAM1 bind F. nucleatum. The increased tumor burden observed in infected mice may therefore result from bacterial interference with immune cell infiltration and accumulation within the tumor microenvironment (Parhi, L., Alon-Maimon, T., Sol, A. et al. Breast cancer colonization by Fusobacterium nucleatum accelerates tumor growth and metastatic progression. Nat Commun 11, 3259 (2020)). Consequently, the NK cells that do reach the tumor site can recognize and kill F. nucleatum–bearing tumor cells through RadD–NKp46 interactions. In the absence of RadD, this recognition is impaired, leading to reduced NK-mediated cytotoxicity and increased tumor growth.

      (9) NKp46-Deficient Mice: Inconsistencies

      In Ncr1⁻/⁻ mice, infection with WT or RadD-deficient F. nucleatum has no impact on tumor burden. This suggests that NKp46 is dispensable in this context and casts doubt on the physiological relevance of the proposed mechanism. This contradiction should be discussed more thoroughly.

      Ncr1 is also directly involved in mediating NK cell–dependent killing of tumor cells, even in the absence of bacterial infection. Therefore, in Ncr1-deficient mice, F. nucleatum has no additional effect on tumor progression (Glasner, A., Ghadially, H., Gur, C., Stanietsky, N., Tsukerman, P., Enk, J., Mandelboim, O. Recognition and prevention of tumor metastasis by the NK receptor NKp46/NCR1. J Immunol. 2012).

      Reviewer #2 (Public review):

      Weaknesses:

      (1) A previous study by this group (PMID: 38952680) demonstrated that RadD of F. nucleatum binds to NK cells via Siglec-7, thereby diminishing their cytotoxic potential. They further proposed that the RadD-Siglec-7 interaction could act as an immune evasion mechanism exploited by tumor cells. In contrast, the present study reports that RadD of F. nucleatum can also bind to the activating receptor NKp46 on NK cells, thereby enhancing their cytotoxic function.

      Siglec-7 lacks a direct orthologue in mice, and neither mouse TIGIT nor CEACAM1 binds F. nucleatum. In contrast, NKp46 and its murine homologue, Ncr1, both recognize and bind the bacterium.

      While F. nucleatum-mediated tumor progression has been documented in breast and colon cancers, the current study proposes an NK-activating role for F. nucleatum in HNSC. However, it remains unclear whether tumor-infiltrating NK cells in HNSC exhibit differential expression of NKp46 compared to Siglec-7. Furthermore, heterogeneity within the NK cell compartment, particularly in the relative abundance of NKp46⁺ versus Siglec-7⁺ subsets, may differ substantially among breast, colon, and HNSC tumors. Such differences could have been readily investigated using publicly available single-cell datasets. A deeper understanding of this subset heterogeneity in NK cells would better explain why F. nucleatum is passively associated with a favorable prognosis in HNSC but correlates with poor outcomes in breast and colon cancers.

      Currently, there are no publicly available single-cell datasets suitable for characterizing NK cell heterogeneity in the context of F. nucleatum infection—particularly regarding the expression of Siglec-7, NKp46, or CEACAM1 and their potential association with poor clinical outcomes in breast, head and neck squamous cell carcinoma (HNSC), or colorectal cancer (CRC). Furthermore, no RNA-seq datasets are available for breast cancer cases specifically associated with F. nucleatum infection and poor prognosis. Therefore, we analyzed bulk RNA expression datasets for Siglec-7 and CEACAM1 and evaluated their associations with HNSC and CRC using the same patient databases utilized in our manuscript (Author response image 8). No significant differences in Siglec-7 expression were detected between HNSC and CRC samples (Author response image 8A). Although CEACAM1 mRNA levels did not differ between F. nucleatum–positive and –negative cases within either cancer type, its overall expression was higher in CRC compared to HNSC (Author response image 8B).

      Author response image 8. Siglec7 and Ceacam1 expression and the prognostic effect of F. nucleatum in a tumor-type-specific manner. Comparison of Siglec7 (A) and Ceacam1 (B) expression across HNSC and CRC tumors. Log₂ expression levels of NKp46 mRNA were compared across HNSC and CRC cohorts, stratified by F. nucleatum positive and negative. Results were analyzed by one-way ANOVA with Bonferroni post hoc correction.

      (2) The in vivo tumor data (Figure 5D-F) appear to contradict the authors' claims. Specifically, Figure 5E suggests that WT mice engrafted with AT3 breast tumors and inoculated with WT F. nucleatum exhibited an even greater tumor burden compared to mice not inoculated with F. nucleatum, indicating a tumor-promoting effect. This finding conflicts with the interpretation presented in both the results and discussion sections.

      Siglec-7 lacks a direct orthologue in mice, and neither mouse TIGIT nor CEACAM1 binds F. nucleatum. The increased tumor burden observed in infected mice may therefore result from bacterial interference with immune cell infiltration and accumulation within the tumor microenvironment (Parhi, L., Alon-Maimon, T., Sol, A. et al. Breast cancer colonization by Fusobacterium nucleatum accelerates tumor growth and metastatic progression. Nat Commun 11, 3259 (2020)). Consequently, the NK cells that do reach the tumor site can recognize and kill F. nucleatum–bearing tumor cells through RadD–NKp46 interactions. In the absence of RadD, this recognition is impaired, leading to reduced NK-mediated cytotoxicity and increased tumor growth.

      (3) Although the authors acknowledge that F. nucleatum may have tumor context-specific roles in regulating NK cell responses, it is unclear why they chose a breast cancer model in which F. nucleatum has been reported to promote tumor growth. A more appropriate choice would have been the well-established preclinical oral cancer model, such as the 4-nitroquinoline 1-oxide (4NQO)-induced oral cancer model in C57BL/6 mice, which would more directly relate to HNSC biology.

      The tumor model we employed is, to date, the only model in which F. nucleatum has been shown to exert a measurable effect, which is why we selected it for our study (Parhi, L., Alon-Maimon, T., Sol, A. et al. Breast cancer colonization by Fusobacterium nucleatum accelerates tumor growth and metastatic progression. Nat Commun. 2020; 11: 3259). We have not tested the 4-nitroquinoline-1-oxide (4NQO)–induced oral cancer model, and we are uncertain whether its use would be ethically justified.

      (4) Since RadD of F. nucleatum can bind to both Siglec-7 and NKp46 on NK cells, exerting opposing functional effects, the expression profiles of both receptors on intratumoral NK cells should be evaluated. This would clarify the balance between activating and inhibitory signals in the tumor microenvironment and provide a more mechanistic explanation for the observed tumor context-dependent outcomes.

      This question was answered in Author response image 8 above.

    1. Author response:

      The following is the authors’ response to the original reviews.

      Reviewer #1 (Public review): 

      Summary: 

      This is an interesting study on the role of FGF signaling in the induction of primitive streak-like cells (PS-LC) in human 2D-gastruloids. The authors use a previously characterized standard culture that generates a ring of PS-LCs (TBXT+) and correlate this with pERK staining. A requirement for FGF signaling in TBXT induction is demonstrated via pharmacological inhibition of MEK and FGFR activity. A second set of culture conditions (with no exogenous FGFs) suggests that endogenous FGFs are required for pERK and TBXT induction. The authors then characterize, via scRNA-seq, various components of the FGF pathway (genes for ligands, receptors, ERK regulators, and HSPG regulation). They go on to characterize the pFGFR1, receptor isoforms, and polarized localization of this receptor. Finally, they perform FGF4 inhibition and use a cell line with a limited FGF17 inactivation (heterozygous null) and show that loss of these FGFs reduces PS-LC and derivative cell types.

      Strengths: 

      (1) As the authors point out, the role of FGF signaling in gastrulation is less well understood than other signaling pathways. Hence this is a valuable contribution to that field. 

      (2) The FGF4 and FGF17 loss-of-function experiments in Figure 5 are very intriguing. This is especially so given the intriguing observation that these FGFs appear to be dominating in this model of human gastrulation, in contrast to what FGFs dominate in mice, chicks, and frogs. 

      (3) In general this paper is valuable as a further development of the human gastruloid system and the role of FGF signaling in the induction of PS-LCs. The wide net that the authors cast in characterizing the FGF ligand gene, receptor isoforms, and downstream components provides a foundation for future work. As the authors write near the beginning of the Discussion "Many questions remain."

      We thank the reviewer for these positive comments.

      Weaknesses: 

      (1) FGFs are cell survival factors in various aspects of development. The authors fail to address cell death due to loss of FGF signaling in their experiments. For example, in Figure 1E (which requires statistical analysis) and 1G (the bottom FGFRi row), there appears to be a significant amount of cell loss. Is this due to cell death? The authors should address the question of whether the role of FGF/ERK signaling is to keep the cells alive. 

      Indeed, FGF also strongly affects cell survival, and it is an interesting question to what extent this depends on ERK. Our manuscript focuses instead on the role of FGF/ERK signaling in cell fate patterning. As mentioned in our discussion, Fig.1d,e shows that doxycycline-induced pERK leads to more TBXT+ cells than the control without restoring cell number, suggesting that the role of FGF in controlling cell number is independent of the requirement for FGF/ERK in PS-LC differentiation. To further support this, we have added data showing that low doses of MEKi are sufficient to inhibit differentiation without affecting cell number (Supp. Fig. 1i).

      To address the reviewer's question regarding the cause of cell loss, we have now stained for BrdU and cleaved Cas3 to assess proliferation and apoptosis in the presence and absence of MEK and FGFR inhibition (new Supp. Fig. 1e,f). This shows that the effect of these inhibitors on cell number is primarily due to a reduction in proliferation. We have also included statistical analysis in Fig.1e.

      (2) Regarding the sparse cells in 1G, is there a reduction in cell number only with FGFRi and not MEKi? Is this reproducible? Gattiglio et al (Development, 2023, PMID: 37530863) present data supporting a "community effect" in the FGF-induced mesoderm differentiation of mouse embryonic stem cells. Could a community effect be at play in this human system (especially given the images in the bottom row of 1G)? If the authors don't address this experimentally they should at least address the ideas in Gattiglio et al.

      Indeed, FGFRi reproducibly affects cell number more than MEKi, in line with the fact that pathways other than MAPK/ERK downstream of FGF (e.g. PI3K) play important roles in cell survival and growth. However, we think the lack of differentiation in MEKi and FGFRi in Fig.1g cannot be attributed to a loss of cells combined with a community effect. This is because, without FGFRi or MEKi, cells efficiently differentiate to primitive streak at much lower densities than those originally shown, consistent with the data we discuss in response to (1) arguing against a primarily indirect effect of FGF on PS-LC differentiation through cell density. In the context of directed differentiation (rather than 2D gastruloids), we have now shown in a controlled manner that the effect of MEKi and FGFRi does not depend on a community effect by repeating the experiment in Fig.1g while adjusting cell seeding densities to obtain similar final cell densities in all three conditions (new Fig.1g, new Supp Fig.1g). Furthermore, we have included new data showing that extremely sparse cells without MEKi or FGFRi still differentiate without problems (new Supp Fig 1h). We have also included Gattiglio et al in our revised discussion.

      (3) Do the FGF4 and FGF17 LOF experiments in Figure 5 affect cell numbers like FGFRi in Figure 1? 

      We did not observe major changes in cell number in the FGF4 and FGF17 loss of function experiments. This is in line with our observation that low levels of ERK signaling are sufficient to maintain proliferation (new Supp. Fig. 1i), and the fact that low levels of ERK signaling are maintained in the absence of FGF4 and FGF17 (Fig.5), likely by FGF2 (Fig. 2). In contrast, FGFRi treatment in Fig.1 leads to a nearly complete loss of FGF signaling (ERK and other pathways) that has a dramatic effect on cell number.

      Why examine PS-LC induction only in FGF17 heterozygous cells and not homozygous FGF17 nulls? 

      We were unable to obtain homozygous FGF17 nulls; it is not clear whether there is a specific reason for this. In the absence of homozygous nulls, we have now further corroborated our findings with additional knockdown data (described in response to other comments below).

      (4) The idea that FGF8 plays a dominant role during gastrulation of other species but not humans is so intriguing it warrants deeper testing. The authors dismiss FGF8 because its mRNA "...levels always remained low." (line 363) as well as the data published in Zhai et al (PMID: 36517595) and Tyser et al (PMID: 34789876). But there are cases in mouse development where a gene was expressed at levels so low, that it might be dismissed, and yet LOF experiments revealed it played a role or even was required in a developmental process. The authors should consider FGF8 inhibition or inactivation to explore its potential role, despite its low levels of expression. 

      We thank the reviewer for this suggestion. We have now analyzed the role of FGF8 using FISH to visualize its expression and siRNA to understand its function (Fig.5d,f,h; Supp.Fig.5e,g,6e). We found that FGF8 expression is higher earlier in differentiation, preceding most expression of TBXT. Our scRNA-seq only analyzed samples at 42h so did not capture this. Furthermore, FGF8 expression localized inside the PS-like ring rather than coinciding with it like FGF4. Surprisingly, FGF8 knockdown led to an increase in primitive streak-like differentiation, suggesting it may counteract FGF4. The results are shown in the revised Fig. 5 and Supplemental Fig. 5. While this certainly merits further investigation, understanding the role of FGF8 in more detail is beyond the scope of the current work. 

      (5) Redundancy is a common feature in FGF genetics. What is the effect of inhibiting FGF4 in FGF17 LOF cells? 

      Further siRNA and shRNA experiments showed that FGF17 knockdown had a much smaller effect than FGF4 knockdown on expression of primitive streak markers (Fig.5i, Supp.Fig.6f-i) but that FGF17 knockdown did lead to a complete loss of the mesoderm marker TBX6 (Fig.5j, Supp.Fig.6j). A double knockdown of FGF4+FGF17 looked similar to FGF4 alone (Supp.Fig.6k). Thus, we now think the more likely scenario is that FGF17 acts downstream of FGF4-dependent PS-differentiation. FGF17 may then feed back positively to enhance further PS-differentiation, which we previously interpreted as partial redundancy, but its primary role may be later, in mesoderm differentiation.

      (6) I suggest that the authors take more caution in describing FGF gradients. For example, in one Results heading they write "Endogenous FGF4 and FGF17 gradients underly the ERK activity pattern.", implying an FGF protein gradient. However, they only present data for FGF mRNA, not protein. This issue would be clarified if they used proper nomenclature for gene, mRNA (italics), and protein (no italics) throughout the paper.

      Thank you for the suggestion. We have edited the paper to more clearly distinguish protein and mRNA. We do think our data provide substantial indirect evidence for a protein gradient which is what the results heading is meant to convey. Receptor activation is high where ERK activity is high (Fig.3), and receptor activation is limited by ligands, since creating a scratch to let exogenous FGF reach the basal side of cells in the center leads to receptor activation (Fig.4). This strongly suggests ERK activity reflects an FGF protein gradient. 

      Reviewer #2 (Public review): 

      Summary: 

      The role of FGFs in embryonic development and stem cell differentiation has remained unclear due to its complexity. In this study, the authors utilized a 2D human stem cell-based gastrulation model to investigate the functions of FGFs. They discovered that FGF-dependent ERK activity is closely linked to the emergence of primitive streak cells. Importantly, this 2D model effectively illustrates the spatial distribution of key signaling effectors and receptors by correlating these markers with cell fate markers, such as T and ISL1. Through inhibition and loss-of-function studies, they further corroborated the needs of FGF ligands. Their data shows that FGFR1 is the primary receptor, and FGF2/4/17 are the key ligands for primitive streak development, which aligns with observations in primate embryos. Additional experiments revealed that the reduction of FGF4 and FGF17 decreases ERK activity. 

      Strengths: 

      This study provides comprehensive data and improves our understanding of the role of FGF signaling in primate primitive streak formation. The authors provide new insights related to the spatial localization of the key components of FGF signaling and attempt to reveal the temporal dynamics of the signal propagation and cell fate decision, which has been challenging.

      Weaknesses: 

      Given the solid data, the work only partially clarifies the complex picture of FGF signaling, so details remain somewhat elusive. The findings lack a strong punchline, which may limit their broader impact. 

      We thank this reviewer for their valuable feedback and compliment on the solidity of our data. The punchline of our work is that FGF4 and FGF17-dependent ERK signaling plays a key role in differentiation of human PS-like cells and mesoderm, and that these are different FGFs than those thought to drive mouse gastrulation. A second key point is that like BMP and TGFβ signaling, FGF signaling is restricted to the basolateral sides of pluripotent stem cell colonies due to polarized receptor expression, which is crucial for understanding the response to exogenous ligands added to the cell medium. Indeed, many facets of FGF signaling remain to be investigated in the future, such as how FGF regulates and is regulated by other signals, which we will dedicate a different manuscript to. 

      Reviewer #3 (Public review): 

      Jo and colleagues set out to investigate the origins and functions of localized FGF/ERK signaling for the differentiation and spatial patterning of primitive streak fates of human embryonic stem cells in a well-established micropattern system. They demonstrate that endogenous FGF signaling is required for ERK activation in a ring domain in the micropatterns, and that this localized signaling is directly required for differentiation and spatial patterning of specific cell types. Through high-resolution microscopy and transwell assays, they show that cells receive FGF signals through basally localized receptors. Finally, the authors find that there is a requirement for exogenous FGF2 to initiate primitive streak-like differentiation, but endogenous FGFs, especially FGF4 and FGF17, fully take over at later stages.

      Even though some of the authors' findings - such as the localized expression of FGF ligands during gastrulation and the importance of FGF/ERK signaling for cell differentiation in the primitive streak - have been reported in model organisms before, this is one of the first studies to investigate the role of FGF signaling during primitive streak-like differentiation of human cells. In doing so, the paper reports a number of interesting and valuable observations, namely the basal localization of FGF receptors which mirrors that of BMP and Nodal receptors, as well as the existence of a positive feedback loop centered on FGF signaling that drives primitive-streak differentiation. The authors also perform a comparison of the role of different FGFs across species and try to assign specific functions to individual FGFs. In the absence of clean genetic loss-of-function cell lines, this part of the work remains less strong. 

      We thank the reviewer for emphasizing the value of our findings in a human model for gastrulation. We agree that more loss-of-function experiments would provide further insight into the role of different FGFs. While we did not manage to create knockout cell lines, we have now performed both siRNA and shRNA knockdown of FGF4 and FGF17 in two different hPSC lines, performed siRNA knockdown of FGF8, and also made an FGF4+FGF17 shRNA double-knockdown cell line to more completely test the functions of the individual FGFs (Fig.5, Supp.Fig.5,6). Our data suggest FGF17 may be downstream of FGF4 and primarily required for mesoderm differentiation, while FGF8 appears to counteract FGF4. In doing this we have added a large amount of new data to the manuscript, and we have removed the heterozygous knockout data in the first version of the manuscript, which we felt added little to the new data. Further experiments are still needed to solidify our interpretation, but those are beyond the scope of the current work.

      Reviewer #1 (Recommendations for the authors): 

      (1) FGF2 is added to culture experiments (e.g. Figure 4), but the commercial source is not mentioned in Methods. For example, it could be added to "Supplementary Table 1: Cell signaling reagents." 

      We apologize for this oversight and have now added the information to Supplementary Table 1.

      (2) Line 117-118: "For example, by controlling the expression of Wnt or Nodal which are both required for PS-like differentiation". It is clear what the authors mean, but this is not a complete sentence. 

      We edited this for clarity; it now reads: “First, is FGF/ERK signaling required directly for PS-like differentiation, or does it act indirectly? These possibilities are not mutually exclusive. For example, FGF/ERK could be required directly but also act indirectly by controlling Wnt or Nodal expression, as both Wnt and Nodal signaling are required for PS-like differentiation.”

      (3) Line 246 "...found its spatial pattern to strongly resembles that of pERK..." either remove "to" or change "resembles" to "resemble" 

      Thank you for catching this. We removed “to”.

      (4) Lines 391- 393 seem to be missing a word in the last phrase: "...with FGF17 more important continued differentiation to mesoderm and endoderm." Maybe "during" after the word "important"? 

      Thank you for catching this, indeed the word “during” was missing and we have now added it.

      (5) Please define acronyms in Figure 3D (PS-LC was defined previously, but not others). 

      We apologize for the oversight, we have now defined the acronyms.

      (6) The three blue lines in Figure 5B (right) are hard to discern (and I'm not colorblind). I suggest also using a variety of dotted lines in a subset of these FGFs. 

      Thank you for the suggestion. We have now given all the FGFs colors that are more clearly distinct and made the TBXT and TBX6 lines dashed.

      Reviewer #2 (Recommendations for the authors): 

      (1) The reviewer acknowledges that FGF signaling is complex, particularly when dynamics and its correlation with cell fates are considered. To improve the clarity of the findings, the authors are encouraged to provide an additional schematic figure that clearly delineates the main findings of this study.  

      Thank you for the suggestion. We have now added a summary figure (Fig.6) to our discussion, which we hope helps present our findings more clearly.

      (2) The data suggest that FGF signaling may function differently in mice compared to primates, and their stem cell model aligns more closely with the latter. While the authors discuss this in the contents only based on sequencing data, it would be valuable to conduct some experiments with mouse embryos to validate the key differences. 

      It is unclear to us which experiments the reviewer has in mind. The mouse literature contains ample data on FGF expression as well as many knockout phenotypes. Furthermore, verifying loss-of-function phenotypes (e.g. FGF17 knockout) in mouse is beyond our expertise.

      (3) Heparan sulfate proteoglycan (HSPG) is mentioned as an important component of FGF signaling; however, the only data related to HSPG is single-cell sequencing results. The authors should consider performing immunostaining or other assays to validate HSPG expression and spatial distribution, similar to the approach they used for other signaling components. 

      Our scratch experiments in Fig. 4 strongly argue against HSPGs being responsible for the spatial pattern of FGF receptor activation: after a scratch across the colony, the response is strong all along the scratch, as expected if the presence of FGF (an FGF gradient) controls the level of activity. If HSPGs were limiting, FGF flowing in from the media should not be able to uniformly activate receptors around the scratch.

      In addition, we have now included an immunostain for HS in a newly added Supp. Fig. 4, which does not explain the observed pattern of ERK signaling.

      (4) In the scratch experiment, particularly high pERK expression is observed at the edge of the scratch. The authors should provide an explanation for why this expression is significantly higher compared to the edges of the colony. Additionally, it would be interesting to investigate the fate of the cells with super high pERK expression.

      We have now determined that an adaptive response to FGF is the reason that the response around the scratch is initially much higher than in the ERK activity ring that overlaps with the primitive streak-like cells. We have added figures showing that although the initial response to FGF exposure after scratching is very high, the response around the scratch adapts to levels similar to those in the ERK ring over the course of 6 hours (Fig.4ij).

      (5) For some of the key experiments, multiple cell lines should be used to ensure that the findings are reproducible and applicable across different human stem cell lines.

      We have now checked FISH stainings and knockdown phenotypes for different FGFs in two different cell lines: ESI17 (hESC, XX) and PGP1 (hiPSC, XY). These results are shown in Supplementary Figure 6. We found all results to be consistent.

      (6) Where applicable, the meaning of error bars needs to be more clearly presented, including details on the number of independent experiments or samples used. 

      Thank you for pointing this out. Where error bar definitions were missing we have now added them to the figure captions.

      Reviewer #3 (Recommendations for the authors): 

      (1) The authors only analyze the ppERK ring in micropatterns of a single size. What was the motivation for the choice of this size? Can the authors explain how the ppERK ring is expected to depend on colony size?

      Much smaller patterns lose the interior pluripotent regions, while much larger patterns have a much larger pluripotent region, which requires larger tilings to image without providing additional insight. The colony size dependence of cell fate patterning was described in the paper that established the 2D gastruloid model (Warmflash Nat Methods 2014), and we later showed that this is due to a fixed length scale of the BMP and Nodal signaling gradients from the colony edge (Jo et al eLife 2022). We have now included data showing that the ERK pattern behaves similarly, with a fixed length scale of the pattern implying that in smaller colonies the ERK ring becomes a disc and the entire center of the colony has high ERK signaling (Supp Fig 1a).

      (2) The scRNAseq is somewhat confusing - why do the two datasets not overlap in the PHATE representation? This is unexpected, because the two samples have been treated similarly, and the authors have integrated their data to iron out possible batch effects. This discrepancy should be discussed. The authors should also specify exactly which reference the first dataset comes from.

      The two datasets do overlap well: the same fates are well mixed in the same place and the gene expression profiles for the integrated data (e.g., Fig.2e) look smooth, so we believe the integration is good. However, different cell fates are represented to different degrees. In particular, sample 2 shows much more mesoderm differentiation, making the mesoderm branch mostly orange. Occasionally samples differentiate faster or slower than average, which we see here, and these samples were collected far apart in time. We do not believe this affects our conclusions; if anything, performing the analysis on two samples that differ this much should make the conclusions more robust.

      (3) I find it intriguing that exogenous FGF2 is important early on for primitive streak-like differentiation, although the authors show that it does not reach the center of the colony. The authors may want to discuss this conundrum. Does the FGF2 effect propagate from the outside to the inside, or does it act at an early stage when the cells have not yet formed a tight epithelium on the micropattern?

      The cells in the experiment in Fig. 5a were given 24h to epithelialize, so we do believe it acts from the edge. We believe this may be due to FGF2 modulating the early BMP response on the edge and are working on a manuscript that further explores this pathway crosstalk.

      (4) The authors' statement that FGF4 and FGF17 have partially redundant functions is not very strong, mainly because the study lacks a full FGF17 loss-of-function cell line. If the authors wanted to improve on this point, they could knock down FGF4 in the FGF17 heterozygous line, or produce a homozygous FGF17 KO line. If there are specific reasons why FGF17 homozygous lines cannot be produced, this could be interesting to discuss, too. Finally, I noticed that the methods list experiments with an FGF17 siRNA, but these are not shown in the manuscript. 

      We agree our evidence was previously not as strong as it could be. While there is no reason we know of why homozygous knockout lines cannot be produced, we failed to produce one. To strengthen our evidence we have therefore included substantial new knockdown data. We have now performed both siRNA and shRNA knockdown of FGF4 and FGF17 in two different hPSC lines, performed siRNA knockdown of FGF8, and also made an FGF4+FGF17 shRNA double-knockdown cell line to more completely test the functions of the individual FGFs (Fig.5, Supp.Fig.5,6). These experiments showed that FGF17 knockdown had a much smaller effect than FGF4 knockdown on expression of primitive streak markers (Fig.5i, Supp.Fig.6f-i) but that FGF17 knockdown did lead to a complete loss of the mesoderm marker TBX6 (Fig.5j, Supp.Fig.6j). A double knockdown of FGF4+FGF17 looked similar to FGF4 alone (Supp.Fig.6k). Thus, we now think the more likely scenario is that FGF17 acts downstream of FGF4-dependent PS-differentiation. FGF17 may then feed back positively to enhance further PS-differentiation, which we previously interpreted as partial redundancy, but its primary role may be later, in mesoderm differentiation. Furthermore, our new data suggest FGF8 may counteract FGF4 and limit PS-like differentiation.

      Minor 

      (5) Line 63: Reference(s) appear to be missing. 

      This whole paragraph summarizes the results of the references given on line 55; we have now repeated the relevant references where the reviewer indicated.

      (6) Supplementary Figure 1a,b does not show ppERK, unlike stated in lines 102 - 104. 

      Indeed, the data described in lines 102-104 is shown in Fig.1a and we have removed the original Supplementary Figure 1ab since it did not provide relevant information.

      (7) Line 201: It is not clear whether this is a new sequencing dataset, or if existing datasets have been reanalyzed. 

      We agree our description was unclear. We have edited the text, which now explicitly states that our analysis is based on one dataset we collected previously and a replicate that was newly collected and deposited on GEO for this manuscript.

      (8) Figure 2f; Supplementary Figure 2b, c: The colors need to be explained in scale bars. How has this data been normalized to allow for comparison between very different sample types? 

      We have now added color bars indicating the scale for each of these figure panels. As the caption stated, the interspecies comparison was normalized within each species, so the highest FGF level for any FGF at any time within each species is normalized to one. We are thus comparing between species the relative expression of different FGFs within each species. Indeed there is no good way to compare absolute expression between species. For extra clarity we have expanded our description of the interspecies comparison analysis and normalization in the methods section.
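
      The within-species normalization described above can be sketched as follows. This is only a minimal illustration under our own assumptions, not the authors' analysis code: the function name `normalize_within_species`, the nested-dictionary layout, and all numbers are hypothetical and chosen purely for demonstration.

```python
def normalize_within_species(expression):
    """Scale expression so the peak value within each species equals 1.

    `expression` maps species -> gene -> list of levels over time
    (a hypothetical layout chosen for this sketch).
    """
    normalized = {}
    for species, genes in expression.items():
        # Highest level for any gene at any time point in this species.
        peak = max(max(levels) for levels in genes.values())
        if peak <= 0:
            raise ValueError(f"no positive expression values for {species}")
        normalized[species] = {
            gene: [level / peak for level in levels]
            for gene, levels in genes.items()
        }
    return normalized


# Made-up numbers for illustration only; not data from the manuscript.
toy = {
    "human": {"FGF4": [1.0, 4.0, 8.0], "FGF8": [2.0, 1.0, 0.5]},
    "mouse": {"FGF4": [10.0, 20.0, 30.0], "FGF8": [50.0, 100.0, 80.0]},
}
result = normalize_within_species(toy)
```

      After this scaling, each species has a maximum of one, so only the relative ranking of FGFs within a species is compared across species; absolute levels between species carry no meaning in the result.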

      (9) Line 232: Where is the expression of SEF shown? 

      It is shown in Fig. 2i, under the official gene name IL17RD.

      (10) Supplementary Figure 4 seems to be missing. 

      Thank you for pointing this out. We have now added a supplementary Fig.4.

      (11) Line 437: Citation needed. 

      We have included citations now.

      (12) Line 439: A similar feedback loop has been proposed to operate during mesoderm differentiation in mouse ESC (pmid: 37530863 ). The authors may consider citing this work. 

      Thank you for the suggestion, we have now included this work in the discussion. The feedback loop proposed in that work involves FGF8, while we were trying to explain why FGF4 and not FGF8 appears to be conserved across species by invoking an FGF4 feedback loop. Thus, it becomes even harder to explain differences in FGF4 and FGF8 expression between human and mouse gastrulation.

      (13) Supplementary Figure 6 is not described in the main text. 

      We have removed the original Supplementary Figure 6 and the corresponding heterozygous knockout data in the main figure, which we felt added little to the extensive knockdown data we now present. We did create a new Supplementary Figure 6 showing additional knockdown data, which is described in the main text.

      (14) Submission of sequencing data to GEO needs to be updated. 

      We have now made the GEO data public.

    1. Author response:

      The following is the authors’ response to the previous reviews

      Reviewer #1 (Public review):

      All experiments are convincing, clearly visualized and quantified. 

      The strength of the paper is that it clearly indicates that there are temporally controlled feedback systems, which are important for endothelial collective cell behavior. 

      A limitation of the study is that the inhibitory studies in vivo may include off-target effects as well. Future endeavors, including specific knockout models, optogenetics and/or transgenic zebrafish lines that visualize endothelial cell properties (proliferation and migration) will be informative to track individual endothelial cell responses upon feedback signals.

      We agree with the reviewer and are currently conducting experiments with optogenetic tools, knockout models, and transgenic zebrafish lines to dissect the feedback loop dynamics at the cellular scale.    

      Reviewer #2 (Public review):

      Major strengths: The combination of in vitro and in vivo assessment provides evidence for timing in physiologically relevant contexts, and rigorous quantification of outputs is provided. The idea of defining temporal aspects of the system is quite interesting. New RNA profiling supports the model. 

      Weaknesses: Actinomycin D blocks most transcription so exposure for hours likely leads to secondary and tertiary effects and perhaps effects on viability.

      We agree with the reviewer that “off-target” effects are a limitation of the pharmacologic approach. We have also previously shown that long-term treatment with actinomycin D reduces ECFC survival (Mason et al., 2019). 

      Reviewer #3 (Public review):

      Strengths: The authors conduct the ASPM assay to find the time scale of feedback when ECFCs attach to three different matrices. They observe a common time scale of feedback. Thus, under the very specific conditions they use, the reproducibility is validated by their ASPM assay. The feedback loop mediated by inhibition of gene expression by Actinomycin D is similar to that obtained from YAP/TAZ-depleted cells, suggesting that mechanotransduction might be involved in the feedback loop. The time scale representing the inflection point might be interesting when considering the continuous motility in cultured endothelial cells, although it might not account for the migration of endothelial cells that is controlled by a wide variety of extracellular cues. In vivo, stiffness of extracellular matrix is merely one of the cues. 

      Weaknesses: ASPM assay is based on attachment-dependent phenomenon. The time scale including the inflection point determined by ASPM experiments using cultured cells and the mechanotransduction-based theory do not seem to fit in vivo ISV elongation. Although it is challenging to find the conserved theory of continuous cell motility of endothelial cells, the data is preliminary and does not support the authors' claim. There is no evidence that mechanotransduction solely determines the feedback loop during elongation of ISVs. The points to be addressed are listed in recommendations for the authors.

      The ASPM assay enabled us to define temporal dynamics of YAP/TAZ mechanotransduction. We then used those insights to design ISV washout experiments that tested if the characteristic time scales were conserved in vivo. However, we agree with the limitations identified by the reviewer. Cells behave and respond to mechanical cues differently in 2D vs 3D environments, and the microenvironment in vivo is much more complex. Future work with optogenetic tools will be useful to dissect the temporal kinetics in vivo during ISV elongation.

    1. Author response:

      The following is the authors’ response to the previous reviews.

      Reviewer #3 (Public review): 

      Summary: 

      The manuscript explores behavioral responses of C. elegans to hydrogen sulfide, which is known to exert remarkable effects on animal physiology in a range of contexts. The possibility of genetic and precise neuronal dissection of responses to H2S motivates the study of responses in C. elegans. The revised manuscript does not seem to have significantly addressed what was lacking in the initial version. 

      The authors have added further characterization of possible ASJ sensing of H2S by calcium imaging but ASJ does not appear to be directly involved. Genetic and parallel analysis of O2 and CO2 responsive pathways do not reveal further insights regarding potential mechanisms underlying H2S sensing. Gene expression analysis extends prior work. Finally, the authors have examined how H2S-evoked locomotory behavioral responses are affected in mutants with altered stress and detoxification response to H2S, most notably hif-1 and egl-9. These data, while examining locomotion, are more suggestive that observed effects on animal locomotion are secondary to altered organismal toxicity as opposed to specific behavioral responses. 

      Overall, the manuscript provides a wide range of intriguing observations, but mechanistic insight or a synthesis of disparate data is lacking. 

      We thank the reviewer for the valuable feedback. We agree that while our investigation provides broad coverage, it does not fully resolve the mechanisms of H<sub>2</sub>S perception. As both reviewers noted, the avoidance response to high levels of H<sub>2</sub>S is most likely driven by its toxicity, particularly at the level of mitochondria, rather than by direct perception of H<sub>2</sub>S. We also favor this model and have revised the results and discussion to highlight this interpretation, while acknowledging that other mechanisms cannot be excluded (main changes lines 387-402 and 535-547).

      Building on this view, our observations point toward mitochondrial ROS transients as the trigger for H<sub>2</sub>S avoidance. First, toxic levels of H<sub>2</sub>S are known to promote ROS production (1). Second, similar to acute H<sub>2</sub>S, brief exposure to rotenone, an ETC complex I inhibitor that rapidly generates mitochondrial ROS, triggers locomotory responses (Figure 7E) (Lines 393-396). Third, regardless of duration, rotenone exposure inhibits H<sub>2</sub>S-evoked avoidance (Figure 7E) (Lines 389-391), likely by preventing or dampening H<sub>2</sub>S-evoked mitochondrial ROS bursts when ETC function is impaired and ROS is already high. Notably, animals subjected to prolonged rotenone exposure, ETC mutants, and quintuple sod mutants, each experiencing chronically high ROS levels, fail to respond to H<sub>2</sub>S and display reduced locomotory activity, presumably due to ROS toxicity and/or activation of stress-adaptive mechanisms (Figure 7).

      Consistent with the activation of stress-responsive pathways, H<sub>2</sub>S exposure alters expression of genes controlled by SKN-1 and HIF-1 signaling. Both pathways are ROS-sensitive and promote adaptation to chronic ROS production (2-4). Their activation, as in egl-9, renders these animals insensitive to H<sub>2</sub>S-evoked ROS transients (Figure 5B) (Lines 303-305). Conversely, mutants defective in these adaptive pathways, such as hif-1, still show initial locomotory responses to H<sub>2</sub>S, but rapidly lose activity during prolonged H<sub>2</sub>S exposure (Figure 5D) (Lines 318-319). These observations suggest that the HIF-1 pathway is dispensable for initiating the response to H<sub>2</sub>S-evoked ROS transients, but essential for protecting against ROS toxicity.

      In this context, the neurons we examined, such as ASJ, are not directly involved in H<sub>2</sub>S perception (Lines 165-169 and 448-457). Instead, they likely modulate a circuit that is responsive to ROS toxicity. This circuit is also influenced by ambient O<sub>2</sub> levels, the state of the O<sub>2</sub>-sensing circuit, and nutrient status, in a manner reminiscent of the CO<sub>2</sub> responses (5, 6).

      Reviewer #4 (Public review): 

      Summary: 

      The authors establish a behavioral paradigm for avoidance of H2S and conduct a large candidate screen to identify genetic requirements. They follow up by genetically dissecting a large number of implicated pathways - insulin, TGF-beta, oxygen/HIF-1, and mitochondrial ROS, which have varied effects on H2S avoidance. They additionally assay whole-animal gene expression changes induced by varying concentrations and durations of H2S exposure. 

      Strengths: 

      The implicated pathways are tested extensively through mutants of multiple pathway molecules. The authors address previous reviewer concerns by directly testing the ability of ASJ to respond to H2S via calcium imaging. This allows the authors to revise their previous conclusion and determine that ASJ does not directly respond to H2S and likely does not initiate the behavioral response. 

      We thank the reviewer for the supportive comments.

      Weaknesses: 

      Despite the authors' focus on acute perception of H2S, I don't think the experiments tell us much about perception. I think they indicate pathways that modulate the behavior when disrupted, especially because most manipulations used broadly affect physiology on long timescales. For instance, genetic manipulation of ASJ signaling, oxygen sensing, HIF-1 signaling, mitochondrial function, as well as starvation are all expected to constitutively alter animal physiology, which could indirectly modulate responses to H2S. The authors rule out effects on general locomotion in some cases, but other physiological changes could relatively specifically modulate the H2S response without being involved in its perception. 

      I am actually not convinced that H2S is directly perceived by the C. elegans nervous system at all. As far as I can tell, the avoidance behavior could be a response to H2S-induced tissue damage rather than the gas itself. 

      We thank the reviewer for the valuable insights, and fully agree that H<sub>2</sub>S may not be directly perceived by C. elegans. Please see detailed responses below.

      Reviewer #4 (Recommendations for the authors): 

      The clarity of the paper is improved in this version. My main issue has to do with "perception" of H2S. At times the authors suggest that hydrogen sulfide should be perceived by a neural circuit ("we did not specifically identify the neural circuit mediating H2S signaling"), while at other times they discuss the possibility that it is not directly perceived neuronally ("Supporting the idea that acute mitochondrial ROS generation initiates avoidance of high H2S levels,"). The authors should clearly state their model for H2S perception. Do they think there is a receptor and sensory neuron for H2S (not identified in this paper)? If not, what does it mean for there to be a neural circuit mediating the response? To me, it looks more like what is being "perceived" by a neural circuit is ROS-induced toxicity, not H2S itself. 

      To drill down on direct modulation of acute perception, are any of the pathway manipulations used in this paper performed on the timescale of perception? Rotenone for 10 mins is close to that timescale, and in fact it increases speed independently of H2S, consistent with ROS-induced toxicity, not H2S being the signal that induces the behavior. Optogenetic activation of RMG could also be on the acute timescale. Can the authors clarify for how long blue light was on the worms before the start of the assay? Or was it turned on at the same time as video acquisition commenced? This could be evidence that RMG acutely modulates this behavioral response. 

      I feel that the ASJ calcium imaging data should be in the main figure given its importance in revising the original model. 

      We thank the reviewer for the valuable advice.

      As suggested, ASJ calcium imaging data are displayed in the main figure (Figure 2I) (Line 167).

      As both reviewers noted, our initial presentation was not sufficiently clear regarding the mechanism underlying H<sub>2</sub>S avoidance. We agree with the reviewer that H<sub>2</sub>S avoidance is unlikely to be mediated by direct perception via an H<sub>2</sub>S-specific receptor, and more likely arises from acute mitochondrial dysfunction and ROS generation. 

      ROS

      In line with the reviewer’s perspective, our observations point toward mitochondrial ROS transients as the trigger for H<sub>2</sub>S avoidance. First, toxic levels of H<sub>2</sub>S are known to promote ROS production (1). Second, similar to acute H<sub>2</sub>S, brief exposure to rotenone, an ETC complex I inhibitor that rapidly generates mitochondrial ROS, triggers locomotory responses (Figure 7E) (Lines 393-396). Third, regardless of duration, rotenone exposure inhibits H<sub>2</sub>S-evoked avoidance (Figure 7E) (Lines 389-391), likely by preventing or dampening H<sub>2</sub>S-evoked mitochondrial ROS bursts when ETC function is impaired and ROS is already high. Notably, animals subjected to prolonged rotenone exposure, ETC mutants, and quintuple sod mutants, each experiencing chronically high ROS levels, fail to respond to H<sub>2</sub>S and display reduced locomotory activity, presumably due to ROS toxicity and/or activation of stress-adaptive mechanisms (Figure 7). We revised the Results and Discussion to present the model more consistently (main changes lines 387-402 and 535-547).

      Consistent with the activation of stress-responsive pathways, H<sub>2</sub>S exposure alters expression of genes controlled by SKN-1 and HIF-1 signaling. Both pathways are ROS-sensitive and promote adaptation to chronic ROS production (2-4). Their activation, as in egl-9, renders these animals insensitive to H<sub>2</sub>S-evoked ROS transients (Figure 5B) (Lines 303-305). Conversely, mutants defective in these adaptive pathways, such as hif-1, still show initial locomotory responses to H<sub>2</sub>S, but rapidly lose activity during prolonged H<sub>2</sub>S exposure (Figure 5D) (Lines 318-319). These observations suggest that the HIF-1 pathway is dispensable for initiating the response to H<sub>2</sub>S-evoked ROS transients, but essential for protecting against ROS toxicity.

      ASJ neurons

      ASJ neurons and DAF-11 signaling are required for H<sub>2</sub>S-evoked behavioral responses. However, ASJ does not exhibit an H<sub>2</sub>S-evoked calcium transient. This suggests that ASJ neurons do not directly detect H<sub>2</sub>S (Lines 165-169 and 448-457), but likely modulate the circuit responsive to ROS toxicity. This circuit can also be modulated by ambient O<sub>2</sub> levels, the state of the O<sub>2</sub>-sensing circuit, and nutrient status, in a manner reminiscent of the CO<sub>2</sub> responses (5, 6). 

      O<sub>2</sub>-sensing circuit

      Consistent with the reviewer’s view, we favor the model that H<sub>2</sub>S avoidance is likely induced by ROS transients. We believe that the state of the O<sub>2</sub>-sensing circuit, similar to ASJ neurons, modulates the neural circuit that is responsive to H<sub>2</sub>S-evoked ROS toxicity. This circuit is inhibited as long as the O<sub>2</sub>-sensing circuit is active. In the RMG optogenetic experiment, channelrhodopsin was photo-stimulated as soon as the assay was initiated at 7% O<sub>2</sub> (Methods Lines 633-634 and Figure legend Lines 1177-1178); RMG therefore remained active throughout the assay, including at 7% O<sub>2</sub>. Our interpretation is that RMG activation inhibits this ROS-responsive circuit and H<sub>2</sub>S avoidance. However, these observations do not resolve whether H<sub>2</sub>S is acutely and directly perceived. The modulation of the H<sub>2</sub>S response by the O<sub>2</sub> circuit is discussed in Lines 437-447.

      References

      (1) J. Jia et al., SQR mediates therapeutic effects of H(2)S by targeting mitochondrial electron transport to induce mitochondrial uncoupling. Sci Adv 6, eaaz5752 (2020).

      (2) S. J. Lee, A. B. Hwang, C. Kenyon, Inhibition of Respiration Extends C. elegans Life Span via Reactive Oxygen Species that Increase HIF-1 Activity. Current Biology 20, 2131-2136 (2010).

      (3) C. Lennicke, H. M. Cocheme, Redox metabolism: ROS as specific molecular regulators of cell signaling and function. Mol Cell 81, 3691-3707 (2021).

      (4) D. A. Patten, M. Germain, M. A. Kelly, R. S. Slack, Reactive oxygen species: stuck in the middle of neurodegeneration. J Alzheimers Dis 20 Suppl 2, S357-367 (2010).

      (5) A. J. Bretscher, K. E. Busch, M. de Bono, A carbon dioxide avoidance behavior is integrated with responses to ambient oxygen and food in Caenorhabditis elegans. Proc Natl Acad Sci U S A 105, 8044-8049 (2008).

      (6) E. A. Hallem, P. W. Sternberg, Acute carbon dioxide avoidance in Caenorhabditis elegans. Proc Natl Acad Sci U S A 105, 8038-8043 (2008).

    1. Author response:

      Reviewer #1 (Public review):

      Summary:

      Okazaki et al. showed flickering stimuli to patients with unilateral spatial neglect (USN) and measured EEG responses. They compared this with another patient group (post-stroke, but no USN) and healthy controls. The authors' rationale was to entrain intrinsic brain rhythms using the flicker of different frequencies (3-30 Hz). Effects found unique to the 9-Hz stimulation condition differentiate USN patients from the other groups, leading them to conclude that USN can be characterized by increased hemispheric alpha asymmetry, driven by a relatively increased response in the intact hemisphere.

      Strengths:

      This study is principled empirical work that benefits from access to special patient groups of considerable size (about 60 stroke patients in total, and 20 USN). The authors use state-of-the-art established methods to (1) deliver and (2) quantify the responses to the flicker stimulation in the EEG recordings. In addition, they use phase-coupling measures to investigate cross-frequency coupling (here: alpha-gamma) and a measure of directed connectivity between brain areas, transfer entropy. The results are supported by means of simulations using a coupled-oscillators model.

      Weaknesses:

      In my eyes, the major conceptual weakness of the study is that the authors make the a priori assumption that the flicker stimulation entrains intrinsic brain rhythms, especially alpha (9 Hz). To date, there is no direct (and only equivocal indirect) evidence that alpha rhythms can be entrained with periodic visual stimulation. In the present study, the assumption of alpha entrainment permeates some analytical decisions - where it would be possible to separate stimulus-driven from intrinsic rhythms more strongly than is currently the case, potentially yielding deeper insights into the oscillopathy of USN - and, ultimately, the interpretation of the results. Another potential issue to consider here is the analysis of gamma rhythms in EEG data, absent a control of miniature eye movements, a known problem (Yuval-Greenberg et al., 2008, https://doi.org/10.1016/j.neuron.2008.03.027) that may be exacerbated here, given that USN patients could show different auxiliary gaze behaviour.

      Reviewer #1 expressed concern that alpha entrainment is assumed a priori; however, our interpretation is based on the empirical observation of frequency-specific (9 Hz) hemispheric asymmetry, not on a prior assumption. This 9 Hz specificity is difficult to explain by a simple summation of stimulus-evoked responses and is more appropriately interpreted as a resonance phenomenon in the alpha band, which is close to the intrinsic resonance frequency of the visual system [1, 2]. In the revision, we will strengthen the conceptual distinction between stimulus-driven and intrinsic components and clarify that entrainment is a conclusion supported by our data and modeling.

      Gamma contamination by eye movements is a valid theoretical concern. However, it is unlikely that saccadic spike potentials explain our α-γ coupling findings, due to several factors including timing constraints and spectral properties. In the revision, we will add explicit discussion of this limitation while explaining why our coupling patterns are more consistent with physiological neural coupling than with artifacts.

      Reviewer #2 (Public review):

      This study investigates how altered neural oscillations may contribute to unilateral spatial neglect (USN) following right-hemisphere stroke. By combining steady-state visual evoked potentials (SSVEPs), phase-amplitude coupling (PAC), transfer entropy (TE), and computational modeling, the authors aim to show that USN arises from disrupted hemispheric synchronization dynamics rather than simply from lesion extent. The integration of empirical EEG data with a mechanistic model is a major strength and offers a valuable new perspective on how frequency-specific neural dynamics relate to clinical symptoms.

      The work has several notable strengths. The combination of experimental and modeling approaches is innovative and powerful, and the findings provide a coherent mechanistic framework linking abnormal neural entrainment to attentional deficits. The study also provides concrete evidence to support the potential for frequency-specific neuromodulatory interventions, which could have translational relevance. At the same time, there are areas where the evidence could be clarified or contextualized further. The manuscript would benefit from more detailed characterization of lesions, since differences in lesion topography (white vs. gray matter, occipital vs. parietal areas) could greatly improve our understanding of the physiopathology causing unilateral spatial neglect and the altered neural oscillations reported. Methodological choices, such as focusing analyses on occipital electrodes rather than parietal sites, and the potential influence of volume conduction in transfer entropy analyses, also need clearer justification/elaboration. In addition, while the authors report several neural metrics, it is not always clear why SSVEP power was chosen as the primary correlate of clinical severity over other measures. More broadly, the manuscript would be strengthened by clearer definitions of dependent variables and reporting of software and toolboxes used.

      Overall, the study makes a significant contribution by demonstrating that USN can be conceptualized as a disorder of disrupted oscillatory dynamics. With some clarifications and expansions, the paper will provide readers with a clearer understanding of both the strengths and the limitations of the evidence, and it will stand as a valuable reference for future work on oscillatory mechanisms in stroke and attention.

      We agree that further lesion characterization would be generally useful. However, as shown in Supplementary Figure 1, lesions in our USN cohort involved both cortical and subcortical regions, and cortical damage often extended into adjacent white matter. Therefore, a strict gray-versus-white-matter classification was not feasible. This anatomical diversity suggests that the frequency-specific hemispheric asymmetry observed here cannot be fully explained by lesion location or size alone, but rather may reflect altered network dynamics following right-hemisphere damage. We will clarify this point in the revised Discussion.

      Regarding transfer entropy (TE) and volume conduction, TE is theoretically insensitive to zero-lag correlations and quantifies temporally directed information transfer. Furthermore, we used amplitude envelopes rather than raw oscillations as input, which should greatly reduce the risk of spurious causal estimation due to sinusoidal autocorrelation structure. Moreover, if such spurious connectivity due to autocorrelation had occurred, it would have been expected to appear equally in both feedforward and feedback directions. Therefore, the feedforward-limited (visual→frontal) asymmetry observed in our study cannot be explained by volume conduction or autocorrelation effects. We will maintain this position clearly in the revision.
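
      To make the notion of directed information transfer concrete, the following is a minimal sketch of a binned transfer entropy estimate with a single-sample history. It is an illustrative assumption, not the estimator, embedding, or parameters used in the study: the function names, the equal-frequency binning, the number of bins, and the one-sample lag are all hypothetical choices, and the two signals are synthetic stand-ins for amplitude envelopes.

```python
import numpy as np


def discretize(x, n_bins):
    """Map a signal to equal-frequency bin labels 0..n_bins-1 via ranks."""
    ranks = np.argsort(np.argsort(x))
    return (ranks * n_bins // len(x)).astype(int)


def transfer_entropy(source, target, n_bins=4, lag=1):
    """Plug-in estimate (in bits) of TE from source to target.

    TE = sum p(y', y, x) * log2[ p(y' | y, x) / p(y' | y) ],
    where y' is the target `lag` samples ahead, and y and x are the
    current target and source values (history length 1).
    """
    s = discretize(np.asarray(source, dtype=float), n_bins)
    t = discretize(np.asarray(target, dtype=float), n_bins)
    t_future, t_past, s_past = t[lag:], t[:-lag], s[:-lag]

    # Joint histogram over (future target, past target, past source).
    joint = np.zeros((n_bins, n_bins, n_bins))
    np.add.at(joint, (t_future, t_past, s_past), 1)
    p_joint = joint / joint.sum()

    p_past_pair = p_joint.sum(axis=0)       # p(y, x)
    p_target_pair = p_joint.sum(axis=2)     # p(y', y)
    p_target_past = p_joint.sum(axis=(0, 2))  # p(y)

    te = 0.0
    for i, j, k in zip(*np.nonzero(p_joint)):
        ratio = (p_joint[i, j, k] * p_target_past[j]) / (
            p_past_pair[j, k] * p_target_pair[i, j]
        )
        te += p_joint[i, j, k] * np.log2(ratio)
    return te


# Synthetic example: x drives y with a one-sample delay, not vice versa.
rng = np.random.default_rng(0)
n = 5000
x = rng.standard_normal(n)
noise = rng.standard_normal(n)
y = np.empty(n)
y[0] = noise[0]
y[1:] = 0.8 * x[:-1] + 0.2 * noise[1:]

te_xy = transfer_entropy(x, y)  # driving direction
te_yx = transfer_entropy(y, x)  # reverse direction
```

      In this toy case the estimate in the driving direction is much larger than in the reverse direction, which is the kind of directional asymmetry the argument above relies on. Plug-in estimates of this type carry a small positive bias, so in practice they are usually compared against surrogate data.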

      Regarding other methodological points: we focused on occipital electrodes (O1/O2) because visual stimuli primarily drive the visual system (we also analyzed parietal sites but found no significant hemispheric differences; Figure 4). We chose SSVEP power for clinical correlation because it was the primary phenomenon distinguishing USN from non-USN patients. In the revision, we will clarify these points and include software and toolbox information.

      We believe these revisions will substantially strengthen the manuscript and clarify the conceptual and methodological contributions of our study.

      References

      (1) Rosanova, M., Casali, A., Bellina, V., Resta, F., Mariotti, M., and Massimini, M. (2009). Natural frequencies of human corticothalamic circuits. J Neurosci 29, 7679-7685.

      (2) Okazaki, Y.O., Nakagawa, Y., Mizuno, Y., Hanakawa, T., and Kitajo, K. (2021). Frequency- and Area-Specific Phase Entrainment of Intrinsic Cortical Oscillations by Repetitive Transcranial Magnetic Stimulation. Front Hum Neurosci 15, 608947.

    1. Author response:

      The following is the authors’ response to the previous reviews

      Reviewer #1 (Public review):

      Summary:

      The manuscript characterizes a functional peptidergic system in the echinoderm Apostichopus japonicus that is related to the widely conserved family of calcitonin/diuretic hormone 31 (CT/DH31) peptides in bilaterian animals. In vitro analysis of receptor-ligand interactions, using multiple receptor activation assays, identifies three cognate receptors for two CT-like peptides in the sea cucumber, which stimulate cAMP, calcium, and ERK signaling. Only one of these receptors clusters within the family of calcitonin and calcitonin-like receptors (CTR/CLR) in bilaterian animals, whereas two other receptors cluster with invertebrate pigment dispersing factor receptors (PDFRs). In addition, this study sheds light on the expression and in vivo functions of CT-like peptides in A. japonicus, by quantitative real-time PCR, immunohistochemistry, pharmacological experiments on body wall muscle and intestine preparations, and peptide injection and RNAi knockdown experiments. This reveals a conserved function of CT-like peptides as muscle relaxants and growth regulators in A. japonicus.

      Strengths:

      This work combines both in vitro and in vivo functional assays to identify a CT-like peptidergic system in an economically relevant echinoderm species, the sea cucumber A. japonicus. A major strength of the study is that it identifies three G protein-coupled receptors for AjCT-like peptides, one related to the CTR/CLR family and two related to the PDFR family. A similar finding was previously reported for the CT-related peptide DH31 in Drosophila melanogaster that activates both CT-type and PDF-type receptors. Here, the authors expand this observation to a deuterostomian animal, which suggests that receptor promiscuity is a more general feature of the CT/DH31 peptide family and that CT/DH31-like peptides may activate both CT-type and PDF-type receptors in other animals as well.

      Besides the identification of receptor-ligand pairs, the downstream signaling pathways of AjCT receptors have been characterized, revealing broad and in some cases receptor-specific effects on cAMP, calcium, and ERK signaling.

      Functional characterization of the CT-related peptide system in heterologous cells is complemented with ex vivo and in vivo experiments. First, peptide injection and RNAi knockdown experiments establish transcriptional regulation of all three identified receptors in response to changing AjCT peptide levels. Second, ex vivo experiments reveal a conserved role for the two CT-like peptides as muscle relaxants, which have differential effects on body wall muscle and intestine preparations. Finally, peptide injection and knockdown experiments uncover a growth-promoting role for one CT-like peptide (AjCT2). Injection of AjCT2 at high concentration, or long-term knockdown of the AjCT precursor, affects diverse growth-related parameters including weight gain rate, specific growth rate, and transcript levels of growth-regulating transcription factors. The authors also reveal a growth-promoting function for the PDFR-like receptor AjPDFR2, suggesting that this receptor mediates the effects of AjCT2 on growth.

      Weaknesses:

      The authors present a more detailed phylogenetic analysis in the revised version, including a larger number of species. But some clusters in the analysis are not well supported because they have only low bootstrap values. This makes it difficult to interpret the clustering in some parts of the tree.

      Thank you for the reviewer’s comments. In response, we have produced a new phylogenetic analysis using the maximum likelihood method. This was done by Nayeli Escudero Castelán and Kite Jones in the Elphick group at QMUL and therefore they have been added as co-authors of this paper. The new phylogenetic tree (Figure 2, line 206) includes broad taxonomic sampling of CT-type receptors and PDF-type receptors. CRH-type receptors, which are also members of the secretin-type GPCR sub-family, have been included as an outgroup to root the tree. In the previous version the much more distantly related vasopressin/oxytocin-type receptors, which are rhodopsin-type GPCRs, were included as an outgroup. Furthermore, VIP-type receptors were also included in the previous tree but these have been omitted from the new tree because VIP receptor orthologs only occur in vertebrates and therefore they are not representative of a bilaterian GPCR family. The new tree shows high bootstrap support for key clades, notably achieving a bootstrap value of 100 for a clade comprising both deuterostomian and protostomian PDF receptors. This provides important evidence that the A. japonicus PDF-type receptors characterised in this study (AjPDFR1, AjPDFR2) are co-orthologs of the PDF-type receptor that has been characterised previously in Drosophila. Similarly, there is strong bootstrap support (100) for a clade comprising CT/DH31-type receptors and, importantly, the CT-type receptor characterised in this study (AjCTR) is positioned in a branch of this clade that comprises deuterostomian CT-type receptors (with bootstrap support of 100). Details of methods employed to produce the new receptor tree are included in lines 727-739. The new phylogenetic tree is shown below and has been incorporated into the revised manuscript (Figure 2, line 206). The description of the new phylogenetic tree has also been modified accordingly in the revised manuscript (lines 169-183).

      References:

      Bauknecht P, Jékely G. Large-Scale Combinatorial Deorphanization of Platynereis Neuropeptide GPCRs. Cell reports, 2015, 12(4), 684–693. doi: 10.1016/j.celrep.2015.06.052.

      Beets I, Zels S, Vandewyer E, Demeulemeester J, et al. System-wide mapping of peptide-GPCR interactions in C. elegans. Cell reports, 2023, 42(9), 113058. doi: 10.1016/j.celrep.2023.113058.

      Cardoso J C, Mc Shane J C, Li Z, et al. Revisiting the evolution of Family B1 GPCRs and ligands: Insights from mollusca. Molecular and cellular endocrinology, 2024, 586, 112192. doi: 10.1016/j.mce.2024.112192.

      Gorn A H, Lin H Y, Yamin M, et al. Cloning, characterization, and expression of a human calcitonin receptor from an ovarian carcinoma cell line. The Journal of clinical investigation, 1992, 90(5), 1726–1735. doi: 10.1172/JCI116046.

      Huang T, Su J, Wang X, et al. Functional Analysis and Tissue-Specific Expression of Calcitonin and CGRP with RAMP-Modulated Receptors CTR and CLR in Chickens. Animals: an open access journal from MDPI, 2024, 14(7), 1058. doi: 10.3390/ani14071058.

      Johnson E C, Shafer O T, Trigg J S, et al. A novel diuretic hormone receptor in Drosophila: evidence for conservation of CGRP signaling. Journal of Experimental Biology, 2005, 208(7): 1239-1246. doi: 10.1242/jeb.01529.

      McLatchie L M, Fraser N J, Main M J, et al. RAMPs regulate the transport and ligand specificity of the calcitonin-receptor-like receptor. Nature, 1998, 393(6683): 333-339. doi: 10.1038/30666.

      Schwartz J, Réalis-Doyelle E, Dubos M P, et al. Characterization of an evolutionarily conserved calcitonin signaling system in a lophotrochozoan, the Pacific oyster (Crassostrea gigas). Journal of Experimental Biology, 2019, 222(13): jeb201319. doi: 10.1242/jeb.201319.

      Sekiguchi T, Kuwasako K, Ogasawara M, et al. Evidence for conservation of the calcitonin superfamily and activity-regulating mechanisms in the basal chordate Branchiostoma floridae: insights into the molecular and functional evolution in chordates. Journal of Biological Chemistry, 2016, 291(5): 2345-2356. doi: 10.1074/jbc.M115.664003.

      Expression of CT-like peptides was investigated both at transcript and protein level, but insight into the expression of the three peptide receptors is limited. This makes it difficult to understand the mechanism underlying the (different) functions of the two CT-like peptides in vivo. The authors identify differences in signal transduction cascades activated by each peptide, which might underpin distinct functions, but these differences were established only in heterologous cells.

      We appreciate the reviewer's insightful comments. Regarding expression of CT-like peptide receptors, we have quantitatively analyzed the mRNA expression levels of the three receptors in key tissues using qRT-PCR (Figure 6, line 319), and receptor expression exhibits significant tissue-specific differences. Combined with the heterologous expression assays and in vivo functional validation, we believe our findings provide clear mechanistic insights into the functional divergence of the two CT-like peptides. Investigation of the expression of the three receptor proteins in A. japonicus would require generation of specific antibodies, which was beyond the scope of this study. Furthermore, immunohistochemical visualization of neuropeptide receptor expression in other invertebrates has not been reported widely, which likely reflects technical difficulties in generating antibodies that can specifically detect receptor proteins, which are typically expressed at a low level in comparison to the neuropeptides that act as their ligands.

      We acknowledge that investigating signal transduction cascades in heterologous cells (rather than native A. japonicus cells) is a limitation. However, as a non-model organism, A. japonicus currently lacks established cell lines for such research. Therefore, using heterologous cells was the most feasible approach to examine the differential signaling cascades activated by the peptides through the three receptors. Importantly, our in vivo experiments demonstrated that long-term knockdown of either the AjCT precursor or AjPDFR2 resulted in similar and significant growth defects. The phenotypic consistency strongly suggests that AjCT2 and AjPDFR2 function within the same signaling pathway, with AjPDFR2 serving as the key receptor functionally activated by AjCT2.

      The authors show overlapping phenotypes for a long-term knockdown of the AjCT precursor and the AjPDFR2 receptor, suggesting that the growth-regulating functions of AjCT2 are mediated by this receptor pathway. However, it remains unclear whether this mechanism underpins the growth-regulating function of AjCT2, until further in vivo evidence for this ligand-receptor interaction is presented. For example, the authors could investigate whether knockdown of AjPDFR2 attenuates the effects of AjCT2 peptide injection. In addition, a functional PDF system in this species remains uncharacterized, and a potential role of PDF-like peptides in growth regulation has not yet been investigated in A. japonicus. Therefore, it also remains unclear whether the ability of CT-like peptides to activate PDFRs is an evolutionary ancient property of this peptide family or whether this is an example of convergent evolution in some protostomian (Drosophila) and deuterostomian (sea cucumber) species.

      We thank the reviewer for these insightful comments and constructive questions. We acknowledge the request for more direct evidence to demonstrate how AjCT2 functions in vivo through AjPDFR2. However, long-term knockdown of the AjCT precursor and AjPDFR2 both resulted in identical and significant growth defect phenotypes. The high phenotypic consistency, combined with the activation effect of AjCT2 on AjPDFR2 in heterologous cells, strongly suggests that they function within the same signaling pathway, with AjPDFR2 serving as the key receptor functionally activated by AjCT2. While exogenous peptide injection combined with receptor knockdown is a classic method for verifying receptor activation, phenotypic overlap itself is widely accepted in genetic research as robust evidence for pathway association (Shafer and Taghert, 2009; Van Sinay et al., 2017). A. japonicus is a non-model organism with a 3-month aestivation period in summer followed shortly by winter hibernation. During these periods, we are unable to conduct in vivo experiments. Any single experimental suggestion from reviewers could potentially require one more year of research and we have already conducted an additional year of research, in response to reviewer feedback, since submitting the original manuscript. We hope therefore that the challenges associated with working with aquatic invertebrate non-model organisms are recognized by the reviewers.

      We fully agree that the functional PDF/PDFR system in A. japonicus and its potential role in growth regulation remain uncharacterized. Currently, the precursors of the PDF-type neuropeptide in echinoderms remain unidentified, which precludes clear pharmacological characterization of the two receptors. While further exploration of echinoderm PDF-type neuropeptides is still needed, our phylogenetic analysis, conducted using the maximum likelihood method with optimized parameters and rigorous sequence curation, demonstrates that the deuterostomian PDFRs (including AjPDFR1 and AjPDFR2) are positioned in a clade with the well-characterized protostomian PDFRs with extremely high bootstrap support (value = 100). Therefore, these two receptors in A. japonicus clearly belong to the PDF receptor family and our findings clearly indicate that the ability of CT-like peptides to activate PDFRs is either an evolutionarily ancient and conserved property or has arisen independently in different lineages. Details of the methods employed to produce the new receptor tree are included in lines 727-739. The new phylogenetic tree is shown below and has been incorporated into the revised manuscript (Figure 2, line 206). The description of the new phylogenetic tree has also been modified accordingly in the revised manuscript (lines 169-183).

      References:

      Shafer, O. T., & Taghert, P. H. (2009). RNA-interference knockdown of Drosophila pigment dispersing factor in neuronal subsets: the anatomical basis of a neuropeptide's circadian functions. PloS one, 4(12), e8298. doi: 10.1371/journal.pone.0008298.

      Van Sinay, E., Mirabeau, O., Depuydt, G., Van Hiel, M. B., Peymen, K., Watteyne, J., Zels, S., Schoofs, L., & Beets, I. (2017). Evolutionarily conserved TRH neuropeptide pathway regulates growth in Caenorhabditis elegans. Proceedings of the National Academy of Sciences of the United States of America, 114(20), E4065–E4074. doi: 10.1073/pnas.1617392114.

      Reviewer #2 (Public review):

      Summary:

      The authors show that A. japonicus calcitonins (AjCT1 and AjCT2) activate not only the calcitonin/calcitonin-like receptor, but they also activate the two "PDF receptors", ex vivo. They also explore secondary messenger pathways that are recruited following receptor activation. They determine the source of CT1 and CT2 using qPCR and in situ hybridization and finally test the effects of these peptides on tissue contractions, feeding and growth. This study provides solid evidence that CT1 and CT2 act as ligands for calcitonin receptors; however, evidence supporting cross-talk between CT peptides and "PDF receptors" is weak.

      Strengths:

      This is the first study to report pharmacological characterization of CT receptors in an echinoderm. Multiple lines of evidence in cell culture (receptor internalization and secondary messenger pathways) support this conclusion.

      Weaknesses:

      The authors claim that A. japonicus CTs activate "PDF" receptors and suggest that this cross-talk is evolutionarily ancient since a similar phenomenon also exists in the fly Drosophila melanogaster. These conclusions are not fully supported. The authors perform phylogenetic analysis to show that the two "PDF" receptors form an independent clade. The bootstrap support is quite low in a lot of instances, especially for the deuterostomian and protostomian PDFR clades, for which it is below 30. With such low support, it is unclear if the clade comprising deuterostomian "PDFR" is in fact PDFRs and not another receptor type whose endogenous ligand (besides CT) remains to be discovered.

      We thank the reviewer for these comments. In response, we have produced a new phylogenetic analysis using the maximum likelihood method. This was done by Nayeli Escudero Castelán and Kite Jones in the Elphick group at QMUL and therefore they have been added as co-authors of this paper. The new phylogenetic tree (Figure 2, line 206) includes broad taxonomic sampling of CT-type receptors and PDF-type receptors. CRH-type receptors, which are also members of the secretin-type GPCR sub-family, have been included as an outgroup to root the tree. In the previous version the much more distantly related vasopressin/oxytocin-type receptors, which are rhodopsin-type GPCRs, were included as an outgroup. Furthermore, VIP-type receptors were also included in the previous tree but these have been omitted from the new tree because VIP receptor orthologs only occur in vertebrates and therefore they are not representative of a bilaterian GPCR family. The new tree shows high bootstrap support for key clades, notably achieving a bootstrap value of 100 for a clade comprising both deuterostomian and protostomian PDF receptors. This provides important evidence that the A. japonicus PDF-type receptors characterized in this study (AjPDFR1, AjPDFR2) are co-orthologs of the PDF-type receptor that has been characterized previously in Drosophila. Similarly, there is strong bootstrap support (100) for a clade comprising CT/DH31-type receptors and, importantly, the CT-type receptor characterized in this study (AjCTR) is positioned in a branch of this clade that comprises deuterostomian CT-type receptors (with bootstrap support of 100). Details of the methods employed to produce the new receptor tree are included in lines 727-739. The new phylogenetic tree is shown below and has been incorporated into the revised manuscript (Figure 2, line 206). The description of the new phylogenetic tree has also been modified accordingly in the revised manuscript (lines 169-183).

      Reviewer #2 (Recommendations for the authors):

      Figure 1C: The bootstrap support is quite low in a lot of instances, especially for the deuterostomian and protostomian PDFR clades, for which it is below 30. With such support, I would be hesitant to label the blue clade as deuterostomian PDFR for two reasons: 1) no members of this clade have been shown to be activated by a PDF-like substance and 2) the current study shows that these receptors are activated by CT-type peptides. Therefore, the phylogenetic analyses do not support the conclusions of this paper. What is the basis for calling these receptors PDFR and not CTR in light of weak phylogenetic support?

      We thank the reviewer for these comments. In response, we have produced a new phylogenetic analysis using the maximum likelihood method. This was done by Nayeli Escudero Castelán and Kite Jones in the Elphick group at QMUL and therefore they have been added as co-authors of this paper. The new phylogenetic tree (Figure 2, line 206) includes broad taxonomic sampling of CT-type receptors and PDF-type receptors. CRH-type receptors, which are also members of the secretin-type GPCR sub-family, have been included as an outgroup to root the tree. In the previous version the much more distantly related vasopressin/oxytocin-type receptors, which are rhodopsin-type GPCRs, were included as an outgroup. Furthermore, VIP-type receptors were also included in the previous tree but these have been omitted from the new tree because VIP receptor orthologs only occur in vertebrates and therefore they are not representative of a bilaterian GPCR family. The new tree shows high bootstrap support for key clades, notably achieving a bootstrap value of 100 for a clade comprising both deuterostomian and protostomian PDF receptors. This provides important evidence that the A. japonicus PDF-type receptors characterized in this study (AjPDFR1, AjPDFR2) are co-orthologs of the PDF-type receptor that has been characterized previously in Drosophila. Similarly, there is strong bootstrap support (100) for a clade comprising CT/DH31-type receptors and, importantly, the CT-type receptor characterized in this study (AjCTR) is positioned in a branch of this clade that comprises deuterostomian CT-type receptors (with bootstrap support of 100). Details of the methods employed to produce the new receptor tree are included in lines 727-739. The new phylogenetic tree is shown below and has been incorporated into the revised manuscript (Figure 2, line 206). The description of the new phylogenetic tree has also been modified accordingly in the revised manuscript (lines 169-183).

      We agree with the reviewer that no members of the PDF-type receptor clade in deuterostomes have yet been shown to be activated by a PDF-like substance. That is because the precursors of the PDF-type neuropeptides in echinoderms remain unidentified so far, which precludes clear pharmacological characterization of these receptors within the deuterostomian PDFR clade. However, the new phylogenetic tree now provides strong support (bootstrap value = 100) for the clade comprising deuterostomian and protostomian PDFRs, confirming the classification of AjPDFR1 and AjPDFR2 as PDF-type receptors. 

      The new results following AjCT and AjPDFR2 knockdown are a welcome addition. While this additional evidence supports the claim that AjCT could mediate its effects via AjPDFR2, this evidence does not show that AjCT acts as an endogenous ligand for PDFR in vivo. In combination with the weak phylogenetic analyses, I would recommend that the authors tone down their claims that they have functionally characterized a PDFR (in the title and text).

      Thank you for your insightful comments and we do understand the reviewer’s concern. 

      Regarding “the weak phylogenetic analyses”, as highlighted above, we have produced a new phylogenetic tree (Figure 2, line 206) that provides strong bootstrap support for the clade comprising deuterostome and protostome PDF-type receptors. For this reason, it is our opinion that inclusion of “pigment-dispersing factor-type receptors” in the title of the paper is appropriate. Details of the phylogenetic analysis method have been added in lines 727-739, and the updated phylogenetic tree has been incorporated into the revised manuscript (Figure 2, line 206). The description of the new phylogenetic tree has also been modified accordingly in the revised manuscript (lines 169-183). In addition, long-term knockdown of the AjCT precursor and AjPDFR2 both resulted in identical and significant growth defect phenotypes, and phenotypic overlap is widely accepted in genetic research as strong evidence for pathway association (Shafer and Taghert, 2009; Van Sinay et al., 2017). This high degree of phenotypic consistency, coupled with our in vitro finding that AjCT2 specifically activates AjPDFR2, strongly supports the conclusion that AjCT2 and AjPDFR2 function within the same signaling pathway in vivo, with AjPDFR2 serving as the key receptor functionally activated by AjCT2.

      References:

      Shafer, O. T., & Taghert, P. H. (2009). RNA-interference knockdown of Drosophila pigment dispersing factor in neuronal subsets: the anatomical basis of a neuropeptide's circadian functions. PloS one, 4(12), e8298. doi: 10.1371/journal.pone.0008298.

      Van Sinay, E., Mirabeau, O., Depuydt, G., Van Hiel, M. B., Peymen, K., Watteyne, J., Zels, S., Schoofs, L., & Beets, I. (2017). Evolutionarily conserved TRH neuropeptide pathway regulates growth in Caenorhabditis elegans. Proceedings of the National Academy of Sciences of the United States of America, 114(20), E4065–E4074. doi: 10.1073/pnas.1617392114.

      Since there is no formal logic defining the use of "type" vs "like" vs "related", I would encourage the authors to use one term (of their choice) to avoid unnecessary confusion. Or another possibility is that these relationships are defined at some point in the manuscript so that it becomes clear to the reader.

      We thank the reviewer for these comments. The term “CT-related peptides” is defined in the Introduction (lines 54-58). As per your suggestion, we have now defined both “CT-type peptides” and “CT-like peptides” in the Introduction (lines 76-79). “CT-type peptides” are characterized by an N-terminal disulphide bridge, whereas “CT-like peptides” (diuretic hormone 31 (DH31)-type peptides) lack this feature. Additionally, in accordance with these definitions, we have corrected three descriptions in the revised manuscript (lines 80, 83 and 88, for “CT-type peptides”) to ensure consistent and accurate usage of these terms.

      "To provide in vivo evidence supporting CT-mediated activation of "PDF" receptors, we conducted the following experiments: Firstly, we confirmed that AjPDFR1 and AjPDFR2 were the functional receptors of AjCT1 and AjCT2 (Figure 2, 3 and 4). Secondly, injection of AjCT2 and siAjCTP1/2-1 in vivo induced corresponding changes in AjPDFR1 and AjPDFR2 expression levels in the intestine (Figure 8C, 9A, 9B and 9C)."

      None of these experiments provide direct evidence that CT activates PDFR in vivo. The functional studies are indeed a welcome addition but they cannot discriminate between correlation and causation.

      We thank the reviewer for these insightful comments. We agree that the functional studies do not constitute direct proof of CT’s activation of PDFR in vivo. However, we observed identical and significant growth defect phenotypes following long-term knockdown of the AjCT precursor and AjPDFR2. This high degree of phenotypic congruence, combined with the established in vitro activation of AjPDFR2 by AjCT2, provides strong support for the conclusion that AjCT2 acts as the key endogenous ligand activating the AjPDFR2 signaling pathway in vivo. Importantly, such phenotypic overlap has been widely accepted in genetic research as strong evidence for functional pathway association (Shafer and Taghert, 2009; Van Sinay et al., 2017).

      References:

      Shafer, O. T., & Taghert, P. H. (2009). RNA-interference knockdown of Drosophila pigment dispersing factor in neuronal subsets: the anatomical basis of a neuropeptide's circadian functions. PloS one, 4(12), e8298. doi: 10.1371/journal.pone.0008298.

      Van Sinay, E., Mirabeau, O., Depuydt, G., Van Hiel, M. B., Peymen, K., Watteyne, J., Zels, S., Schoofs, L., & Beets, I. (2017). Evolutionarily conserved TRH neuropeptide pathway regulates growth in Caenorhabditis elegans. Proceedings of the National Academy of Sciences of the United States of America, 114(20), E4065–E4074. doi: 10.1073/pnas.1617392114.

    1. Author response:

      Reviewer #1 (Public review):

      Summary:

      This preprint from Shaowei Zhao and colleagues presents results that suggest tumorous germline stem cells (GSCs) in the Drosophila ovary mimic the ovarian stem cell niche and inhibit the differentiation of neighboring non-mutant GSC-like cells. The authors use FRT-mediated clonal analysis driven by a germline-specific gene (nos-Gal4, UASp-flp) to induce GSC-like cells mutant for bam or bam's cofactor bgcn. Bam-mutant or bgcn-mutant germ cells produce tumors in the stem cell compartment (the germarium) of the ovary (Figure 1). These tumors contain non-mutant cells, termed SGCs for single-germ cells. 75% of SGCs do not exhibit signs of differentiation (as assessed by bamP-GFP) (Figure 2). The authors demonstrate that the block in differentiation in SGCs is a result of suppression of bam expression (Figure 2). They present data suggesting that in 73% of SGCs, BMP signaling is low (assessed by dad-lacZ) (Figure 3) and proliferation is lower in SGCs than in GSCs. They present genetic evidence that mutations in BMP pathway receptors and transcription factors suppress some of the non-autonomous effects exhibited by SGCs within bam-mutant tumors (Figure 4). They show data that bam-mutant cells secrete Dpp, but these data are not compelling (see below) (Figure 5). They provide genetic data that loss of BMP ligands (dpp and gbb) suppresses the appearance of SGCs in bam-mutant tumors (Figure 6). Taken together, their data support a model in which bam-mutant GSC-like cells produce BMPs that act on non-mutant cells (i.e., SGCs) to prevent their differentiation, similar to what is seen in the ovarian stem cell niche.

      Strengths:

      (1) Use of an excellent and established model for tumorous cells in a stem cell microenvironment.

      (2) Powerful genetics allow them to test various factors in the tumorous vs nontumorous cells.

      (3) Appropriate use of quantification and statistics.

      We greatly appreciate these comments.

      Weaknesses:

      (1) What is the frequency of SGCs in nos>flp; bam-mutant tumors? For example, are they seen in every germarium, or in some germaria, etc, or in a few germaria?

      This is a great question. Because the SGC phenotype depends on the presence of germline tumor clones, our quantification was restricted to germaria that contained them. These quantification data ("SGCs and/or germline cysts per germarium with germline clones") will be presented in the revised Figure 1.

      (2) Does the breakdown in clonality vary when they induce hs-flp clones in adults as opposed to in larvae/pupae?

      Our initial attempts to induce ovarian hs-flp germline clones by heat-shocking adult flies were unsuccessful, with very few clones being observed. Therefore, we shifted our approach to an earlier developmental stage. Successful induction was achieved by subjecting late-L3/early-pupal animals to a twice-daily heat shock at 37°C for 6 consecutive days (2 hours per session with a 6-hour interval, see Lines 325-329) (Zhao et al., 2018).

      (3) Approximately 20-25% of SGCs are bam+, dad-LacZ+. Firstly, how do the authors explain this? Secondly, of the 70-75% of SGCs that have no/low BMP signaling, the authors should perform additional characterization using markers that are expressed in GSCs (i.e., Sex lethal and nanos).

      These 20-25% of SGCs are bamP-GFP<sup>+</sup> dad-lacZ<sup>-</sup>, not bam<sup>+</sup> dad-lacZ<sup>+</sup> (see Figure 2C and 3D). They would be cystoblast-like cells that may have initiated a differentiation program toward forming germline cysts (see Lines 109-117). The 70-75% of SGCs that have low BMP signaling exhibit GSC-like properties, including: 1) dot-like spectrosomes; 2) dad-lacZ positivity; 3) absence of bamP-GFP expression. While additional markers would be beneficial, we think that this combination of properties is sufficient to classify these cells as GSC-like.

      (4) All experiments except Figure 1I (which shows a single germarium with no quantification) were performed with nos-Gal4, UASp-flp. Have the authors performed any of the phenotypic characterizations (i.e., figures other than Figure 1) with hs-flp?

      Yes, we initially identified the SGC phenotype through hs-flp-mediated mosaic analysis of bam or bgcn mutants in ovaries. However, as noted in our response to Weakness (2), this approach was very labor-intensive. Therefore, we switched to using the more convenient nos::flp system for subsequent experiments. In our observations, there was no difference in the SGC phenotype between these two approaches, confirming that the nos::flp system is a valid and more practical alternative for studying it.

      (5) Does the number of SGCs change with the age of the female? The experiments were all performed in 14-day-old adult females. What happens when they look at a young female (like 2-day-old). I assume that the nos>flp is working in larval and pupal stages, and so the phenotype should be present in young females. Why did the authors choose this later age? For example, is the phenotype more robust in older females? Or do you see more SGCs at later time points?

      These are very good questions. Such time-course analysis data will be provided in revised Figure 1. The SGC phenotype depends on the presence of bam or bgcn mutant germline clones. Germaria from 14-day-old flies contained bigger and more such clones than those from younger flies. This age-dependent increase in clone size and frequency significantly enhanced the efficiency of our quantification (see Lines 129-131). 

      (6) Can the authors distinguish one copy of GFP versus 2 copies of GFP in germ cells of the ovary? This is not possible in the Drosophila testis. I ask because this could impact the clonal analyses diagrammed in Figure 4A and 4G and in 6A and B. Additionally, in most of the figures, the GFP is saturated, so it is not possible to discern one vs two copies of GFP.

      We greatly appreciate this comment. It was also difficult for us to distinguish 1 and 2 copies of GFP in the Drosophila ovary. In Figure 4A-F, to resolve this problem, we used a triple-color system, in which red germ cells (RFP<sup>+/+</sup> GFP<sup>-/-</sup>) are bam mutant, yellow germ cells (RFP<sup>+/-</sup> GFP<sup>+/-</sup>) are wild-type, and green germ cells (RFP<sup>-/-</sup> GFP<sup>+/+</sup>) are punt or med mutant. In Figure 4G-J, we quantified the SGC phenotype only in black germ cells (GFP<sup>-/-</sup>), which are wild-type (control) or mad mutant. In Figure 6, we quantified the SGC phenotype only in green germ cells (both GFP<sup>+/+</sup> and GFP<sup>+/-</sup>), all of which are wild-type.

      (7) More evidence is needed to support the claim of elevated Dpp levels in bam or bgcn mutant tumors. The current results with the dpp-lacZ enhancer trap in Figure 5A, B are not convincing. First, why is the dpp-lacZ so much brighter in the mosaic analysis (A) than in the no-clone analysis (B)? It is expected that the level of dpp-lacZ in cap cells should be invariant between ovaries, and yet LacZ is very faint in Figure 5B. I think that if the settings in A matched those in B, the apparent expression of dpp-lacZ in the tumor would be much lower and likely not statistically significant. Second, they should use RNA in situ hybridization with a sensitive technique like hybridization chain reactions (HCR) - an approach that has worked well in numerous Drosophila tissues, including the ovary.

      We appreciate this critical comment. The settings of immunofluorescent staining and confocal parameters in Figure 5A were the same as those in 5B. In our observations, the level of dpp-lacZ in cap cells was variable across germaria, even within the same ovary, as quantified in Figure 5C. We will provide RNA in situ hybridization data to further strengthen the conclusion that bam or bgcn mutant germline tumors secrete BMP ligands.

      (8) In Figure 6, the authors report results obtained with the bamBG allele. Do they obtain similar data with another bam allele (i.e., bamdelta86)?

      No. Given that bam<sup>BG</sup> was functionally indistinguishable from bam<sup>Δ86</sup> in inducing the SGC phenotype (compare Figure 6F, I with Figure 6-figure supplement 3C), we believe that repeating these experiments with bam<sup>Δ86</sup> would be redundant and would not alter the key conclusion of our study. Thanks for the understanding!

      Reviewer #2 (Public review):

      While the study by Zhang et al. provides valuable insights into how germline tumors can non-autonomously suppress the differentiation of neighboring wild-type germline stem cells (GSCs), several conceptual and technical issues limit the strength of the conclusions.

      Major points:

      (1) Naming of SGCs is confusing. In line 68, the authors state that "many wild-type germ cells located outside the niche retained a GSC-like single-germ-cell (SGC) morphology." However, bam or bgcn mutant GSCs are also referred to as "SGCs," which creates confusion when reading the text and interpreting the figures. The authors should clarify the terminology used to distinguish between wild-type SGCs and tumor (bam/bgcn mutant) SGCs, and apply consistent naming throughout the manuscript and figure legends.

      We apologize for any confusion. In our manuscript, the term "SGC" is reserved specifically for wild-type germ cells that maintain a GSC-like morphology outside the niche. bam or bgcn mutant germ cells are referred to as GSC-like tumor cells (Lines 87-88), not SGCs.

      (a) The same confusion appears in Figure 2. It is unclear whether the analyzed SGCs are wild-type or bam mutant cells. If the SGCs analyzed are Bam mutants, then the lack of Bam expression and failure to differentiate would be expected and not informative. However, if the SGCs are wild-type GSCs located outside the niche, then the observation would suggest that Bam expression is silenced in these wild-type cells, which is a significant finding. The authors should clarify the genotype of the SGCs analyzed in Figure 2C, as this information is not currently provided.

      The SGCs analyzed in Figure 2A-C are wild-type, GSC-like cells located outside the niche. They were generated using the same genetic strategy depicted in Figures 1C and 1E (with the schematic in Figure 1B). The complete genotypes for all experiments are available in Source data 1. 

      (b) In Figures 4B and 4E, the analysis of SGC composition is confusing. In the control germaria (bam mutant mosaic), the authors label GFP⁺ SGCs as "wild-type," which makes interpretation unclear. Note, this is completely different from their earlier definition shown in line 68.

      The strategy to generate SGCs in Figure 4B-F (with the schematic in Figure 4A) is completely different from that in Figure 1C-F, H, and I (with the schematic in Figure 1B). In Figure 4B-F, we needed to distinguish punt<sup>-/-</sup> (or med<sup>-/-</sup>) from punt<sup>+/-</sup> (or med<sup>+/-</sup>) germ cells. As noted in our response to Reviewer #1’s Weakness (6), it was difficult for us to distinguish 1 and 2 copies of GFP in the Drosophila ovary. Therefore, we chose to use the triple-color system to distinguish these germ cells in Figure 4B-F (see genotypes in Source data 1).

      (c) Additionally, bam⁺/⁻ GSCs (the first bar in Figure 4E) should appear GFP⁺ and Red⁺ (i.e., yellow). It would be helpful if the authors could indicate these bam⁺/⁻ germ cells directly in the image and clarify the corresponding color representation in the main text. In Figure 2A, although a color code is shown, the legend does not explain it clearly, nor does it specify the identity of bam⁺/⁻ cells alone. Figure 4F has the same issue, and in this graph, the color does not match Figure 4A.

      The color-to-genotype relationships for the schematics in Figures 2A and 4E are provided in Figures 1B and 4A, respectively. Due to the high density of germ cells, it is impractical to label each genotype directly in the images. In contrast to Figure 4E, the colors in Figure 4F do not represent genotypes; instead, blue denotes the percentage of SGCs, and red denotes the percentage of germline cysts, as indicated below the bar chart. 

      (2) The frequencies of bam or bgcn mutant mosaic germaria carrying [wild-type] SGCs or wild-type germ cell cysts with branched fusomes, as well as the average number of wild-type SGCs per germarium and the number of days after heat shock for the representative images, are not provided when Figure 1 is first introduced. Since this is the first time the authors describe these phenotypes, including these details is essential. Without this information, it is difficult for readers to follow and evaluate the presented observations.

      Thanks for this constructive suggestion. We will include such quantification data in the revised manuscript.

      (3) Without the information mentioned in point 2, it causes problems when reading through the section regarding [wild-type] SGCs induced by impairment of differentiation or dedifferentiation. In lines 90-97, the authors use the presence of midbodies between cystocytes as a criterion to determine whether the wild-type GSCs surrounded by tumor GSCs arise through dedifferentiation. However, the cited study (Mathieu et al., 2022) reports that midbodies can be detected between two germ cells within a cyst carrying a branched fusome upon USP8 loss.

      Unlike wild-type cystocytes, which undergo incomplete cytokinesis and lack midbodies, those with USP8 loss undergo complete cell division, with the presence of midbodies (white arrow, Figure 1F’ from Mathieu et al., 2022) as a marker of the late cytokinesis stage (Mathieu et al., 2022). 

      (a) Are wild-type germ cell cysts with branched fusomes present in the bam mutant mosaic germaria? What is the proportion of germaria containing wild-type SGCs versus those containing wild-type germ cell cysts with branched fusomes?

      (b) If all bam mutant mosaic germaria carry only wild-type GSCs outside the niche and no germaria contain wild-type germ cell cysts with branched fusomes, then examining midbodies as an indicator of dedifferentiation may not be appropriate.

      We greatly appreciate this critical comment. bam mutant mosaic germaria indeed contained wild-type germline cysts, as evidenced by an SGC frequency of ~70%, rather than 100% (see Figures 2H, 4F, 4J, 6F, 6I, and Figure 6-figure supplement 3C). Since the SGC phenotype depends on the presence of bam or bgcn mutant germline tumors, we quantified it as “the percentage of SGCs relative to the total number of SGCs and germline cysts that are surrounded by germline tumors” (see Lines 124-129). Quantifying the SGC phenotype as "the percentage of germaria with SGCs" would be imprecise. This is because the presence and number of SGCs were highly variable among germaria with bam mutant germline clones, and a small number of germaria entirely lacked these clones. We will provide the data of "SGCs and/or germline cysts per germarium with germline clones" in revised Figure 1.

      (c) If, however, some germaria do contain wild-type germ cell cysts with branched fusomes, the authors should provide representative images and quantify their proportion.

      Such representative germaria are shown in Figure 2G, 3B, 3C, 6D, 6E, and 6H. The percentage of germline cysts can be calculated by “100% - SGC%”.
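
      The quantification described in this response reduces to simple arithmetic: the SGC percentage is the number of SGCs divided by the total number of SGCs plus germline cysts surrounded by germline tumors, and the cyst percentage is its complement. A minimal Python sketch is given below; the function names and the example counts are hypothetical illustrations and are not taken from the manuscript or its source data.

```python
def sgc_percentage(n_sgcs: int, n_cysts: int) -> float:
    """Percentage of SGCs among all scored wild-type units.

    n_sgcs  -- number of single germ cells (SGCs) surrounded by tumor cells
    n_cysts -- number of germline cysts surrounded by tumor cells
    Both counts are hypothetical inputs used only for illustration.
    """
    total = n_sgcs + n_cysts
    if total == 0:
        raise ValueError("no SGCs or germline cysts were scored")
    return 100.0 * n_sgcs / total


def cyst_percentage(n_sgcs: int, n_cysts: int) -> float:
    """Percentage of germline cysts, computed as 100% - SGC%."""
    return 100.0 - sgc_percentage(n_sgcs, n_cysts)


if __name__ == "__main__":
    # Hypothetical counts: 70 SGCs and 30 cysts scored in total.
    print(sgc_percentage(70, 30))
    print(cyst_percentage(70, 30))
```

      Because the denominator counts scored cells and cysts rather than germaria, the value is unaffected by germaria that contain no mutant clones.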

      (d) In line 95, although the authors state that 50 germ cell cysts were analyzed for the presence of midbodies, it would be more informative to specify how many germaria these cysts were derived from and how many biological replicates were examined.

      As noted in our response to points a) and b) above, the germ cells surrounded by germline tumors, rather than germarial numbers, are more precise for analyzing the phenotype. For this experiment, we examined >50 such germline cysts via confocal microscopy. As the analysis was performed on a defined cellular population, this sample size should be sufficient to support our conclusion. 

      (4) Note that both bam mutant GSCs and wild-type SGCs can undergo division to generate midbodies (double cells), as shown in Figure 4H. Therefore, the current description of the midbody analysis is confusing. The authors should clarify which cell types were examined and explain how midbodies were interpreted in distinguishing between cell division and differentiation.

      We assayed for the presence or absence of midbodies specifically within the germline cysts surrounded by bam mutant tumors, not within the tumors themselves (Lines 94-95). As detailed in Lines 88-97, the absence of midbodies was used as a key criterion to exclude the possibility of dedifferentiation.

      (5) The data in Figure 5 showing Dpp expression in bam mutant tumorous GSCs are not convincing. The Dpp-lacZ signal appears broadly distributed throughout the germarium, including in escort cells. To support the claim more clearly, the authors should present corresponding images for Figures 5D and 5E, in which dpp expression was knocked down in the germ cells of bam or bgcn mutant mosaic germaria. Showing these images would help clarify the localization and specificity of Dpp-lacZ expression relative to the tumorous GSCs.

      We greatly appreciate this comment. RNA in situ hybridization data will be provided to further strengthen the conclusion that bam or bgcn mutant germline tumors secrete BMP ligands.

      (6) While Figure 6 provides genetic evidence that bam mutant tumorous GSCs produce Dpp to inhibit the differentiation of wild-type SGCs, it should be noted that these analyses were performed in a dpp⁺/⁻ background. To strengthen the conclusion, the authors should include appropriate controls showing [dpp⁺/⁻; bam⁺/⁻] SGCs and [dpp⁺/⁻; bam⁺/⁻] germ cell cysts without heat shock (as referenced in Figures 6F and 6I).

      Schematic cartoons in Figure 6A and 6B demonstrate that these analyses were performed in a dpp<sup>+/-</sup> background. Figure 6-figure supplement 1 indicates that dpp<sup>+/-</sup> or gbb<sup>+/-</sup> does not affect GSC maintenance, germ cell differentiation, or female fly fertility. Figure 6C is the control for 6D and 6E, and 6G is the control for 6H, with quantification in 6F and 6I. We used nos::flp, not the heat shock method, to induce germline clones in these experiments (see genotypes in Source data 1).

      (7) Previous studies have reported that bam mutant germ cells cause blunted escort cell protrusions (e.g., Kirilly et al., Development, 2011), which are known to contribute to germ cell differentiation (e.g., Chen et al., Frontiers in Cell and Developmental Biology, 2022). The authors should include these findings in the Discussion to provide a broader context and to acknowledge how alterations in escort cell morphology may further influence differentiation defects in their model.

      Thanks for teaching us! Such discussion will be included in the revised manuscript.

      (8) Since fusome morphology is an important readout of SGCs vs differentiation, all the clonal analyses should have fusome staining.

      SGCs are readily distinguishable from multicellular germline cysts based on morphology. In some clonal analysis experiments, fusome staining was not feasible due to technical limitations such as channel saturation or antibody incompatibility. Thanks for the understanding!

      (9) Figure arrangement. It is somewhat difficult to identify the figure panels cited in the text due to the current panel arrangement.

      The figure panels were arranged to optimize space while ensuring that related panels are grouped in close proximity for logical comparison. We would be happy to consider any specific suggestions for an alternative layout that could improve clarity. Thanks!

      (10) The number of biological replicates and germaria analyzed should be clearly stated somewhere in the manuscript, ideally in the Methods section or figure legends. Providing this information is essential for assessing data reliability and reproducibility.

      Thanks for this constructive suggestion. Such information will be included in figure legends in the revised manuscript.

      Reviewer #3 (Public review):

      Summary:

      Zhang et al. investigated how germline tumors influence the development of neighboring wild-type (WT) germline stem cells (GSC) in the Drosophila ovary. They report that germline tumors inhibit the differentiation of neighboring WT GSCs by arresting them in an undifferentiated state, resulting from reduced expression of the differentiation-promoting factor Bam. They find that these tumor cells produce low levels of the niche-associated signaling molecules Dpp and Gbb, which suppress bam expression and consequently inhibit the differentiation of neighboring WT GSCs non-cell-autonomously. Based on these findings, the authors propose that germline tumors mimic the niche to suppress the differentiation of the neighboring stem cells.

      Strengths:

      This study addresses an important biological question concerning the interaction between germline tumor cells and WT germline stem cells in the Drosophila ovary. If the findings are substantiated, they could provide valuable insights applicable to other stem cell systems.

      We greatly appreciate these comments.

      Weaknesses:

      Previous work from Xie's lab demonstrated that bam and bgcn mutant GSCs can outcompete WT GSCs for niche occupancy. Furthermore, a large body of literature has established that the interactions between escort cells (ECs) and GSC daughters are essential for proper and timely germline differentiation (the differentiation niche). Disruption of these interactions leads to arrest of germline cell differentiation in a state with weak BMP signaling activation and low bam expression, a phenotype virtually identical to what is reported here. Thus, it remains unclear whether the observed phenotype reflects "direct inhibition by tumor cells" or "arrested differentiation due to the loss of the differentiation niche". Because most data were collected at a very late stage (more than 10 days after clonal induction), when tumor cells already dominate the germarium, this question cannot be resolved. To distinguish between these two possibilities, the authors could conduct a time-course analysis to examine the onset of the WT GSC-like single-germ-cell (SGC) phenotype and determine whether early-stage tumor clones containing only a few tumor cells can suppress the differentiation of neighboring WT GSCs. If tumor cells indeed produce Dpp and Gbb (as proposed here) to inhibit the differentiation of neighboring germline cells, a small cluster or probably even a single tumor cell generated at an early stage might prevent the differentiation of their neighboring germ cells.

      Thanks for this critical comment. Such time-course analysis data will be provided in revised Figure 1.

      The key evidence supporting the claim that tumor cells produce Dpp and Gbb comes from Figures 5 and 6, which suggest that tumor-derived dpp and gbb are required for this inhibition. However, interpretation of these data requires caution. In Figure 5, the authors use dpp-lacZ to support the claim that dpp is upregulated in tumor cells (Figure 5A and 5B). However, the background expression in somatic cells (ECs and pre-follicular cells) differs noticeably between these panels. In Figure 5A, dpp-lacZ expression in somatic cells is clearly higher than in 5B, and the expression level in tumor cells appears comparable to that in somatic cells (dpp-lacZ single channel). Similarly, in Figure 5B, dpp-lacZ expression in germline cells is also comparable to that in somatic cells. Providing clear evidence of upregulated dpp and gbb expression in tumor cells (for example, through single-molecule RNA in situ) would be essential.

      We greatly appreciate this critical comment. In our data, the expression of dpp-lacZ in cap cells was variable across germaria, even within the same ovary, as quantified in Figure 5C. The images in Figures 5A and 5B were selected as representative examples of positive signaling. To directly address the reviewer's point and strengthen our conclusion, we will include RNA in situ hybridization data in the revised manuscript to visualize the expression of BMP ligands within the bam or bgcn mutant germline tumor cells.

      Most tumor data present in this study were collected from the bam[86] null allele, whereas the data in Figure 6 were derived from a weaker bam[BG] allele. This bam[BG] allele is not molecularly defined and shows some genetic interaction with dpp mutants. As shown in Figure 6E, removal of dpp from homozygous bam[BG] mutant leads to germline differentiation (evidenced by a branched fusome connecting several cystocytes, located at the right side of the white arrowhead). In Figure 6D, fusome is likely present in some GFP-negative bam[BG]/bam[BG] cells. To strengthen their claim that the tumor produces Dpp and Gbb to inhibit WT germline cell differentiation, the authors should repeat these experiments using the bam[86] null allele.

      Although a structure resembling a "branched fusome" is visible in Figure 6E (right of the white arrowhead), it is an artifact resulting from the cytoplasm of GFP-positive follicle cells, which also stain for α-Spectrin, projecting between germ cells of different clones (see the merged image). In both our previous (Zhang et al., 2023) and current studies, bam<sup>BG</sup> was functionally indistinguishable from bam<sup>Δ86</sup> in its ability to block GSC differentiation and induce the SGC phenotype (compare Figure 6F, I with Figure 6-figure supplement 3C). Given this, we believe that repeating the extensive experiments in Figure 6 with the bam<sup>Δ86</sup> allele would be scientifically redundant and would not change the key conclusion of our study. We thank the reviewer for their consideration.

      It is well established that the stem niche provides multiple functional supports for maintaining resident stem cells, including physical anchorage and signaling regulation. In Drosophila, several signaling molecules produced by the niche have been identified, each with a distinct function - some promoting stemness, while others regulate differentiation. Expression of Dpp and Gbb alone does not substantiate the claim that these tumor cells have acquired the niche-like property. To support their assertion that these tumors mimic the niche, the authors should provide additional evidence showing that these tumor cells also express other niche-associated markers. Alternatively, they could revise the manuscript title to more accurately reflect their findings.

      Dpp and Gbb are the key niche signals from cap cells for maintaining GSC stemness. Our work demonstrates that germline tumors can specifically mimic this signaling function, not the full suite of cap cell properties, to create a non-cell-autonomous differentiation block. The current title “Tumors mimic the niche to inhibit neighboring stem cell differentiation” reflects this precise concept: a partial, functional mimicry of the niche's most relevant activity in this context. We feel it is an appropriate and compelling summary of our main conclusion.

      In the Method section, the authors need to provide details on how dpp-lacZ expression levels were quantified and normalized.

      Thanks for this suggestion. Such information will be included in the revised manuscript.

    1. Author response:

      The following is the authors’ response to the original reviews.

      Reviewer #1 (Public review): 

      Summary: 

      The authors report the structure of the human CTF18-RFC complex bound to PCNA. Similar structures (and more) have been reported by the O'Donnell and Li labs. This study should add to our understanding of CTF18-RFC in DNA replication and clamp loaders in general. However, there are numerous major issues that I recommend the authors fix. 

      Strengths: 

      The structures reported are strong and useful for comparison with other clamp loader structures that have been reported lately. 

      Weaknesses: 

      The structures don't show how CTF18-RFC opens or loads PCNA. There are recent structures from other groups that do examine these steps in more detail, although this does not really dampen this reviewer's enthusiasm. It does mean that the authors should spend their time investigating aspects of CTF18-RFC function that were overlooked or not explored in detail in the competing papers. The paper poorly describes the interactions of CTF18-RFC with PCNA and the ATPase active sites, which are the main interest points. The nomenclature choices made by the authors make the manuscript very difficult to read. 

      Reviewer #2 (Public review): 

      Summary 

      Briola and co-authors have performed a structural analysis of the human CTF18 clamp loader bound to PCNA. The authors purified the complexes and formed a complex in solution. They used cryo-EM to determine the structure to high resolution. The complex assumed an auto-inhibited conformation, where DNA binding is blocked, which is of regulatory importance and suggests that additional factors could be required to support PCNA loading on DNA. The authors carefully analysed the structure and compared it to RFC and related structures. 

      Strength & Weakness 

      Their overall analysis is of high quality, and they identified, among other things, a human-specific beta-hairpin in Ctf18 that flexibly tethers Ctf18 to Rfc2-5. Indeed, deletion of the beta-hairpin resulted in reduced complex stability and a reduction in a primer extension assay with Pol ε. This is potentially very interesting, although some more work is needed on the quantification. Moreover, the authors argue that the Ctf18 ATP-binding domain assumes a more flexible organisation, but their visual representation could be improved. 

      The data are discussed accurately and relevantly, which provides an important framework for rationalising the results. 

      All in all, this is a high-quality manuscript that identifies a key intermediate in CTF18-dependent clamp loading.

      Reviewer #3 (Public review): 

      Summary: 

      CTF18-RFC is an alternative eukaryotic PCNA sliding clamp loader that is thought to specialize in loading PCNA on the leading strand. Eukaryotic clamp loaders (RFC complexes) have an interchangeable large subunit that is responsible for their specialized functions. The authors show that the CTF18 large subunit has several features responsible for its weaker PCNA loading activity and that the resulting weakened stability of the complex is compensated by a novel beta hairpin backside hook. The authors show this hook is required for the optimal stability and activity of the complex. 

      Relevance: 

      The structural findings are important for understanding RFC enzymology and novel ways that the widespread class of AAA ATPases can be adapted to specialized functions. A better understanding of CTF18-RFC function will also provide clarity into aspects of DNA replication, cohesion establishment, and the DNA damage response. 

      Strengths: 

      The cryo-EM structures are of high quality enabling accurate modelling of the complex and providing a strong basis for analyzing differences and similarities with other RFC complexes. 

      Weaknesses: 

      The manuscript would have benefitted from more detailed biochemical analysis to tease apart the differences with the canonical RFC complex. 

      I'm not aware of using Mg depletion to trap active states of AAA ATPases. Perhaps the authors could provide a reference to successful examples of this and explain why they chose not to use the more standard practice in the field of using ATP analogues to increase the lifespan of reaction intermediates. 

      Overall appraisal: 

      Overall the work presented here is solid and important. The data is sufficient to support the stated conclusions and so I do not suggest any additional experiments. 

      Reviewer #1 (Recommendations for the authors): 

      We thank the reviewer for their positive comments and for their thorough review. All raised points have been addressed below.

      Major points 

      (1) The nomenclature used in the paper is very confusing and sometimes incorrect. The authors refer to CTF18 protein as "Ctf18", and the entire CTF18-RFC complex as "CTF18". This results in massive confusion because it is hard to ascertain whether the authors are discussing the individual subunits or the entire complex. Because these are human proteins, each protein name should be fully capitalized (i.e. CTF18, RFC4 etc). The full complex should be referred to more clearly with the designation CTF18-RFC or CTF18-RLC (RFC-like complex). Also, because the yeast and human clamp loader complexes use the same nomenclature for different subunits, it would be best for the authors to use the "A, B, C, D, E subunit" nomenclature that has been standard in the field for the past 20 years. Finally, the authors try to distinguish PCNA subunits by labeling them "PCNA2" or "PCNA1" (see Page 8 lines 180,181 for an example). This is confusing because the names of the RFC subunits have similar formats (RFC2, RFC3, RFC4, etc). In the case of RFC this denotes unique genes, whereas PCNA is a homotrimer. Could the authors think of another way to denote the different subunits, such as super/subscript? PCNA-I, PCNA-II, PCNA-III? 

      We thank the reviewer for pointing out the confusing nomenclature. Following the referee's suggestion, we now refer to the full CTF18 complex as “CTF18-RFC”. We prefer keeping the nomenclature used for CTF18-RFC subunits as RFC2, RFC3 etc., as recently used in Yuan et al, Science, 2024. However, we followed the referee’s suggestion for PCNA subunits, now referred to as PCNA-I, PCNA-II and PCNA-III.

      (2) I believe that the authors are over-interpreting their data in Figure 1. The claim that "less sharp definition" of the map corresponding to the AAA+ domain of Ctf18 supports a relatively high mobility of this subunit is largely unsubstantiated. There are several reasons why one could get varying resolution in a cryo-EM reconstruction, such as compositional heterogeneity, preferred orientation artifacts, or how the complex interacts with the air-water interface. If other data were presented that showed this subunit is flexible, this evidence would support that data but cannot alone serve as justification for subunit mobility. Along these lines, how was the buried surface area (2300 vs 1400 Å²) calculated? Is this the total surface area or only the buried surface area involving the AAA+ domains? It is surprising that these numbers are so different considering that the subunits and complexes look so similar (Figures 1c and 2b).

      We respectfully disagree with the suggestion that our interpretation of local flexibility in the AAA+ domain of Ctf18 is overreaching. Several lines of evidence support this interpretation. First, compositional heterogeneity is unlikely, as the A′ domain of Ctf18 is well-resolved and forms stable interactions with RFC3, indicating that Ctf18 is consistently incorporated into the complex. Second, preferred orientation artifacts are excluded, as the particle distribution shows excellent angular coverage (Fig. S9a). Third, we now include a 3D variability analysis (3DVA; Supplementary Video 1), which reveals local conformational heterogeneity centered around the AAA+ domain of Ctf18, consistent with intrinsic flexibility.

      Regarding the buried surface area values, the reported numbers refer specifically to the interfaces between the AAA+ domain of Ctf18 and RFC2, and are derived from buried surface area calculations performed with PISA. The smaller interface (~1400 Ų) compared to RFC1–RFC2 (~2300 Ų) reflects low sequence identity (~26%) and divergent structural features, including the absence of conserved elements such as the canonical PIP-box in Ctf18. We have clarified and expanded this explanation in the revised manuscript (Page 7).

      (3) The authors very briefly discuss interactions with PCNA and how the CTF18-RFC complex differs from the RFC complex. This is amongst the most interesting results from their work, but also not well-developed. Moreover, Figure 3D describing these interactions is extremely unclear. I feel like this observation had potential to be interesting, but is largely ignored by the authors. 

      We thank the referee for pointing this out. We have expanded the section describing the interactions of CTF18-RFC and PCNA (Page 9 in the new manuscript), and made a new panel figure with further details (Fig. 3D).  

      (4) The authors make the observation that key ATP-binding residues in RFC4 are displaced and incompatible with nucleotide binding in their CTF18-RFC structure compared to the hRFC structure. This should be a main-text figure showing these displacements and how it is incompatible with ATP binding. Again, this is likely an interesting finding that is largely glossed over by the authors. 

      We now discuss this feature in detail (Page 11 in the new manuscript), and added two figure insets (Fig. 4c) describing the incompatibility of RFC4 with nucleotide binding.

      (5) The authors claim that the work of another group (citation 50) "validate(s) our predictions regarding the significant similarities between CTF18-RFC and canonical RFC in loading PCNA onto a ss/dsDNA junction." However, as far as this reviewer can tell the work in citation 50 was posted online before the first draft of this manuscript appeared on biorxiv, so it is dubious to claim that these were "predictions." 

      We agree with the referee about this claim. We have now revised the text as follows:

      “While our work was being finalized, several cryo-EM structures of human CTF18-RFC bound to PCNA and primer/template DNA were reported by another group (He et al, PNAS, 2024). These findings are consistent with the distinct features of CTF18-RFC observed in our structures and independently support the notion of significant mechanistic similarity between CTF18-RFC and canonical RFC in loading PCNA onto a ss/dsDNA junction”.

      (6) The authors use a primer extension assay to test the effects of truncating the N-terminal beta hairpin of CTF18. However, this assay is only a proxy for loading efficiency and the observed effects of the mutation are rather subtle. The authors could test their hypothesis more clearly if they performed an ATPase assay or even better a clamp loading assay. 

      We thank the referee for this valuable suggestion. In response, we have performed clamp loading assays comparing the activities of human RFC, wild-type CTF18-RFC, and the β-hairpin–truncated CTF18-RFC mutant. The results, now presented in Fig. 6 and Table 1 of the revised manuscript, clearly show that truncation of the N-terminal β-hairpin results in a slower rate of PCNA loading. We propose that this reduced loading rate likely contributes to the diminished Pol ε–mediated DNA synthesis observed in the primer extension assays.

      Minor points 

      (1) Page 3 line 53 the introduction suggests that ATP hydrolysis prompts clamp closure. While this may be the case, to my knowledge all recent structural work shows that closure can occur without ATP hydrolysis. It may be better to rephrase it to highlight that under normal loading conditions, ATP hydrolysis occurs before clamp closure. 

      The text now reads (Page 3): 

      “DNA binding prompts the closure of the clamp and hydrolysis of ATP induces the concurrent disassembly of the closed clamp loader from the sliding clamp-DNA complex, completing the cycle necessary for the engagement of the replicative polymerases to start DNA synthesis.”

      (2) Page 3 line 60, I do not see how the employment of alternative loaders highlights the specificity of the loading mechanism - would it not be possible for multiple loaders to have promiscuous clamp loading? 

      We thank the referee for this comment. The text now reads (Page 3):

      “However, eukaryotes also employ alternative loaders (20), including CTF18-RFC (6, 21-24), which likely use a conserved loading mechanism but are functionally specialized through specific protein interactions and context-dependent roles in DNA replication.”

      (3) Page 4 line 75 could you please cite a study that shows Ctf8 and Dcc1 bind to the Ctf18 C-terminus and that a long linker is predicted to be flexible? 

      Two references have been added (Stokes et al, NAR, 2020 and Grabarczyk et al, Structure, 2018)

      (4) Figure 2A has the N-terminal region of Ctf18 as bound to RFC3 but should likely be labeled as bound to RFC5. This caused significant confusion while trying to parse this figure. Further, the inclusion of "X" as a sequence - does this refer to a sequence that was not buildable in the cryo-EM map? I would be surprised that density immediately after the conserved DEXX box motif is unbuildable. If this is the case, it should be clearly stated in the figure legend that "X" denotes an unbuildable sequence. For the conserved beta-hairpin in the sequence, could the authors superimpose the AlphaFold prediction onto their structure? It would be more informative than just looking at the sequence. 

      We apologize for this confusion. The error in Figure 2A has been corrected. The figure caption now explicitly says that “X” refers to amino acid residues in the sequence which were not modelled. A superposition of the cryo-EM model of the N-terminal β-hairpin in human Ctf18 and AlphaFold predictions for this feature in Drosophila and yeast Ctf18 is now presented in Figure 2A.

      (5) Page 8 line 168, the use of the term "RFC5" here feels improper, since the "C" subunit is not RFC5 in all lower eukaryotes (see comment above about nomenclature). For instance, in S cerevisiae, the C subunit is RFC3. I would expect this interaction to be maintained in all C subunits, not all RFC5 subunits. 

      The text now reads (Page 8):

      “Therefore, lower eukaryotes may use a similar β-hairpin motif to bind the corresponding subunit of the RFC-module complex (RFC5 in human, Rfc3 in S. cerevisiae), emphasizing its importance.”

      (6) Page 10 line 228, the authors claim that hydrolysis is dispensable at the Ctf18/RFC2 interface based on evidence from RFC1/RFC2 interface, by analogy that this is the "A/B" interface in both loaders. However, the wording makes it sound as if the cited data were collected while studying Ctf18 loaders. The authors should clarify this point. 

      The text has been modified as follows (Page 11): 

      “Prior research has indicated that hydrolysis at the large subunit/RFC2 interface is not essential for clamp loading by various loaders (48-51), while the others are critical for the clamp-loading activity of eukaryotic RFCs.”

      (7) Page 11 line 243/244 the authors introduce the separation pin. Could they clarify whether Ctf18 contains any aromatic residues in this structural motif that would suggest it serves the same functional purpose? Also, the authors highlight this is similar to yeast RFC, which makes it sound like this is not conserved in human RFC, but the structural motif is also conserved in human RFC. 

      We thank the reviewer for this helpful comment. We have clarified in the revised text (Page 12) that the separation pin is conserved not only in yeast RFC but also in human RFC, and now note that human Ctf18 also harbors aromatic residues at the corresponding positions. This observation is supported by the new panel in Figure 4e.

      Minutia 

      (1) Page 2 line 37 please remove the word "and" before PCNA. 

      This has been corrected.

      (2) Please define AAA+ and update the language to clarify that not all pentameric AAA+ ATPases are clamp loaders. 

      AAA+ has been now defined (Page 3).

      (3) Page 4 line 86 Given the relatively weak interaction of Pol ε. 

      This has been corrected.

      (4) Page 8 line 204 the authors likely mean "leucine" and not "lysine". 

      We thank the reviewer for catching this. The error has been corrected.

      (5) Page 14 line 300, the authors claim that CTF18 utilizes three subunits but then list four. 

      We have corrected this.

      Reviewer #2 (Recommendations for the authors): 

      We thank the reviewer for their positive comments and valuable suggestions. The points raised by the referee have been addressed below.

      Major point: 

      (1) Please quantify Figure 6 and S9 from 3 independent repeats and determine the standard deviation to show the variability of the Ctf18 beta hairpin deletion.  The authors suggest that a suboptimal Ctf18 complex interaction with PCNA impacts the stability of the complex, but do not test this hypothesis. Could the suboptimal PIP motif in Ctf18 be changed to an improved motif and the impact tested in the primer extension assay? Although not essential, it would be a nice way to explore the mechanism. 

      We thank the reviewer for the suggestion. However, we note that Figure 6b (now 7b) already presents the quantification of the primer extension assay from three independent replicates, with error bars showing standard deviations, and includes the calculated rate of product accumulation. These data clearly indicate a 42% reduction in primer synthesis rate upon deletion of the Ctf18 β-hairpin.

      We agree that we do not provide direct evidence of impaired complex stability upon deletion of the Ctf18 β-hairpin. However, the 2D classification of the cryo-EM dataset (Figure S9) shows a marked reduction in the number of particles corresponding to intact CTF18-RFC–PCNA complexes in the β-hairpin deletion sample, with the majority of particles corresponding to free PCNA. This contrasts with the wild-type dataset, where complex particles are predominant. These findings indirectly suggest that deletion of the β-hairpin compromises the stability or assembly of the clamp-loader–clamp complex.

      We thank the reviewer for the valuable suggestion to mutate the weak PIP-box of Ctf18. While an interesting direction, we instead sought to directly test the mechanism by performing quantitative clamp loading assays. These assays revealed a significant reduction in the rate of PCNA loading by the CTF18<sup>Δ165–194</sup>-RFC mutant (Figure 6), supporting the conclusion that the β-hairpin contributes to productive PCNA loading. This loading delay likely underlies the reduced rate of primer extension observed in the Pol ε assay (Figure 7), consistent with impaired formation of processive polymerase–clamp complexes.

      (2) I did not see the method describing how the 2D classes were quantified to evaluate the impact of the Ctf18 beta hairpin deletion on complex formation. Please add the relevant information. 

      The relevant information has been added to the Method section:

      “For quantification of complex stability, the number of particles contributing to each 2D class was extracted from the classification metadata (Datasets 1 and 3). All classes showing isolated PCNA rings were summed and compared to the total number of particles in classes representing intact CTF18-RFC–PCNA complexes. This analysis was performed for both wild-type and β-hairpin deletion mutant datasets. Notably, no 2D classes corresponding to free PCNA were observed in the wild-type dataset, whereas in the mutant dataset, a substantial fraction of particles corresponded to isolated PCNA, suggesting reduced stability of the mutant complex.”
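      The particle tally described above can be illustrated with a minimal sketch. It assumes each 2D class has already been labeled by hand and that the per-class particle counts have been read from the classification metadata; the labels, the function names, and all counts below are hypothetical and are not taken from the actual datasets. Expressing the comparison as a fraction is one possible choice, not necessarily the one used in the manuscript.

```python
from collections import Counter


def summarize_classes(classes):
    """Sum particle counts per label from (label, n_particles) pairs."""
    totals = Counter()
    for label, n_particles in classes:
        totals[label] += n_particles
    return totals


def free_pcna_fraction(classes, free_label="free_PCNA", complex_label="complex"):
    """Fraction of particles in free-PCNA classes among free-PCNA plus intact-complex classes.

    Classes carrying any other label (e.g. junk) are ignored.
    """
    totals = summarize_classes(classes)
    denominator = totals[free_label] + totals[complex_label]
    if denominator == 0:
        raise ValueError("no particles in free-PCNA or complex classes")
    return totals[free_label] / denominator


# Hypothetical per-class particle counts, for illustration only.
mutant = [("complex", 1200), ("free_PCNA", 3000), ("free_PCNA", 1800), ("junk", 500)]
wild_type = [("complex", 4000), ("complex", 2500), ("junk", 300)]

print(free_pcna_fraction(mutant))
print(free_pcna_fraction(wild_type))
```

      A wild-type dataset with no free-PCNA classes gives a fraction of zero in this scheme, which matches the qualitative statement in the Methods text.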

      Minor point: 

      (1) Page 2, line 25. Detail what type of mobility is referred to. Do you mean flexibility in the EM-map? 

      We have clarified this. The text now reads:

      “The unique RFC1 (Ctf18) large subunit of CTF18-RFC, which based on the cryo-EM map shows high relative flexibility, is anchored to PCNA through an atypical low-affinity PIP box”

      (2) Page 4, line 82. Please introduce CMGE, or at least state what the abbreviation stands for. 

      This has been addressed.

      (3) Page 4, line 89. Specify that the architecture of the HUMAN CTF18-RFC module is not known, as the yeast one has been published. 

      At the time our study was initiated, the architecture of the human CTF18-RFC module was unknown. A structure of the human complex was published by another group during the final stages of our work and is now properly acknowledged in the Discussion.

      (4) Page 6. Is it possible to illustrate why the autoinhibited state cannot bind to DNA? A visual representation would be nice. 

      We thank the reviewer for this suggestion. Figure 4b in the original manuscript already illustrates why the autoinhibited, overtwisted conformation of the CTF18-RFC pentamer cannot accommodate DNA. In this state, the inner chamber of the loader is sterically occluded, precluding the binding of duplex DNA.

      Reviewer #3 (Recommendations for the authors): 

      We thank Reviewer #3 for their constructive feedback and positive overall assessment of our work.

      We also thank the reviewer for their remarks on the use of Mg depletion to halt hydrolysis. Magnesium is an essential cofactor for ATP hydrolysis, and its depletion is expected to effectively prevent catalysis by destabilizing the transition state, possibly more completely than the use of slowly hydrolysable analogues such as ATPγS. We have recently employed Mg<sup>2+</sup> depletion to successfully trap a pre-hydrolytic intermediate in a replicative AAA+ helicase engaged in DNA unwinding (Shahid et al., Nature, 2025). This precedent supports the rationale for our choice, and the reference has now been included in the revised manuscript.

      I think the authors deposited the FSC curve for the +Mg structure in the -Mg structure PDB/EMDB entry according to the validation report. 

      We thank the reviewer for their careful inspection of the deposition materials. The discrepancy in the deposited FSC curve has now been corrected, and the appropriate FSC curves have been assigned to the correct PDB/EMDB entries.

    1. Author response:

      The following is the authors’ response to the original reviews.

      Reviewer #1 (Public review): 

      This paper measures the positioning and diffusivity of RNaseE-mEos3.2 proteins in E. coli as a function of rifampicin treatment, compares RNaseE to other E. coli proteins, and measures the effect of changes in domain composition on this localization and motion. The straightforward study is thoroughly presented, including very good descriptions of the imaging parameters and the image analysis/modeling involved, which is good because the key impact of the work lies in presenting this clear methodology for determining the position and mobility of a series of proteins in living bacteria cells. 

      Thank you for the nice summary and positive feedback on the descriptions and methodology. 

      My key notes and concerns are listed below; the most important concerns are indicated with asterisks. 

      (1) The very start of the abstract mentions that the domain composition of RNase E varies among species, which leads the reader to believe that the modifications made to E. coli RNase E would be to swap in the domains from other species, but the experiment is actually to swap in domains from other E. coli proteins. The impact of this work would be increased by examining, for instance, RNase E domains from B. subtilis and C. crescentus as mentioned in the introduction. 

      Thank you for the suggestions. We agree that the sentence may convey an unintended expectation. Our original intention was to note the presence and absence of certain domains of RNase E (e.g. membrane-binding motif and CTD) vary across species, rather than the actual sequence variations. To avoid any misinterpretation, we decided to remove the sentence from the abstract. Using the domains of B. subtilis and C. crescentus RNase E in E. coli is a very interesting suggestion, but we will leave that for a future study. 

      (2) Furthermore, the introduction ends by suggesting that this work will modulate the localization, diffusion, and activity of RNase E for "various applications", but no applications are discussed in the discussion or conclusion. The impact of this work would be increased by actually indicating potential reasons why one would want to modulate the activity of RNase E. 

      Thank you for this suggestion. For example, an E. coli strain expressing membrane-bound RNase E without CTD can help stabilize mRNAs and enhance protein expression. In fact, this idea was used in a commercial BL21 cell line (Invitrogen’s One Shot BL21 Star), to increase the yield of protein expression. We also think that environmentally modulated MB% of RNase E can be useful for controlling the mRNA half-lives and protein expression levels in different conditions. We discussed these ideas at the end of the Discussion.

      (3) Lines 114 - 115: "The xNorm histogram of RNase E shows two peaks corresponding to each side edge of the membrane": "side edge" is not a helpful term. I suggest instead: "...corresponding to the membrane at each side of the cell" 

      Thank you. We made the suggested change.

      (4) A key concern of this reviewer is that, since membrane-bound proteins diffuse more slowly than cytoplasmic proteins, some significant undercounting of the % of cytoplasmic proteins is expected due to decreased detectability of the faster-moving proteins. This would not be a problem for the LacZ imaging where essentially all proteins are cytoplasmic, but would significantly affect the reported MB% for the intermediate protein constructs. How is this undercounting considered and taken into account? One could, for instance, compare LacZ vs. LacY (or RNase E) copy numbers detected in fixed cells to those detected in living cells to estimate it.  

      Thank you for raising this point and suggesting a possible way to address it. We compared the number of tracks for mEos3.2-fused proteins in live vs fixed cells to test for undercounting of cytoplasmic molecules. For WT RNase E, about 50% fewer molecules were detected in fixed cells than in live cells, which agrees with the expectation that fluorescent proteins lose their signal upon fixation. Similarly, the copy number of cytoplasmic RNase E (RNase E ΔMTS) was ~50% lower in fixed cells than in live cells. If cytoplasmic molecules were undercounted relative to membrane-bound molecules in live cells, fixation would reduce their copy number by less than 50%. The comparable reduction of ~50% indicates that the undercounting issue is not significant. This control analysis is provided in Figure S1B-C, and we made the corresponding textual change in the Results section as below:

      For this analysis, we first confirmed that proteins localized on the membrane and in the cytoplasm are detected with equal probability, despite differences in their mobilities (Fig. S1B-C). 
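      The logic of this control can be sketched as follows. The idea is that fixation removes the mobility difference, so if fast cytoplasmic molecules were missed in live cells, the fixed-to-live ratio for a cytoplasmic construct would exceed that of a membrane-bound one. The function names, the tolerance, and all counts below are hypothetical and serve only to make the comparison explicit; they are not the values behind Figure S1B-C.

```python
def retention(live_count, fixed_count):
    """Fraction of the live-cell count that is still detected after fixation."""
    if live_count <= 0:
        raise ValueError("live_count must be positive")
    return fixed_count / live_count


def suggests_undercounting(retention_membrane, retention_cytoplasmic, tolerance=0.1):
    """True if the cytoplasmic construct retains clearly more counts than the membrane one.

    A higher fixed/live ratio for the cytoplasmic construct would mean its
    live-cell count was too low, i.e. fast molecules were being missed.
    The tolerance is an arbitrary illustrative threshold.
    """
    return (retention_cytoplasmic - retention_membrane) > tolerance


# Hypothetical track counts, for illustration only.
r_membrane = retention(live_count=10000, fixed_count=5000)    # e.g. WT RNase E
r_cytoplasmic = retention(live_count=8000, fixed_count=4100)  # e.g. RNase E ΔMTS

print(r_membrane, r_cytoplasmic)
print(suggests_undercounting(r_membrane, r_cytoplasmic))
```

      With similar retention for both constructs, as reported in the response, this check returns no evidence of mobility-dependent undercounting.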

      (5) The rifampicin treatment study is not presented well. Firstly, it is found that LacY diffuses more rapidly upon rifampicin treatment. This change is attributed to changes in crowding at the membrane due to mRNA. Several other things change in cells after adding rif, including ATP levels, and these factors should be considered. More importantly, since the change in the diffusivity of RNaseE is similar to the change in diffusivity of LacY, then it seems that most of the change in RNaseE diffusion is NOT due to RNaseE-mRNA-ribosome binding, but rather due to whatever crowding/viscosity effects are experienced by LacY (along these lines: the error reported for D is SEM, but really should be a confidence interval, as in Figure 1, to give the reader a better sense of how different (or similar) 1.47 and 1.25 are). 

      We agree with the reviewer that upon rifampicin treatment, RNase E’s D increases to a similar extent as that of LacY. Hence, the increase likely arises from a factor common to both proteins. We have added the reviewer’s suggested interpretation as a possible explanation in the manuscript as below. 

      The similar fold change in D<sub>RNE</sub> and D<sub>LacY</sub> upon rif treatment suggests that the change in RNE diffusion may largely be attributed to physical changes in the intracellular environment (such as reduced viscosity or macromolecular crowding[41,42]), rather than a loss of RNA-RNE interactions.

      As requested by the reviewer, we have provided confidence intervals for our D values in Table S8. Because these intervals are very narrow, we chose to present the SEM as the error metric for D and have also reported the corresponding errors for the fold-change values whenever we describe the fold differences between D values. 

      (6) Lines 185-189: it is surprising to me that the CTD mutants both have the same change in D (5.5x and 5.3x) relative to their full-length counterparts since D for the membrane-bound WT protein should be much less sensitive to protein size than D for the cytoplasmic MTS mutant. Can the authors comment? 

      Perhaps the reviewer understood that these differences are the ratios between +/-CTD (e.g. WT RNE vs ΔCTD). However, the differences we mentioned were from membrane-bound vs cytoplasmic versions of RNase E with comparable sizes (e.g. WT RNase E vs RNase E ΔMTS). We modified the text and added a summary sentence at the end of the paragraph to clarify the point.

      We found that D<sub>ΔMTS</sub> is ~5.5 times that of D<sub>RNE</sub> (Fig. 3B). [...] Together, these results suggest that the membrane binding reduces RNE mobility by a factor of 5.

      That being said, we also noticed a similar fold difference between +/-CTD variants. Specifically, WT RNE vs RNE ΔCTD (both membrane-bound) show a ~4.1-fold difference and RNE ΔMTS vs RNE ΔMTS ΔCTD (both cytoplasmic) show a ~3.9-fold difference. We do not currently have a clear explanation for this pattern. Given that these two pairs have a similar change in mass, we speculate that the relationship between D and molecular mass may be comparable for membrane-bound and free-floating RNE variants. 

      (7) Lines 190-194. Again, the confidence intervals and experimental uncertainties should be considered before drawing biological conclusions. It would seem that there is "no significant change" in the rhlB and pnp mutants, and I would avoid saying "especially for ∆pnp" when the same conclusion is true for both (one shouldn't say 1.04 is "very minute" and 1.08 is just kind of small - they are pretty much the same within experiments like this). 

      Thank you for raising this point, which we fully agree with. That being said, we decided to remove results related to the degradosome proteins to improve the flow of the paper. We are preparing another paper related to the RNA degradosome complex formation. 

      (8) Lines 221-223 " This is remarkable because their molecular masses (and thus size) are expected to be larger than that of MTS" should be reconsidered: diffusion in a membrane does not follow the Einstein law (indeed lines 223-225 agree with me and disagree with lines 221-223). (Also the discussion paragraph starting at line 375). Rather, it is generally limited by the interactions with the transmembrane segments with the membrane. So Figure 3D does not contain the right data for a comparison, and what is surprising to me is that MTS doesn't diffuse considerably faster than LacY2. 

      We agree with the reviewer’s point that diffusion in a membrane does not follow the Stokes-Einstein law. That is why we introduced Saffman’s model. However, even in this model, larger proteins (by size or mass) should diffuse more slowly than smaller ones (the reason we presented Figure 3D, now 4D). In other words, both the Einstein and Saffman models predict that larger particles diffuse more slowly, although the exact scaling relationship differs between the two models. Here, we assume that mass is related to size. Contrary to Saffman’s model for membrane proteins, LacY2 diffuses faster than MTS despite its larger size. Using MD simulations, we showed that this discrepancy can be explained by different interaction energies, as the reviewer mentioned. This analysis further demonstrates that size is not the only factor to consider for protein diffusion in the membrane. We edited the paragraph to clarify the expectations and our interpretations.

      According to the Stokes-Einstein relation for diffusion in simple fluids[49] and the Saffman-Delbrück diffusion model for membrane proteins, D decreases as particle size increases, albeit with different scaling behaviors. […] Thus, if size (or mass) were the primary determinant of diffusion, LacY2 and LacY6 would diffuse more slowly than the smaller MTS. The observed discrepancy instead implies that D may be governed by how each motif interacts with the membrane. For example, the way that TM domains are anchored to the membrane may facilitate faster lateral diffusion with surrounding lipids. 
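      For reference, the two scaling behaviors mentioned above can be written in their standard textbook forms. The symbols are introduced here for illustration and do not appear in the response: r is the hydrodynamic radius of a particle in a fluid of viscosity η, a is the radius of the membrane-embedded inclusion, h is the membrane thickness, μ_m and μ_w are the viscosities of the membrane and of the surrounding aqueous medium, and γ ≈ 0.577 is Euler's constant.

```latex
% Stokes-Einstein relation: sphere of radius r in a simple fluid of viscosity \eta
D_{\mathrm{SE}} = \frac{k_B T}{6 \pi \eta r}

% Saffman-Delbrück model: cylindrical inclusion of radius a in a membrane of thickness h
D_{\mathrm{SD}} = \frac{k_B T}{4 \pi \mu_m h}
  \left[ \ln\!\left( \frac{\mu_m h}{\mu_w a} \right) - \gamma \right]
```

      In both expressions D falls as the particle grows, but the dependence is 1/r in a simple fluid and only logarithmic in a for a membrane inclusion, which is the weaker size dependence referred to as a different scaling behavior.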

      (9) The logical connection between the membrane-association discussion (which seems to ignore associations with other proteins in the cell) and the preceding +/- rifampicin discussion (which seeks to attribute very small changes to mRNA association) is confusing.

      Thank you for raising this point. We re-arranged the second Results section to present the effect of membrane binding on diffusion before the rifampicin results. Furthermore, we stated our hypothesis and expectations at the beginning of the Results section. This addition clarifies the logical flow.

      (10) Separately, the manuscript should be read through again for grammar and usage. For instance, the title should be: "Single-molecule imaging reveals the *roles* of *the* membrane-binding motif and *the* C-terminal domain of RNase E in its localization and diffusion in Escherichia coli". Also, some writing is unwieldy, for instance, "RNase E's D" would be easier to read if written as D_{RNaseE}. (underscore = subscript), and there is a lot of repetition in the sentence structures. 

      Thank you for catching grammar mistakes. We went through extensive proofreading to avoid these mistakes and also used simple notation suggested by the reviewer, such as D<sub>RNE</sub>, to make it easier to read. Thank you again for your suggestions.

      Reviewer #2 (Public review): 

      Summary: 

      Troyer and colleagues have studied the in vivo localisation and mobility of the E.coli RNaseE (a protein key for mRNA degradation in all bacteria) as well as the impact of two key protein segments (MTS and CTD) on RNase E cellular localisation and mobility. Such sequences are important to study since there is significant sequence diversity within bacteria, as well as a lack of clarity about their functional effects. Using single-molecule tracking in living bacteria, the authors confirmed that >90% of RNaseE localised on the membrane, and measured its diffusion coefficient. Via a series of mutants, they also showed that MTS leads to stronger membrane association and slower diffusion compared to a transmembrane motif (despite the latter being more embedded in the membrane), and that the CTD weakens membrane binding. The study also rationalised how the interplay of MTS and CTD modulate mRNA metabolism (and hence gene expression) in different cellular contexts. 

      Strengths: 

      The study uses powerful single-molecule tracking in living cells along with solid quantitative analysis, and provides direct measurements for the mobility and localisation of E.coli RNaseE, adding to information from complementary studies and other bacteria. The exploration of different membrane-binding motifs (both MTS and CTD) has novelty and provides insight on how sequence and membrane interactions can control function of protein-associated membranes and complexes. The methods and membrane-protein standards used contribute to the toolbox for molecular analysis in live bacteria. 

      Thank you for the nice summary of our work and positive comments about the paper’s strengths.

      Weaknesses: 

      The Results sections can be structured better to present the main hypotheses to be tested. For example, since it is well known that RNase E is membrane-localised (via its MTS), one expects its mobility to be mainly controlled by the interaction with the membrane (rather than with other molecules, such as polysomes and the degradosome). The results indeed support this expectation - however, the manuscript in its current form does not lay down the dominant hypothesis early on (see second Results chapter), and instead considers the rifampicin-addition results as "surprising"; it will be best to outline the most likely hypotheses, and then discuss the results in that light. 

      Thank you for this comment. We addressed this point by stating our main hypothesis from the beginning of the results section. We also agree with the reviewer that the membrane binding effect should be discussed first; hence, we re-arranged the result section. In the revised manuscript, we discuss the effect of membrane binding on diffusion first, followed by rif effects.

      Similarly, the authors should first discuss the different modes of interaction for a peripheral anchor vs a transmembrane anchor, outline the state of knowledge and possibilities, and then discuss their result; in its current version, the ms considers the LacY2 and LacY6 faster diffusion compared to MTS "remarkable", but considering the very different mode of interaction, there is no clear expectation prior to the experiment. In the same section, it would be good to see how the MD simulations capture the motion of LacY6 and LacY12, since this will provide a set of results consistent with the experimental set. 

      Thank you for pointing this out. In fact, there is little discussion in the literature about the different modes of interaction for a peripheral anchor vs a transmembrane anchor. To our knowledge, our work (experiments and MD simulations) is the first that directly compared the two to reveal that the peripheral anchor has higher interaction energy than the transmembrane anchor. We added a sentence “Despite the prevalence of peripheral membrane proteins, how they interact with the membrane and how this differs from TM proteins remain poorly understood”. Furthermore, we added the MD simulation result of LacY6 and LacY12 in Figure 4E-F.

      The work will benefit from further exploration of the membrane-RNase E interactions; e.g., the effect of membrane composition is explored by just using two different growth media (which on its own is not a well-controlled setting), and no attempts to change the MTS itself were made. The manuscript will benefit from considering experiments that explore the diversity of RNaseE interactions in different species; for example, the authors may want to consider the possibility of using the membrane-localisation signals of functional homologs of RNaseE in different bacteria (e.g., B. subtilis). It would be good to look at the effect of CTD deletions in a similar context (i.e., in addition to the MTS substitution by LacY2 and LacY6). 

      Thank you very much for this suggestion. During revision, we engineered point mutations in MTS and analyzed critical hydrophobic residues for membrane binding. We characterized MB% in both +/-CTD variants (Fig. 2 and Fig. S6) and their effect on lacZ mRNA degradation (Fig. 6). We will leave the use of membrane motif of B. subtilis RNase E for future study. 

      The manuscript will benefit from further discussion of the unstructured nature of the CTD, especially since the RNase CTD is well known to form condensates in Caulobacter crescentus; it is unclear how the authors excluded any roles for RNaseE phase separation in the mobility of RNaseE in E.coli cells. 

      Yes, we agree with the reviewer that the intrinsically disordered nature of the CTD might contribute to condensate formation. We explored this possibility using both epifluorescence microscopy (with a YFP fusion) and single-molecule imaging with cluster analysis (using an mEos3.2 fusion). Please see Figure S8. We did observe some weak de-clustering of RNase E upon CTD deletion. In the current study, we are unable to quantify the extent to which clustering contributes to the slow diffusion of RNase E. However, we speculate that the clustering may be linked to the low MB% of certain RNE mutants containing CTD, and we discussed this possibility in the Discussion.

      […] further supporting that the CTD decreases membrane association across RNE variants. We speculate that this effect may be related to the CTD’s role in promoting phase-separated ribonucleoprotein condensates, as observed in Caulobacter crescentus[19]. In E. coli, we also observed a modest increase in the clustering tendency of RNE compared to ΔCTD (Fig. S8). 

      Some statements in the Discussion require support with example calculations or toning down substantially. Specifically, it is not clear how the authors conclude that RNaseE interacts with its substrate for a short time (and what this time may actually be); further, the speculation about the MTS "not being an efficient membrane-binding motif for diffusion" lacks adequate support as it stands. 

      Thank you for these points. To elaborate on our point about the transient interaction between RNase E and RNA, we added the sentence “Specifically, if RNE interacts with mRNAs for ~20 ms or less, the slow-diffusing state would last shorter than the frame interval and remain undetected in our experiment.” We also added the following sentence to the Discussion:

      One possible explanation is that RNA-bound RNE (and RNase Y) is short-lived compared to our frame interval (~20 ms), unlike other RNA-binding proteins related to transcription and translation, interacting with RNA for ~1 min for elongation [48].

      In addition, we clarified the wording of the second statement that the reviewer pointed out, as follows:

      Lastly, the slow diffusion of the MTS in comparison to LacY2 and LacY6 suggests that MTS is less favorable for rapid lateral motion in the membrane. 

      Reviewer #3 (Public review): 

      Summary: 

      The manuscript by Troyer et al quantitatively measured the membrane localization and diffusion of RNase E, an essential ribonuclease for mRNA turnover as well as tRNA and rRNA processing in bacterial cells. Using single-molecule tracking in live E. coli cells, the authors investigated the impact of membrane targeting sequence (MTS) and the C-terminal domain (CTD) on the membrane localization and diffusion of RNase E under various perturbations. Finally, the authors tried to correlate the membrane localization of RNase E to its function on co- and post-transcriptional mRNA decay using lacZ mRNA as a model. 

      The major findings of the manuscripts include: 

      (1) WT RNase E is mostly membrane localized via MTS, confirming previous results. The diffusion of RNase E is increased upon removal of MTS or CTD, and more significantly increased upon removal of both regions. 

      (2) By tagging RNase E MTS and different lengths of LacY transmembrane domain (LacY2, LacY6, or LacY12) to mEos3.2, the results demonstrate that short LacY transmembrane sequence (LacY2 and LacY6) can increase the diffusion of mEos3.2 on the membrane compared to MTS, further supported by the molecular dynamics simulation. A similar trend was roughly observed in RNase E mutants with MTS switched to LacY transmembrane domains. 

      (3) The removal of RNase E MTS significantly increases the co-transcriptional degradation of lacZ mRNA, but has minimal effect on the post-transcriptional degradation of lacZ mRNA. Removal of CTD of RNase E overall decreases the mRNA decay rates, suggesting the synergistic effect of CTD on RNase E activity. 

      Strengths: 

      (1) The manuscript is clearly written with very detailed method descriptions and analysis parameters. 

      (2) The conclusions are mostly supported by the data and analysis. 

      (3) Some of the main conclusions are interesting and important for understanding the cellular behavior and function of RNase E. 

      Thank you for your thorough summary of our work and positive comments.

      Weaknesses: 

      (1) Some of the observations show inconsistent or context-dependent trends that make it hard to generalize certain conclusions. Those points are worth discussion at least. Examples include: 

      (a) The authors conclude that MTS segment exhibits reduced MB% when succinate is used as a carbon source compared to glycerol, whereas LacY2 segment maintains 100% membrane localization, suggesting that MTS can lose membrane affinity in the former growth condition (Ln 341-342). However, the opposite case was observed for the WT RNase E and RNase E-LacY2-CTD, in which RNase E-LacY2-CTD showed reduced MB% in the succinate-containing M9 media compared to the WT RNase E (Ln 264-267). This opposite trend was not discussed. In the absence of CTD, would the media-dependent membrane localization be similar to the membrane localization sequence or to the full-length RNase E? 

      This is a great point. Thank you for pointing out the discrepancy in the data. We think the weak membrane interaction of RNaseE-LacY2-CTD likely stems from structural instability in the presence of the CTD. Our data show that an RNase E variant with a cytoplasmic population under a normal growth condition exhibits a greater cytoplasmic fraction in a poor growth medium. In contrast, RNaseE-MTS and RNaseE-LacY2 lacking the CTD both showed 100% MB% under both normal and poor growth conditions. These results are presented in Figure S6 and further discussed in the Discussion section.

      The loss of MB% in LacY2-based RNE was observed only in the presence of the CTD (Fig. S6D), suggesting that the CTD negatively affects membrane binding of RNE, possibly by altering protein conformation. In fact, all ΔCTD RNE mutants we tested exhibited higher MB% than their CTD-containing counterparts (Fig. S6A-B). 

      (b) When using mEos3.2 reporter only, LacY2 and LacY6 both increase the diffusion of mEos3.2 compared to MTS. However, when inserting the LacY transmembrane sequence into RNase E or RNase E without CTD, only the LacY2 increases the diffusion of RNase E. This should also be discussed. 

      Thank you for raising this point. As the reviewer pointed out, as standalone membrane motifs, both LacY2 and LacY6 diffuse faster than the MTS, but when they are fused to RNE, only LacY2-based RNE diffuses faster than MTS-based RNE. We speculate that this is due to a structural reason: having four (large) LacY6 anchors in a tetrameric arrangement may cancel out the original fast-diffusing property of LacY6. We added this idea to the Results section:

      This result may be due to the high TM load (24 helices) created by four LacY6 anchors in the RNE tetramer. Although all constructs are tetrameric, the 24-helix load (LacY6), compared with 8 (LacY2) and 4 (MTS), likely enlarges the membrane-embedded footprint and increases drag, thereby changing the mobility advantages assessed as standalone membrane anchors.

      (2) The authors interpret that in some cases the increase in the diffusion coefficient is related to the increase in the cytoplasm localization portion, such as for the LacY2 inserted RNase E with CTD, which is rational. However, the authors can directly measure the diffusion coefficient of the membrane and cytoplasm portion of RNase E by classifying the trajectories based on their localizations first, rather than just the ensemble calculation. 

      Thank you for this suggestion. Currently, because of the 2D projection effect from imaging, we cannot clearly distinguish whether individual tracks are from the cytoplasm or from the inner membrane based on their localization. Therefore, we are unable to assign individual tracks as membrane-bound or cytoplasmic. However, we can demonstrate that the xNorm data can be separated into two different spatial populations based on the diffusion coefficient, D. That is, we can plot the xNorm of slow tracks versus the xNorm of fast tracks. This analysis showed that the slow tracks have LacY-like xNorm profiles while the fast tracks have LacZ-like xNorm profiles, which also quantitatively supports our MB% fitting results. We have added this analysis to Figure S2.
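      As an illustration of the split described above, the following is a minimal sketch and not the authors' analysis code. It assumes that each track has one xNorm value and one diffusion coefficient, that xNorm spans -1 to 1, and that a single hypothetical cutoff `d_threshold` separates slow from fast tracks; none of these details are specified in the response.

```python
import numpy as np

def split_xnorm_by_diffusion(xnorm, d_values, d_threshold):
    """Split per-track xNorm values into slow and fast groups.

    d_threshold is a hypothetical cutoff on the diffusion coefficient;
    the response does not state how slow and fast tracks were separated.
    """
    xnorm = np.asarray(xnorm, dtype=float)
    d_values = np.asarray(d_values, dtype=float)
    if xnorm.shape != d_values.shape:
        raise ValueError("xnorm and d_values must have the same shape")
    slow = d_values < d_threshold
    return xnorm[slow], xnorm[~slow]

def xnorm_histogram(xnorm, n_bins=20):
    """Normalized histogram of xNorm, assuming values lie in [-1, 1]."""
    counts, edges = np.histogram(xnorm, bins=n_bins, range=(-1.0, 1.0), density=True)
    centers = 0.5 * (edges[:-1] + edges[1:])
    return centers, counts
```

      The two resulting histograms could then be compared with reference profiles from a membrane protein and a cytoplasmic protein, as the response does with LacY and LacZ.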

      (3) The error bars of the diffusion coefficient and MB% are all SEM from bootstrapping, which are very small. I am wondering how much of the difference is simply due to a batch effect. Were the data mixed from multiple biological replicates? The number of biological replicates should also be reported. 

      Thank you for raising this point. In the original manuscript, we reported the number of tracks analyzed and noted that all data were from at least three separate biological replicates (measurements were repeated on at least three different days). Furthermore, in the revised manuscript, we have provided the number of cells imaged in Table S6. 
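      The reviewer's concern about small bootstrapped error bars and possible batch effects can be made concrete with a short sketch. The code below is illustrative only: it assumes a simple nonparametric bootstrap over per-track values and uses hypothetical replicate labels. It is not the authors' pipeline.

```python
import numpy as np

def bootstrap_sem(values, n_resamples=1000, seed=0):
    """Standard error of the mean from resampling values with replacement."""
    rng = np.random.default_rng(seed)
    values = np.asarray(values, dtype=float)
    n = values.size
    # Each row of idx is one bootstrap resample of the n observations.
    idx = rng.integers(0, n, size=(n_resamples, n))
    means = values[idx].mean(axis=1)
    return float(means.std(ddof=1))

def per_replicate_means(values, replicate_ids):
    """Mean per biological replicate, as a simple check for batch effects."""
    values = np.asarray(values, dtype=float)
    replicate_ids = np.asarray(replicate_ids)
    return {rep.item(): float(values[replicate_ids == rep].mean())
            for rep in np.unique(replicate_ids)}
```

      With many thousands of tracks, the bootstrapped SEM shrinks roughly as one over the square root of the number of tracks, so comparing the spread of the per-replicate means against that SEM is one way to judge whether day-to-day variation dominates.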

      (4) Some figures lack p-values, such as Figures 4 and 5C-D. Also, adding p-values directly to the bar graphs will make it easier to read. 

      Thank you for checking these details. We added p values in the graphs showing k<sub>d1</sub> and k<sub>d2</sub> (Table S7).

      Reviewer #2 (Recommendations for the authors): 

      Minor and technical points: 

      (1) Clarity and flow will be improved if each section first highlights the objective for the experiments that are described (e.g., line 240). 

      Thank you for the suggestion. We addressed this point by editing the beginning of each subsection in the Results. 

      (2) Line 272 (and elsewhere)."1.33-times faster is wrong". The authors mean 33% faster (from 0.075 to 1, see Figure 4G), and not 133% faster. Needs fixing. 

      Thanks for pointing this out. We changed this as well as other instances where we discuss fold differences. For example, this particular instance was changed to:

      Indeed, in the absence of the CTD, we found that the D of LacY2-based RNE was 1.33 ± 0.01 times as fast as the MTS-based RNE. 
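      The distinction the reviewer raised, between "1.33 times as fast" and "33% faster", is simple arithmetic. The helper names below are hypothetical and serve only to show the conversion:

```python
def fold_ratio(d_new, d_ref):
    """Ratio of two diffusion coefficients ("x times as fast")."""
    return d_new / d_ref

def percent_faster(d_new, d_ref):
    """Percent by which d_new exceeds d_ref ("x% faster")."""
    return (d_new / d_ref - 1.0) * 100.0
```

      A ratio of 1.33 therefore corresponds to a 33% increase, not a 133% increase.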

      (3) The authors need to consider the fitting of two species on their D population. e.g., how will a 93% - 7% split between diffusive species would have looked for the distribution in S4B? Note also the L1 profile in Fig S4C - while it is not hugely different from Figure S4B, the analysis gives a 41% amplitude for the fast-diffusing species. The 2-species analysis can also be used on some of the samples with much higher cytoplasmic components. Further, tracks that are in the more central region can be analysed to see whether the fast-diffusing species increase in amplitude. 

      Thank you for this comment. The D histograms of L1 and RNase E show a dominant peak at around 0.015, but L1 has a residual population in the shoulder (note the difference between L1’s experimental data and the D1 fit, the yellow line in what is now Figure S3B). This residual shoulder population is absent in the D histogram of RNase E. We also performed the two-species analysis suggested by the reviewer and provide the result in Figure S3C. The analysis shows that the two-population fit (black line) is very close to the one-population fit (yellow line). While we agree with the reviewer that subpopulation analysis is helpful for other proteins that show <90% MB% (a significant cytoplasmic population of >10%), we found it useful to divide the xNorm histogram into two populations based on diffusivity (rather than fitting two populations to the D histogram, which does not carry spatial information). This analysis, shown in Figure S2, supports our MB% fit results.

      (4) The authors suggest that the sequestration of RNaseE to the membrane limits its interaction with cytoplasmic mRNAs, and may increase mRNA lifetime. While this is true and supported by the authors' preprint (Ref15), it will also be good to consider (and discuss) that highly-transcribed regions are in the nucleoid periphery (and thus close to the membrane) and that ribosomes/polysomes are likewise predominantly peripheral (coregulation of transcription/translation) and membrane proximal. 

      This is an interesting point, which we appreciate very much. The lacZ gene, when induced, has been shown to move to the nucleoid periphery (Yang et al. 2019, Nat Comm). Also, in our preprint (Ref 15), we engineered lacZ to be closer to the membrane by translationally fusing it to lacY. However, the degradation rate of lacZ mRNA was not enhanced by the proximity to the membrane (for both k<sub>d1</sub> and k<sub>d2</sub>). For lacZ mRNA, we mainly see a change in k<sub>d1</sub> when RNE localization changes. We think this is due to the slow diffusion of the nascent mRNA (attached to the chromosome) and the slow diffusion of membrane-bound RNE, such that regardless of the location of the nascent mRNA, degradation by the membrane-bound RNE is inefficient. Only when RNE diffuses freely in the cytoplasm does k<sub>d1</sub> (the decay of nascent mRNAs) appear to increase.

      Reviewer #3 (Recommendations for the authors):

      (1) It will increase the clarity of the manuscript if the authors can provide better nomenclatures for different constructs, such as for different membrane targeting sequences fused to mEos3.2, full-length RNase E, or CTD-truncated RNase E. 

      Thank you for this suggestion. We agree that many constructs were discussed and that their naming can be confusing. To help with clarity, we have abbreviated RNase E as RNE throughout the text where appropriate. 

      (2) Line 342, Figure S7D should be cited instead of S6D. 

      Thank you for finding this error. We made a proper change in the revised manuscript.

    1. Author response:

      Reviewer #1 (Public review):

      Summary:

      The authors describe the results of a single study designed to investigate the extent to which horizontal orientation energy plays a key role in supporting view-invariant face recognition. The authors collected behavioral data from adult observers who were asked to complete an old/new face matching task by learning broad-spectrum faces (not orientation filtered) during a familiarization phase and subsequently trying to label filtered faces as previously seen or novel at test. This data revealed a clear bias favoring the use of horizontal orientation energy across viewpoint changes in the target images. The authors then compared different ideal observer models (cross-correlations between target and probe stimuli) to examine how this profile might be reflected in the image-level appearance of their filtered images. This revealed that a model looking for the best matching face within a viewpoint differed substantially from human data, exhibiting a vertical orientation bias for extreme profiles. However, a model forced to match targets to probes at different viewing angles exhibited a consistent horizontal bias in much the same manner as human observers.

      Strengths:

      I think the question is an important one: The horizontal orientation bias is a great example of a low-level image property being linked to high-level recognition outcomes, and understanding the nature of that connection is important. I found the old/new task to be a straightforward task that was implemented ably and that has the benefit of being simple for participants to carry out and simple to analyze. I particularly appreciated that the authors chose to describe human data via a lower-dimensional model (their Gaussian fits to individual data) for further analysis. This was a nice way to express the nature of the tuning function, favoring horizontal orientation bias in a way that makes key parameters explicit. Broadly speaking, I also thought that the model comparison they include between the view-selective and view-tolerant models was a great next step. This analysis has the potential to reveal some good insights into how this bias emerges and ask fine-grained questions about the parameters in their model fits to the behavioral data.

      We thank the reviewer for their positive appraisal of the importance of our research question as well as of the soundness of our approach to it.

      Weaknesses:

      I will start with what I think is the biggest difficulty I had with the paper. Much as I liked the model comparison analysis, I also don't quite know what to make of the view-tolerant model. As I understand the authors' description, the key feature of this model is that it does not get to compare the target and probe at the same yaw angle, but must instead pick a best match from candidates that are at different yaws. While it is interesting to see that this leads to a very different orientation profile, it also isn't obvious to me why such a comparison would be reflective of what the visual system is probably doing. I can see that the view-specific model is more or less assuming something like an exemplar representation of each face: You have the opportunity to compare a new image to a whole library of viewpoints, and presumably it isn't hard to start with some kind of first pass that identifies the best matching view first before trying to identify/match the individual in question. What I don't get about the view-tolerant model is that it seems almost like an anti-exemplar model: You specifically lack the best viewpoint in the library but have to make do with the other options. Again, this is sort of interesting and the very different behavior of the model is neat to discuss, but it doesn't seem easy to align with any theoretical perspective on face recognition. My thinking here is that it might be useful to consider an additional alternate model that doesn't specifically exclude the best-matching viewpoint, but perhaps condenses appearance across views into something like a prototype. I could even see an argument for something like the yaw-averages presented earlier in the manuscript as the basis for such a model, but this might be too much of a stretch. Overall, what I'd like to see is some kind of alternate model that incorporates the existence of the best-match viewpoint somehow, but without the explicit exemplar structure of the view-specific model.

      The view-tolerant model was designed so that identity needed to be abstracted away from variations in yaw to support face recognition. We believe this model aligns with the notion of tolerant recognition.

      The tolerance of identity recognition is presumably empowered by the internal representation of the natural statistics of identity, i.e. the stable traits and (idiosyncratic) variability of a face, which builds up through the varied encounters with a given face (Burton, Jenkins et al. 2005, Burton, Jenkins and Schweinberger 2011, Jenkins and Burton 2011, Jenkins, White et al. 2011, Burton, Kramer et al. 2016, Menon, Kemp and White 2018).

      The average of various images of a face provides its appearance distribution (i.e., variability) and central tendency (i.e., stable properties; Figure 1) and could be used as a reasonable proxy of its natural statistical properties (Burton, Jenkins et al. 2005). We thus believe that the alternate model proposed by the reviewer is relevant to existing theories of face identity recognition and agree that our current model observers do not fully capture this aspect. It is thus an excellent idea to examine the orientation tuning profile of a model observer that compares a specific view of a face to the average encompassing all views of a face identity. Since the horizontal range is proposed to carry the view-stable cues to identity, we expect that such a ‘viewpoint-average’ model observer will perform best with horizontally filtered faces and that its orientation tuning profile will significantly predict human performance across views. We expect the viewpoint-tolerant and viewpoint-average observers will behave similarly as they manifest the stability of the horizontal identity cues across variations in viewpoint.
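      One way such a ‘viewpoint-average’ observer could be set up is sketched below. This is a hypothetical illustration, not the model the authors implemented: it assumes images are stored as an array indexed by identity and view, that the template for each identity is the pixel-wise mean across views, and that matching uses a zero-lag normalized correlation.

```python
import numpy as np

def viewpoint_average_templates(faces):
    """Average over views. faces has shape (n_identities, n_views, H, W)."""
    return np.asarray(faces, dtype=float).mean(axis=1)

def match_identity(probe, templates):
    """Return the index of the best-correlated template and all scores."""
    templates = np.asarray(templates, dtype=float)
    probe = np.asarray(probe, dtype=float).ravel()
    probe = probe - probe.mean()
    flat = templates.reshape(templates.shape[0], -1)
    flat = flat - flat.mean(axis=1, keepdims=True)
    # Pearson correlation between the probe and each identity template.
    scores = flat @ probe / (np.linalg.norm(flat, axis=1) * np.linalg.norm(probe))
    return int(np.argmax(scores)), scores
```

      Running the same matching on orientation-filtered probes would give one performance value per orientation band, from which a tuning profile could be derived.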

      Besides this larger issue, I would also like to see some more details about the nature of the cross-correlation that is the basis for this model comparison. I mostly think I get what is happening, but I think the authors could expand more on the nature of their noise model to make more explicit what is happening before these cross-correlations are taken. I infer that there is a noise-addition step to get them off the ceiling, but I felt that I had to read between the lines a bit to determine this.

      The view-selective model responded correctly whenever it successfully matched a given face identity at a specific viewpoint to itself. Since there was an exact match in each trial, resulting in uninformative ceiling performance, we decreased the signal-to-noise ratio (SNR) of the target and probe images to .125 (face RMS contrast: .01; noise RMS contrast: .08). In every trial, target and probe faces were each combined with 10 different random noise patterns. SNR was adjusted so that the overall performance of the view-selective model was in the range of human performance. We will describe these important aspects in the methods and add a supplementary figure illustrating the d’ distributions of each model and of the human observers.
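      The noise step described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' code: RMS contrast is taken as the standard deviation of pixel intensities, the noise is assumed to be Gaussian white noise, the mean luminance of 0.5 is arbitrary, and the similarity measure is a zero-lag normalized cross-correlation. Only the contrast values (.01 and .08, giving an SNR of .125) come from the response.

```python
import numpy as np

def set_rms_contrast(image, target_rms):
    """Center the image and rescale it so its standard deviation is target_rms."""
    image = np.asarray(image, dtype=float)
    centered = image - image.mean()
    rms = centered.std()
    if rms == 0:
        raise ValueError("image has no contrast")
    return centered * (target_rms / rms)

def add_noise(face, face_rms=0.01, noise_rms=0.08, mean_luminance=0.5, rng=None):
    """Combine a face and a random noise pattern at SNR = face_rms / noise_rms."""
    rng = np.random.default_rng() if rng is None else rng
    signal = set_rms_contrast(face, face_rms)
    noise = set_rms_contrast(rng.standard_normal(signal.shape), noise_rms)
    return mean_luminance + signal + noise

def normalized_cross_correlation(a, b):
    """Zero-lag normalized cross-correlation between two images."""
    a = np.asarray(a, dtype=float).ravel()
    b = np.asarray(b, dtype=float).ravel()
    a = a - a.mean()
    b = b - b.mean()
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
```

      Calling `add_noise` ten times per image with independent noise would correspond to the 10 noise patterns per trial mentioned in the response.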

      Another thing that I think is worth considering and commenting on is the stimuli themselves and the extent to which this may limit the outcomes of their behavioral task. The use of the 3D laser-scanned faces has some obvious advantages, but also (I think) removes the possibility for pigmentation to contribute to recognition, removes the contribution of varying illumination and expression to appearance variability, and perhaps presents observers with more homogeneous faces than one typically has to worry about. I don't think these negate the current results, but I'd like the authors to expand on their discussion of these factors, particularly pigmentation. Naively, surface color and texture seem like they could offer diagnostic cues to identity that don't rely so critically on horizontal orientations, so removing these may mean that horizontal bias is particularly evident when face shape is the critical cue for recognition.

      We indeed removed surface color by converting the images to grayscale. While we acknowledge that the conversion to grayscale may have removed one potential source of surface information, it is unlikely that our stimuli fully eliminated the contribution of surface pigmentation in our study. Pigmentation refers to all surface reflectance properties (Russell, Sinha et al. 2006), and hue (color) is only one surface cue among others. The grayscale 3D laser-scanned faces used here still contained natural variations in crucial surface cues such as skin albedo (i.e., how light or dark the surface appears) and texture (i.e., spatial variation in how light is reflected). Both color and grayscale stimuli (2D face pictures or 3D laser-scanned faces like ours) have actually been used to disentangle the role of shape and surface cues in identity recognition (e.g., Troje and Bulthoff 1996, Vuong, Peissig et al. 2005, Russell, Sinha et al. 2006, Russell, Biederman et al. 2007, Jiang, Dricot et al. 2009).

      More fundamentally, we demonstrated that the diagnosticity of the horizontal range of face information is not restricted to the transmission of shape cues. Our recent work has indeed shown that the processing of both face shape and surface most critically relies on horizontal information (Dumont, Roux-Sibilon and Goffaux 2024).

      Reviewer #2 (Public review):

      This study investigates the visual information that is used for the recognition of faces. This is an important question in vision research and is critical for social interactions more generally. The authors ask whether our ability to recognise faces, across different viewpoints, varies as a function of the orientation information available in the image. Consistent with previous findings from this group and others, they find that horizontally filtered faces were recognised better than vertically filtered faces. Next, they probe the mechanism underlying this pattern of data by designing two model observers. The first was optimised for faces at a specific viewpoint (view-selective). The second was generalised across viewpoints (view-tolerant). In contrast to the human data, the view-specific model shows that the information that is useful for identity judgements varies according to viewpoint. For example, frontal face identities are again optimally discriminated with horizontal orientation information, but profiles are optimally discriminated with more vertical orientation information. These findings show human face recognition is biased toward horizontal orientation information, even though this may be suboptimal for the recognition of profile views of the face.

      One issue in the design of this study was the lowering of the signal-to-noise ratio in the view-selective observer. This decision was taken to avoid ceiling effects. However, it is not clear how this affects the similarity with the human observers.

      The view-selective model responded correctly whenever it successfully matched a given face identity at a specific viewpoint to itself. Since there was an exact match in each trial, resulting in uninformative ceiling performance, we decreased the signal-to-noise ratio (SNR) of the target and probe images to .125 (face RMS contrast: .01; noise RMS contrast: .08). In every trial, target and probe faces were each combined with 10 different random noise patterns. SNR was adjusted so that the overall performance of the view-selective model was in the range of human performance. We will describe these important aspects in the methods and add a supplementary figure illustrating the d’ distributions of each model and of the human observers.

      Another issue is the decision to normalise image energy across orientations and viewpoints. I can see the logic in wanting to control for these effects, but this does reflect natural variation in image properties. So, again, I wonder what the results would look like without this step.

      Energy of natural images is disproportionately distributed across orientations (e.g., Hansen, Essock et al. 2003). Images of faces cropped from their background as used here contain most of their energy in the horizontal range (Keil 2009, Goffaux and Greenwood 2016, Goffaux 2019). If not normalized after orientation filtering, such uneven distribution of energy would boost recognition performance in the horizontal range across views. Normalization was performed across our experimental conditions merely to avoid energy from explaining the influence of viewpoint on the orientation tuning profile.

      We are not aware of any systematic natural variations of energy across face views. To address this, we measured face average energy (i.e., RMS contrast) in the original stimulus set, i.e., before the application of any image processing or manipulation. Background pixels were excluded from these image analyses. Across yaws, we found energy to range between .11 and .14 on a 0 to 1 grayscale. This is moderate compared to the range of energy variations we measured across identities (from .08 to .18). This suggests that variations in energy across viewpoints are moderate compared to variations related to identity. It is unclear whether these observations are specific to our stimulus set or whether they are generalizable to faces we encounter in everyday life. They, however, indicate that RMS contrast did not substantially vary across views in the present study and suggest that RMS normalization is unlikely to have affected the influence of viewpoint on recognition performance.
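      The energy measurement described above can be illustrated with a short sketch. It assumes that RMS contrast is the standard deviation of pixel intensities on a 0 to 1 grayscale and that a boolean mask marks the face pixels; the function name and the mask are hypothetical, and the authors' exact definition may differ (some definitions, for example, divide by mean luminance).

```python
import numpy as np

def masked_rms_contrast(image, face_mask):
    """RMS contrast (pixel standard deviation) over face pixels only."""
    image = np.asarray(image, dtype=float)
    face_mask = np.asarray(face_mask, dtype=bool)
    pixels = image[face_mask]  # background pixels are excluded
    if pixels.size == 0:
        raise ValueError("face_mask selects no pixels")
    return float(pixels.std())
```

      Computing this value for every identity and yaw would give the two ranges compared in the response: the spread across views and the spread across identities.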

      Nonetheless, we acknowledge the importance of this issue regarding the trade-off between experimental control and stimulus naturalness, and we will refer to it explicitly in the methods section.

      Despite the bias toward horizontal orientations in human observers, there were some differences in the orientation preference at each viewpoint. For example, frontal faces were biased to horizontal (90 degrees), but other viewpoints had biases that were slightly off horizontal (e.g., right profile: 80 degrees, left profile: 100 degrees). This does seem to show that differences in statistical information at different viewpoints (more horizontal information for frontal and more vertical information for profile) do influence human perception. It would be good to reflect on this nuance in the data.

      Indeed, human performance data indicates that while identity recognition remains tuned to horizontal information, horizontal tuning shows some variation across viewpoints. We primarily focused on the first aspect because of its direct relevance to our research objective, but also discussed the second aspect: with yaw rotation, certain non-horizontal morphological features such as the jaw line or nose bridge, etc. may increasingly contribute to identity recognition, whereas at frontal or near frontal views, features are mostly horizontally-oriented (e.g., Keil 2008, Keil 2009). We will relate this part of the discussion more explicitly to the observation of the fluctuation of the peak location as a function of yaw.

    1. Author response:

      The following is the authors’ response to the previous reviews.

      Reviewer 1:

      The authors frequently refer to their predictions and theory as being causal, both in the manuscript and in their response to reviewers. However, causal inference requires careful experimental design, not just statistical prediction. For example, the claim that "algorithmic differences between those with BPD and matched healthy controls" are "causal" in my opinion is not warranted by the data, as the study does not employ experimental manipulations or interventions which might predictably affect parameter values. Even if model parameters can be seen as valid proxies to latent mechanisms, this does not automatically mean that such mechanisms cause the clinical distinction between BPD and CON, they could plausibly also refer to the effects of therapy or medication. I recommend that such causal language, also implicit to expressions like "parameter influences on explicit intentional attributions", is toned down throughout the manuscript.

      Thank you for this chance to be clearer in the language. Our models and paradigm introduce a form of temporal causality, given that latent parameter distributions are directly influenced by latent parameter estimates at a previous point in time (self-uncertainty and other-uncertainty directly govern social contagion). Nevertheless, we appreciate the reviewer’s perspective and have now toned down the language to reflect this.

      Abstract:

      ‘Our model makes clear predictions about the mechanisms of social information generalisation concerning both joint and individual reward.’

      Discussion:

      ‘We can simulate this by modelling a framework that incorporates priors based on both self and a strong memory impression of a notional other (Figure S3).’

      ‘We note a strength of this work is the use of model comparison to understand algorithmic differences between those with BPD and matched healthy controls.’

      Although the authors have now outlined the study's aims much more clearly, there is still a lack of clarity with respect to the authors' specific hypotheses. I understand that their primary predictions about disruptions to self-other generalisation processes underlying BPD are embedded in the four main models that are tested, but it is still unclear what specific hypotheses the authors had about group differences with respect to the tested models. I recommend the authors specify this in the introduction rather than referring to prior work where the same hypotheses may have been mentioned.

      Thank you for this further critique, which has enabled us to refine our introduction. We have now edited our introduction to be more direct about our hypotheses, how these hypotheses are instantiated in formal models, and what our predictions were. We have also included a small section on how previous predictions from other computational assessments of BPD link to our exploratory work, and highlighted this throughout the manuscript.

      ‘This paper seeks to address this gap by testing explicitly how disruptions in self-other generalization processes may underpin interpersonal disruptions observed in BPD. Specifically, our hypotheses were: (i) healthy controls will demonstrate evidence for both self-insertion and social contagion, integrating self and other information during interpersonal learning; and (ii) individuals with BPD will exhibit diminished self-other integration, reflected in stronger evidence for observations that assume distinct self-other representations.

      We tested these hypotheses by designing a dynamic, sequential, three-phase Social Value Orientation (Murphy & Ackerman, 2014) paradigm—the Intentions Game—that would provide behavioural signatures assessing whether BPD differed from healthy controls in these generalization processes (Figure 1A). We coupled this paradigm with a lattice of models (M1-M4) that distinguish between self-insertion and social contagion (Figure 1B), and performed model comparison:

      M1. Both self-to-other (self-insertion) and other-to-self (social contagion) occur before and after learning
      M2. Self-to-other transfer only occurs
      M3. Other-to-self transfer only occurs
      M4. Neither transfer process, suggesting distinct self-other representations

      We additionally ran exploratory analysis of parameter differences and model predictions between groups following from prior work demonstrating changes in prosociality (Hula et al., 2018), social concern (Henco et al., 2020), belief stability (Story et al., 2024a), and belief updating (Story, 2024b) in BPD to understand whether discrepancies in self-other generalisation influence observational learning. By clearly articulating our hypotheses, we aim to clarify the theoretical contribution of our findings to existing literature on social learning, BPD, and computational psychiatry.’

      Caveats should also be added about the exploratory nature of the many parameter group comparisons. If there are any predictions about group differences that can be made based on prior literature, the authors should make such links clear.

      Thank you for this. We have now included caveats in the text to highlight the exploratory nature of these group comparisons, and added direct links to relevant literature where able:

      Introduction

      ‘We additionally ran exploratory analysis of parameter differences and model predictions between groups following from prior work demonstrating changes in prosociality (Hula et al., 2018), social concern (Henco et al., 2020), belief stability (Story et al., 2024a), and belief updating (Story, 2024b) in BPD to understand whether discrepancies in self-other generalisation influence observational learning. By clearly articulating our hypotheses, we aim to clarify the theoretical contribution of our findings to existing literature on social learning, BPD, and computational psychiatry.’

      Model Comparison

      ‘We found that CON participants were best fit at the group level by M1 (Frequency = 0.59, Exceedance Probability = 0.98), whereas BPD participants were best fit by M4 (Frequency = 0.54, Exceedance Probability = 0.86; Figure 2A). This suggests CON participants are best fit by a model that fully integrates self and other when learning, whereas those with BPD are best explained as holding disintegrated and separate representations of self and other that do not transfer information back and forth.

      We first explore parameters between separate fits (see Methods). Later, in order to assuage concerns about drawing inferences from different models, we examined the relationships between the relevant parameters when we forced all participants to be fit to each of the models (in a hierarchical manner, separated by group). In sum, our model comparison is supported by convergence in parameter values when comparisons are meaningful (see Supplementary Materials). We refer to both types of analysis below.’
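      The 'Frequency' and 'Exceedance Probability' values quoted above are the kind of summaries produced by random-effects Bayesian model selection. The sketch below assumes that setting: given a Dirichlet posterior over model frequencies, the exceedance probability of a model is the probability that it is the most frequent one. The function name and the Dirichlet parameters are hypothetical and chosen for illustration only; they are not the values or code used in the manuscript.

```python
import numpy as np

def exceedance_probabilities(alpha, n_samples=100_000, seed=0):
    """Monte Carlo estimate of P(model k is the most frequent model).

    alpha: Dirichlet parameters of the posterior over model frequencies
           (hypothetical values here, not taken from the manuscript).
    """
    rng = np.random.default_rng(seed)
    # Each row is one plausible vector of model frequencies.
    samples = rng.dirichlet(alpha, size=n_samples)
    # Count how often each model has the largest frequency.
    winners = samples.argmax(axis=1)
    return np.bincount(winners, minlength=len(alpha)) / n_samples

# Illustrative posterior over M1-M4 in which the first model dominates.
alpha_example = np.array([50.0, 1.0, 1.0, 1.0])
expected_frequency = alpha_example / alpha_example.sum()
ep = exceedance_probabilities(alpha_example)
```

      The expected frequency is simply the normalised Dirichlet parameters, whereas the exceedance probability also reflects how uncertain those frequencies are, which is why the two numbers can differ substantially for the same model.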

      Phase 2 analysis

      ‘Prior work predicts those with BPD should focus more intently on public social information, rather than private information that only concerns one party (Henco et al., 2020). In BPD participants, only new beliefs about the relative reward preferences – mutual outcomes for both players - of partners differed (see Fig 2E): new median priors were larger than median preferences in phase 1 (mean = -0.47; = -6.10, 95%HDI: -7.60, -4.60).’

      ‘Models of moral preference learning (Story et al., 2024) predict that BPD vs non-BPD participants have more rigid beliefs about their partners. We found that BPD participants were equally flexible around their prior beliefs about a partner’s relative reward preferences (= -1.60, 95%HDI: -3.42, 0.23), and were less flexible around their beliefs about a partner’s absolute reward preferences (=-4.09, 95%HDI: -5.37, -2.80), versus CON (Figure 2B).’

      Phase 3 analysis

      ‘Prior work predicts that human economic preferences are shaped by observation (Panizza, et al., 2021; Suzuki et al. 2016; Yu et al, 2021), although little-to-no work has examined whether contagion differs for relative vs. absolute preferences. Associative models predict that social contagion may be exaggerated in BPD (Ereira et al., 2018).… As a whole, humans are more susceptible to changing relative preferences than selfish, absolute reward preferences, and this is disrupted in BPD.’

      Psychometric and Intentional Attribution analysis

      ‘Childhood trauma, persecution, and poor mentalising in BPD are all predicted to disrupt one’s ability to change (Fonagy & Luyten, 2009).’

      ‘Prior work has also predicted that partner-participant preference disparity influences mental state attributions (Barnby et al., 2022; Panizza et al., 2021).’

      I'm not sure I understand why the authors, after adding multiple comparison correction, now list two kinds of p-values. To me, this is misleading and defeats the purpose of multiple comparison correction; I therefore recommend they report the FDR-adjusted p-values only. Likewise, if a corrected p-value is greater than 0.05 this should not be interpreted as a result.

      We have now adjusted the exploratory results to include only the FDR corrected values in the text.

      ‘We assessed conditional psychometric associations with social contagion under the assumption of M3 for all participants. We conducted partial correlation analyses to estimate relationships conditional on all other associations and retained all that survived bootstrapping (5000 reps), permutation testing (5000 reps), and subsequent FDR correction. When not controlled for group status, RGPTSB and CTQ scores were both moderately associated with MZQ scores (RGPTSB r = 0.41, 95%CI: 0.23, 0.60, p[fdr]=0.043; CTQ r = 0.354, 95%CI: 0.13, 0.56, p[fdr]=0.02). This was not affected by group correction. CTQ scores were moderately and negatively associated with shifts in individualistic reward preferences (; r = -0.25, 95%CI: -0.46, -0.04, p[fdr]=0.03). This was not affected by group correction. MZQ scores were in turn moderately and negatively associated with shifts in prosocial-competitive preferences () between phase 1 and 3 (r = -0.26, 95%CI: -0.46, -0.06, p[fdr]=0.03). This was diminished when controlled for group status (r = 0.13, 95%CI: -0.34, 0.08, p[fdr]=0.20). Together this provides some evidence that self-reported trauma and self-reported mentalising influence social contagion (Fig S11). Social contagion under M3 was highly correlated with contagion under M1 demonstrating parsimony of outcomes across models (Fig S12).

      Prior work has predicted that partner-participant preference disparity influences mental state attributions (Barnby et al., 2022; Panizza et al., 2021). We tested parameter influences on explicit intentional attributions in Phase 2 while controlling for group status. Attributions included the degree to which they believed their partner was motivated by harmful intent (HI) and self-interest (SI). According with prior work (Barnby et al., 2022), greater disparity of absolute preferences before learning was associated on a trend level with reduced attributions of SI (<= -0.23, p[fdr]=0.08), and greater disparity of relative preferences before learning exaggerated attributions of HI (= 0.21, p[fdr]=0.08), but did not survive correction (Figure S4B). This is likely due to partners being significantly less individualistic and prosocial on average compared to participants (= -5.50, 95%HDI: -7.60, -3.60; = 12, 95%HDI: 9.70, 14.00); partners are recognised as less selfish and more competitive.’

      Can the authors please elaborate why the algorithm proposed to be employed by BPD is more 'entropic', especially given both their self-priors and posteriors about partners' preferences tended to be more precise than the ones used by CON? As far as I understand, there's nothing in the data to suggest BPD predictions should be more uncertain. In fact, this leads me to wonder, similarly to what another reviewer has already suggested, whether BPD participants generate self-referential priors over others in the same way CON participants do, they are just less favourable (i.e., in relation to oneself, but always less prosocial) - I think there is currently no model that would incorporate this possibility? It should at least be possible to explore this by checking if there is any statistical relationship between the estimated θ_ppt^m and p(θ_par | D^0).

      Thank you for this opportunity to be clearer in our wording. We believe the reviewer is referring to this line in the discussion: ‘In either case, the algorithm underlying the computational goal for BPD participants is far higher in entropy and emphasises a less stable or reliable process of inference.’

      We note in the revised Figure 2 panel E and in the results that those with BPD under M4 show insertion along absolute reward (they still expect diminished selfishness in others), but neutral priors over relative reward (around 0, suggesting expectations of neither prosocial nor competitive tendencies of others). Thus, θ_ppt^m (self preference) and θ_par^m (other preference) are tightly associated for absolute, but not relative reward.

      In our wording, we meant that whether under model M4 or M1, those with BPD either show a neutral prior over relative reward (M4) or a prior with large variance over relative reward (M1), showing expectations of difference between themselves and their partner. In both cases, expectation about a partner’s absolute reward preferences is diminished vs. CON participants. We have strengthened our language in the discussion to clarify this:

      ‘In either case, the algorithm underlying the computational goal for BPD participants is far higher in uncertainty, whether through a neutral central tendency (M4) or large variance (M1) prior over relative reward in phase 2, and emphasises a less certain and reliable expectation about others.’

      "To note, social contagion under M3 was highly correlated with contagion under M1 (see Fig S11). This provides some preliminary evidence that trauma impacts beliefs about individualism directly, whereas trauma and persecutory beliefs impact beliefs about prosociality through impaired trait mentalising" - I don't understand what the authors mean by this, can they please elaborate and add some explanation to the main text?

      We have now clarified this in the text:

      ‘Together this provides some evidence that self-reported trauma and self-reported mentalising influence social contagion (Fig S11). Social contagion under M3 was highly correlated with contagion under M1 demonstrating parsimony of outcomes across models (Fig S12).’

      I noted that at least some of the newly added references have not been added to the bibliography (e.g., Hitchcock et al. 2022).

      Thank you for noticing this omission. We have now ensured all cited works are in the reference list.

      Reviewer 2:

      The paper is not based on specific empirical hypotheses formulated at the outset, but, rather, it uses an exploratory approach. Indeed, the task is not chosen in order to tackle specific empirical hypotheses. This, in my view, is a limitation since the introduction reads a bit vague and it is not always clear which gaps in the literature the paper aims to fill. As a further consequence, it is not always clear how the findings speak to previous theories on the topic.

      As I wrote in the public review, however, I believe that an important limitation of this work is that it was not based on testing specific empirical hypotheses formulated at the outset, and on selecting the experimental paradigm accordingly. This is a limitation because it is not always clear which gaps in the literature the paper aims to fill. As a consequence, although it has improved substantially compared to the previous version, the introduction remains a bit vague. As a further consequence, it is not always clear how the findings speak to previous theories on the topic. Still, despite this limitation, the paper has many strengths, and I believe it is now ready for publication.

      Thank you for this further critique. We appreciate your appraisal that the work has improved substantially and is ready for publication. We have nevertheless opted to clarify our introduction and a priori predictions throughout the manuscript (please see response to Reviewer 1).

      Reviewer 3:

      Although the authors note that their approach makes "clear and transparent a priori predictions," the paper could be improved by providing a clear and consolidated statement of these predictions so that the results could be interpreted vis-a-vis any a priori hypotheses.

      In line with comments from both Reviewer 1 and 2, we have clarified our introduction to make clear what our a priori predictions and hypotheses are for our core aims and exploratory analyses (see response to Reviewer 1).

      The approach of using a partial correlation network with bootstrapping (and permutation) was interesting, but the logic of the analysis was not clearly stated. In particular, there are large group (Table 1: CON vs. BPD) differences in the measures introduced into this network. As a result, it is hard to understand whether any partial correlations are driven primarily by mean differences in severity (correlations tend to be inflated in extreme groups designs due to the absence of observation in middle of scales forming each bivariate distribution). I would have found these exploratory analyses more revealing if group membership was controlled for.

      Thank you for this chance to be clearer in our methods. We have now written a more direct exposition of this exploratory method:

      ‘Exploratory Network Analysis

      To understand the individual differences of trait attributes (MZQ, RGPTSB, CTQ) with other-to-self information transfer () across the entire sample we performed a network analysis (Borsboom, 2021). Network analysis allows for conditional associations between variables to be estimated; each association is controlled for by all other associations in the network. It also allows for visual inspection of the conditional relationships to get an intuition for how variables are interrelated as a whole (see Fig S11). We implemented network analysis with the bootnet package in R using the ‘estimateNetwork’ function with partial correlations (Epskamp, Borsboom & Fried, 2018). To assess the stability of the partial correlations we further implemented bootstrap resampling with 5000 repetitions using the ‘bootnet’ function. We then additionally shuffled the data and refitted the network 5000 times to determine a p_permuted value; this indicates the probability that a conditional relationship in the original network was within the null distribution of each conditional relationship. We then performed False Discovery Rate correction on the resulting p-values. We additionally controlled for group status for all variables in a supplementary analysis (Table S4).’
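      The logic of the pipeline described above (partial correlations, a permutation null, then FDR correction) can be sketched in a few lines. This is a minimal NumPy illustration, not the bootnet implementation the authors used: the function names are hypothetical, the permutation scheme (shuffling each column independently) is an assumption about what 'shuffled the data' means, and the Benjamini-Hochberg procedure is assumed as the FDR method.

```python
import numpy as np

def partial_correlations(data):
    """Partial correlation matrix from the inverse correlation matrix.

    data: array of shape (n_observations, n_variables).
    Each off-diagonal entry is the association between two variables
    conditional on all other variables in the network.
    """
    corr = np.corrcoef(data, rowvar=False)
    precision = np.linalg.inv(corr)
    d = np.sqrt(np.diag(precision))
    pcor = -precision / np.outer(d, d)
    np.fill_diagonal(pcor, 1.0)
    return pcor

def permutation_p_values(data, n_perm=1000, seed=0):
    """Two-sided permutation p-value for every partial correlation.

    Assumption: the null is built by shuffling each column independently,
    which breaks all associations between variables.
    """
    rng = np.random.default_rng(seed)
    observed = partial_correlations(data)
    exceed = np.zeros_like(observed)
    for _ in range(n_perm):
        shuffled = np.column_stack([rng.permutation(col) for col in data.T])
        exceed += np.abs(partial_correlations(shuffled)) >= np.abs(observed)
    # Add-one correction keeps p-values strictly above zero.
    return (exceed + 1) / (n_perm + 1)

def fdr_bh(p_values):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    p = np.asarray(p_values, dtype=float)
    n = p.size
    order = np.argsort(p)
    ranked = p[order] * n / np.arange(1, n + 1)
    # Enforce monotonicity, working down from the largest p-value.
    ranked = np.minimum.accumulate(ranked[::-1])[::-1]
    adjusted = np.empty(n)
    adjusted[order] = np.clip(ranked, 0.0, 1.0)
    return adjusted
```

      In use, only the unique off-diagonal entries of the permutation p-value matrix (for example, those selected with np.triu_indices with k=1) would be passed to fdr_bh, so that each edge is counted once in the correction.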

      We have also further corrected for group status and reported these results as a supplementary table, and also within the main text alongside the main results. We have opted to relegate Figure 4 into a supplementary figure to make the text clearer.

      ‘We explored conditional psychometric associations with social contagion under the assumption of M3 for all participants (where everyone is able to be influenced by their partner). We conducted partial correlation analyses to estimate relationships conditional on all other associations and retained all that survived bootstrapping (5000 reps), permutation testing (5000 reps), and subsequent FDR correction. When not controlled for group status, RGPTSB and CTQ scores were both moderately associated with MZQ scores (RGPTSB r = 0.41, 95%CI: 0.23, 0.60, p[fdr]=0.043; CTQ r = 0.354, 95%CI: 0.13, 0.56, p[fdr]=0.02). This was not affected by group correction. CTQ scores were moderately and negatively associated with shifts in individualistic reward preferences (; r = -0.25, 95%CI: -0.46, -0.04, p[fdr]=0.03). This was not affected by group correction. MZQ scores were in turn moderately and negatively associated with shifts in prosocial-competitive preferences () between phase 1 and 3 (r = -0.26, 95%CI: -0.46, -0.06, p[fdr]=0.03). This was diminished when controlled for group status (r = 0.13, 95%CI: -0.34, 0.08, p[fdr]=0.20). Together this provides some evidence that self-reported trauma and self-reported mentalising influence social contagion (Fig S11). Social contagion under M3 was highly correlated with contagion under M1 demonstrating parsimony of outcomes across models (Fig S12).’

      Discussion first para: "effected -> affected"

      Thanks for spotting this. We have now changed it.

      Add "s" to "participant": "Notably, despite differing strategies, those with BPD achieved similar accuracy to CON participant."

      We have now changed this.

    1. Author response:

      The following is the authors’ response to the original reviews.

      Reviewer #1 (Public review):

      Summary:

      Measurement of BOLD MR imaging has regularly found regions of the brain that show reliable suppression of BOLD responses during specific experimental testing conditions. These observations are to some degree unexplained, in comparison with more usual association between activation of the BOLD response and excitatory activation of the neurons (most tightly linked to synaptic activity) in the same brain location. This paper finds two patients whose brains were tested with both non-invasive functional MRI and with invasive insertion of electrodes, which allowed the direct recording of neuronal activity. The electrode insertions were made within the fusiform gyrus, which is known to process information about faces, in a clinical search for the sites of intractable epilepsy in each patient. The simple observation is that the electrode location in one patient showed activation of the BOLD response and activation of neuronal firing in response to face stimuli. This is the classical association. The other patient showed an informative and different pattern of responses. In this person, the electrode location showed a suppression of the BOLD response to face stimuli and, most interestingly, an associated suppression of neuronal activity at the electrode site.

      Strengths:

      Whilst these results are not by themselves definitive, they add an important piece of evidence to a long-standing discussion about the origins of the BOLD response. The observation of decreased neuronal activation associated with negative BOLD is interesting because, at various times, exactly the opposite association has been predicted. It has been previously argued that if synaptic mechanisms of neuronal inhibition are responsible for the suppression of neuronal firing, then it would be reasonable

      Weaknesses:

      The chief weakness of the paper is that the results may be unique in a slightly awkward way. The observation of positive BOLD and neuronal activation is made at one brain site in one patient, while the complementary observation of negative BOLD and neuronal suppression actually derives from the other patient. Showing both effects in both patients would make a much stronger paper.

      We thank reviewer #1 for their positive evaluation of our paper. Obviously, we agree with the reviewer that the paper would be much stronger if BOTH effects – spike increase and decrease – were found in BOTH patients in their corresponding fMRI regions (lateral and medial fusiform gyrus) (also in the same hemisphere). Nevertheless, we clearly acknowledge this limitation in the (revised) version of the manuscript (p.8: Material and Methods section).

      Note that with respect to the fMRI data, our results are not surprising, as we indicate in the manuscript: BOLD increases to faces (relative to nonface objects) are typically found in the LatFG and BOLD decreases in the medialFG. In the revised version, we have added the reference to an early neuroimaging paper that describes this dissociation clearly:

      Pelphrey, K. A., Mack, P. B., Song, A., Güzeldere, G., & McCarthy, G. Faces evoke spatially differentiated patterns of BOLD activation and deactivation. Neuroreport 14, 955–959 (2003).

      This pattern of increase/decrease in fMRI can be appreciated in both patients in Figure 2, although one has to consider both the transverse and coronal slices to appreciate it.

      Regarding electrophysiological data, in the current paper, one could think that P1 shows only increases to faces, and P2 would show only decreases (irrespective of the region). However, that is not the case since 11% of P1’s face-selective units are decreases (89% are increases) and 4% of P2’s face-selective units are increases. This has now been made clearer in the revised manuscript (p.5).

      As the reviewer is certainly aware, the number and positions of the electrodes are based on strict clinical criteria, and we will probably never encounter a situation with two neighboring macro-micro hybrid electrodes, one with microelectrodes ending up in the lateral MidFG, the other in the medial MidFG, in the same patient. If there is no clinical value for the patient, this cannot be done.

      The only thing we can do is to strengthen these results in the future by collecting data on additional patients with an electrode either in the lateral or the medial FG, together with fMRI. But these are the only two patients we have been able to record so far with electrodes falling unambiguously in such contrasted regions and with large (and comparable) measures.

      While we acknowledge that the results may be unique because of the use of 2 contrasted patients only (and this is why the paper is a short report), the data is compelling in these 2 cases, and we are confident that it will be replicated in larger cohorts in the future.

      Finally, information regarding ethics approval has been provided in the paper.

      Reviewer #2 (Public review):

      Summary:

      This is a short and straightforward paper describing BOLD fMRI and depth electrode measurements from two regions of the fusiform gyrus that show either higher or lower BOLD responses to faces vs. objects (which I will call face-positive and face-negative regions). In these regions, which were studied separately in two patients undergoing epilepsy surgery, spiking activity increased for faces relative to objects in the face-positive region and decreased for faces relative to objects in the face-negative region. Interestingly, about 30% of neurons in the face-negative region did not respond to objects and decreased their responses below baseline in response to faces (absolute suppression).

      Strengths:

      These patient data are valuable, with many recording sessions and neurons from human face-selective regions, and the methods used for comparing face and object responses in both fMRI and electrode recordings were robust and well-established. The finding of absolute suppression could clarify the nature of face selectivity in human fusiform gyrus since previous fMRI studies of the face-negative region could not distinguish whether face < object responses came from absolute suppression, or just relatively lower but still positive responses to faces vs. objects.

      Weaknesses:

      The authors claim that the results tell us about both 1) face-selectivity in the fusiform gyrus, and 2) the physiological basis of the BOLD signal. However, I would like to see more of the data that supports the first claim, and I am not sure the second claim is supported.

      (1) The authors report that ~30% of neurons showed absolute suppression, but those data are not shown separately from the neurons that only show relative reductions. It is difficult to evaluate the absolute suppression claim from the short assertion in the text alone (lines 105-106), although this is a critical claim in the paper.

      We thank reviewer #2 for their positive evaluation of our paper. We understand the reviewer’s point, and we partly agree. Where we respectfully disagree is that the finding of absolute suppression is critical for the claim of the paper: finding an identical contrast between the two regions in terms of RELATIVE increase/decrease of face-selective activity in fMRI and spiking activity is already novel and informative. Where we agree with the reviewer is that the absolute suppression could be better documented: it wasn’t, due to space constraints (brief report). We provide below an example of a neuron showing absolute suppression to faces (P2), as also requested in the recommendations to authors. In the frequency domain, there is only a face-selective response (1.2 Hz and harmonics) but no significant response at 6 Hz (common general visual response). In the time domain, relative to face onset, the response drops below baseline level. This means that the neuron has baseline (non-periodic) spontaneous spiking activity that is actively suppressed when a face appears.

      Author response image 1.

      (2) I am not sure how much light the results shed on the physiological basis of the BOLD signal. The authors write that the results reveal "that BOLD decreases can be due to relative, but also absolute, spike suppression in the human brain" (line 120). But I think to make this claim, you would need a region that exclusively had neurons showing absolute suppression, not a region with a mix of neurons, some showing absolute suppression and some showing relative suppression, as here. The responses of both groups of neurons contribute to the measured BOLD signal, so it seems impossible to tell from these data how absolute suppression per se drives the BOLD response.

      It is a fact that we find both kinds of responses in the same region. We cannot tell with this technique if neurons showing relative vs. absolute suppression of responses are spatially segregated for instance (e.g., forming two separate sub-regions) or are intermingled. And we cannot tell from our data how absolute suppression per se drives the BOLD response. In our view, this does not diminish the interest and originality of the study, but the statement "that BOLD decreases can be due to relative, but also absolute, spike suppression in the human brain” has been rephrased in the revised manuscript: "that BOLD decreases can be due to relative, or absolute (or a combination of both), spike suppression in the human brain”.

      Reviewer #3 (Public review):

      In this paper, the authors conduct two experiments: an fMRI experiment and intracranial recordings of neurons in two patients, P1 and P2. In both experiments, they employ an SSVEP paradigm in which they show images at a fast rate (e.g. 6Hz) and then they show face images at a slower rate (e.g. 1.2Hz), where the rest of the images are a variety of object images. In the first patient, they record from neurons over a region in the mid fusiform gyrus that is face-selective, and in the second patient, they record neurons from a region more medially that is not face-selective (it responds more strongly to objects than faces). The results show similar selectivity between the electrophysiology data and the fMRI data, in that the location which shows higher fMRI responses to faces also contains face-selective neurons, and the location which shows a preference for non-faces also contains non-face-preferring neurons.

      Strengths:

      The data is important in that it shows that there is a relationship between category selectivity measured from electrophysiology data and category selectivity measured from fMRI. The data is unique as it contains a lot of single and multiunit recordings (245 units) from the human fusiform gyrus, which, as the authors point out, is a hominoid-specific gyrus.

      Weaknesses:

      My major concerns are two-fold:

      (i) There is a paucity of data; thus, more information (results and methods) is warranted. In particular, there is no comparison between the fMRI data and the SEEG data.

      We thank reviewer #3 for their positive evaluation of our paper. If the reviewer means paucity of data presentation, we agree and we provide more presentation below, although the methods and results information appear complete to us. The comparison between fMRI and SEEG is there, but can only be indirect (i.e., collected at different times and not related on a trial-by-trial basis for instance). In addition, our manuscript aims at providing a short empirical contribution to further our understanding of the relationship between neural responses and BOLD signal, not to provide a model of neurovascular coupling.

      (ii) One main claim of the paper is that there is evidence for suppressed responses to faces in the non-face selective region. That is, the reduction in activation to faces in the non-face selective region is interpreted as a suppression in the neural response and consequently the reduction in fMRI signal is interpreted as suppression. However, the SSVEP paradigm has no baseline (it alternates between faces and objects) and therefore it cannot distinguish between lower firing rate to faces vs suppression of response to faces.

      We understand the concern of the reviewer, but we respectfully disagree that our paradigm cannot distinguish between lower firing rate to faces vs. suppression of response to faces. Indeed, since the stimuli are presented periodically (6 Hz), we can objectively distinguish stimulus-related activity from spontaneous neuronal firing. The baseline corresponds to spikes that are non-periodic, i.e., unrelated to the (common face and object) stimulation. For a subset of neurons, even this non-periodic baseline activity is suppressed, above and beyond the suppression of the 6 Hz response illustrated in Figure 2. We mention it in the manuscript, but we agree that we do not present illustrations of such a decrease in the time domain for SU, which we did not initially consider necessary (please see below for such presentation).
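      The argument above rests on the fact that periodic stimulation concentrates stimulus-locked activity into narrow, known frequency bins, while spontaneous (non-periodic) firing spreads across the spectrum. The sketch below illustrates that idea on a generic binned time series. It is an assumption-laden illustration: the function names are hypothetical, and quantifying a response as a z-score against neighbouring frequency bins is a common convention in frequency-tagging work rather than a procedure confirmed by this response letter.

```python
import numpy as np

def amplitude_spectrum(x, fs):
    """Single-sided amplitude spectrum of a time series sampled at fs Hz."""
    n = x.size
    amp = np.abs(np.fft.rfft(x)) * 2.0 / n
    freqs = np.fft.rfftfreq(n, d=1.0 / fs)
    return freqs, amp

def zscore_at(freqs, amp, f0, n_neighbors=10, skip=1):
    """Z-score of the amplitude at f0 relative to surrounding bins.

    The `skip` bins immediately adjacent to f0 are excluded so that any
    spectral leakage from the target bin does not inflate the noise estimate.
    """
    idx = int(np.argmin(np.abs(freqs - f0)))
    left = amp[idx - skip - n_neighbors: idx - skip]
    right = amp[idx + skip + 1: idx + skip + 1 + n_neighbors]
    noise = np.concatenate([left, right])
    return (amp[idx] - noise.mean()) / noise.std()

# Synthetic example: a 70 s series with a general response at 6 Hz,
# a category-selective response at 1.2 Hz, and broadband noise.
fs, duration = 100, 70
t = np.arange(fs * duration) / fs
rng = np.random.default_rng(0)
series = (np.sin(2 * np.pi * 6.0 * t)
          + 0.5 * np.sin(2 * np.pi * 1.2 * t)
          + rng.normal(0.0, 1.0, t.size))
freqs, amp = amplitude_spectrum(series, fs)
z_general = zscore_at(freqs, amp, 6.0)
z_selective = zscore_at(freqs, amp, 1.2)
```

      Because a 70 s window contains a whole number of cycles at both 6 Hz and 1.2 Hz, each response falls in a single frequency bin, which is what makes the periodic and non-periodic components separable.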

      (1) Additional data: the paper has 2 figures: figure 1 which shows the experimental design and figure 2 which presents data, the latter shows one example neuron raster plot from each patient and group average neural data from each patient. In this reader's opinion this is insufficient data to support the conclusions of the paper. The paper will be more impactful if the researchers would report the data more comprehensively.

      We respond to more specific requests for additional evidence below, but the reviewer should be aware that this is a short report, which reaches the word limit. In our view, the group average neural data should be sufficient to support the conclusions, and the example neurons are there for illustration. And while we cannot provide the raster plots for a large number of neurons, the anonymized data is made available at:

      (a) There is no direct comparison between the fMRI data and the SEEG data, except for a comparison of the location of the electrodes relative to the statistical parametric map generated from a contrast (Fig 2a,d). It will be helpful to build a model linking between the neural responses to the voxel response in the same location - i.e., estimate from the electrophysiology data the fMRI data (e.g., Logothetis & Wandell, 2004).

      As mentioned above, the comparison between fMRI and SEEG is indirect (i.e., collected at different times and not related on a trial-by-trial basis for instance) and would not allow us to build such a model.

      (b) More comprehensive analyses of the SSVEP neural data: It will be helpful to show the results of the frequency analyses of the SSVEP data for all neurons to show that there are significant visual responses and significant face responses. It will be also useful to compare and quantify the magnitude of the face responses compared to the visual responses.

      The data has been analyzed comprehensively, but we would not be able to show all neurons with such significant visual responses and face-selective responses.

      (c) The neuron shown in E shows cyclical responses tied to the onset of the stimuli, is this the visual response?

      Correct, it’s the visual response at 6 Hz.

      If so, why is there an increase in the firing rate of the neuron before the face stimulus is shown in time 0?

      Because the stimulation is continuous. What is displayed at 0 is the onset of the face stimulus, with each face stimulus being preceded by 4 images of nonface objects.

      The neuron's data seems different from the average response across neurons; this raises a concern about interpreting the average response across neurons in panel F, which seems different from the single neuron responses.

      The reviewer is correct, and we apologize for the confusion. This is because the average data in panel F has been notch-filtered to remove the 6 Hz response and its harmonics, as indicated in the methods (p.11): ‘a FFT notch filter (filter width = 0.05 Hz) was then applied on the 70 s single or multi-units time-series to remove the general visual response at 6 Hz and two additional harmonics (i.e., 12 and 18 Hz)’.

      Here is the same data without the notch-filter (the 6Hz periodic response is clearly visible):

      Author response image 2.

      For the sake of clarity, we prefer to present the notch-filtered data in the paper, but the revised version makes it clear in the figure caption that the average data has been notch-filtered.

      (d) Related to (c) it would be useful to show raster plots of all neurons and quantify if the neural responses within a region are homogeneous or heterogeneous. This would add data relating the single neuron response to the population responses measured from fMRI. See also Nir 2009.

      We agree with the reviewer that this is interesting, but again we do not think that it is necessary for the point made in the present paper. Responses in these regions appear rather heterogeneous, and we are currently working on a longer paper with additional SEEG data (other patients tested for shorter sessions) to define and quantify the face-selective neurons in the MidFusiform gyrus with this approach (without relating it to the fMRI contrast as reported here).

      (e) When reporting group average data (e.g., Fig 2C,F) it is necessary to show standard deviation of the response across neurons.

      We agree with the reviewer and have modified Figure 2 accordingly in the revised manuscript.

      (f) Is it possible to estimate the latency of the neural responses to face and object images from the phase data? If so, this will add important information on the timing of neural responses in the human fusiform gyrus to face and object images.

      The fast periodic paradigm to measure neural face-selectivity has been used in tens of studies since its original reports:

      In this paradigm, the face-selective response spreads to several harmonics (1.2 Hz, 2.4 Hz, 3.6 Hz, etc.) (which are summed for quantifying the total face-selective amplitude). This is illustrated below by the averaged single units’ SNR spectra across all recording sessions for both participants.

      Author response image 3.

      There is no unique phase-value, each harmonic being associated with a phase-value, so that the timing cannot be unambiguously extracted from phase values. Instead, the onset latency is computed directly from the time-domain responses, which is more straightforward and reliable than using the phase. Note that the present paper is not about the specific time-courses of the different types of neurons, which would require a more comprehensive report, but which is not necessary to support the point made in the present paper about the SEEG-fMRI sign relationship.

      (g) Related to (e): In total, the authors recorded data from 245 units (some single units and some multiunits), and they found that in both the face-selective and non-face-selective regions most of the recorded neurons exhibited face-selectivity, which this reader found confusing. They write: ‘Among all visually responsive neurons, we found a very high proportion of face-selective neurons (p < 0.05) in both activated and deactivated MidFG regions (P1: 98.1%; N = 51/52; P2: 86.6%; N = 110/127)’. Is the face-selectivity in P1 an increase in response to faces and in P2 a reduction in response to faces, or is it an increase in response to faces in both?

      Face-selectivity is defined as a DIFFERENTIAL response to faces compared to objects, not necessarily a larger response to faces. So yes, face-selectivity in P1 is an increase in response to faces and in P2 a reduction in response to faces.

      Additional methods

      (a) it is unclear if the SSVEP analyses of neural responses were done on the spikes or the raw electrical signal. If the former, how is the SSVEP frequency analysis done on discrete data like action potentials?

      The FFT is applied directly on spike trains using Matlab’s discrete Fourier Transform function. This function is suitable to be applied to spike trains in the same way as to any sampled digital signal (here, the microwires signal was sampled at 30 kHz, see Methods).

      In complementary analyses, we also applied the FFT to spike trains that had been temporally smoothed by convolving them with a 20 ms square window (Le Cam et al., 2023, cited in the paper). This did not change the outcome of the frequency analyses in the frequency range of interest. We have also added one sentence about spike detection to the methods section (p.10).
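      The point that a Fourier analysis applies to a spike train like any other sampled signal can be illustrated with a minimal sketch. This is not the analysis code used in the paper (which relied on Matlab's FFT on signals sampled at 30 kHz): it is a pure-Python discrete Fourier sum evaluated at single frequencies, and the 1 kHz sampling rate, the 10 s duration, the perfectly periodic synthetic spike train, and the function names are all assumptions made for illustration.

```python
import cmath
import math


def periodic_spike_train(rate_hz, fs, duration_s):
    """Binary spike train (0/1 per sample) with one spike per stimulation cycle.

    Hypothetical helper: real spike trains are of course noisier.
    """
    n_samples = int(round(fs * duration_s))
    train = [0] * n_samples
    n_spikes = int(round(rate_hz * duration_s))
    for k in range(n_spikes):
        idx = int(round(k * fs / rate_hz))  # nearest sample to the k-th cycle onset
        if idx < n_samples:
            train[idx] = 1
    return train


def dft_amplitude(signal, fs, freq_hz):
    """Amplitude of the discrete Fourier component of `signal` at `freq_hz`."""
    acc = 0j
    for i, x in enumerate(signal):
        if x:  # zeros contribute nothing to the sum
            acc += x * cmath.exp(-2j * math.pi * freq_hz * i / fs)
    return abs(acc) / len(signal)


FS = 1000  # assumed sampling rate (Hz), chosen only to keep the example fast
train = periodic_spike_train(6.0, FS, 10.0)
amp_at_6hz = dft_amplitude(train, FS, 6.0)  # stimulation frequency
amp_at_5hz = dft_amplitude(train, FS, 5.0)  # neighbouring, non-stimulation frequency
```

      In this toy case the amplitude at 6 Hz is far larger than at 5 Hz, which is the kind of contrast between a stimulation frequency and neighbouring frequencies that a frequency-tagging analysis relies on.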

      (b) it is unclear why the onset time was shifted by 33ms; one can measure the phase of the response relative to the cycle onset and use that to estimate the delay between the onset of a stimulus and the onset of the response. Adding phase information will be useful.

      The onset time was shifted by 33 ms because the stimuli are presented with a sinewave contrast modulation (i.e., at 0 ms, the stimulus has 0% contrast). 100% contrast is reached at half a stimulation cycle, which is 83.33 ms here, but a response is likely triggered before 100% contrast is reached. To estimate the delay between the start of the sinewave (0% contrast) and the triggering of a neural response, we tested 7 SEEG participants with the same images presented in FPVS sequences either with a sinewave contrast modulation (black line) or with a squarewave (i.e., abrupt) contrast modulation (red line). The 33 ms value is based on the LFP data obtained in response to sinewave and squarewave stimulation in the same paradigm. This delay corresponds to 4 screen refresh frames (120 Hz refresh rate = 8.33 ms per frame) and 35% of the full contrast, as illustrated below (please see also Retter, T. L., & Rossion, B. (2016). Uncovering the neural magnitude and spatio-temporal dynamics of natural image categorization in a fast visual stream. Neuropsychologia, 91, 9–28).

      Author response image 4.
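      The arithmetic behind the 33 ms shift can be checked with a short sketch. It assumes that the sinewave contrast modulation follows a raised-cosine profile (0% at cycle onset, 100% at half a cycle); this functional form is an assumption that is consistent with the values quoted above, and the function and variable names are illustrative only.

```python
import math


def sine_contrast(t_ms, stim_rate_hz=6.0):
    """Contrast (0 to 1) at t_ms after cycle onset, assuming a raised-cosine profile."""
    phase = 2.0 * math.pi * stim_rate_hz * t_ms / 1000.0
    return 0.5 * (1.0 - math.cos(phase))


REFRESH_HZ = 120.0
frame_ms = 1000.0 / REFRESH_HZ   # duration of one screen refresh frame
shift_ms = 4 * frame_ms          # 4 frames, i.e. the onset shift
cycle_ms = 1000.0 / 6.0          # one stimulation cycle at 6 Hz
peak_ms = cycle_ms / 2.0         # time of full contrast

contrast_at_shift = sine_contrast(shift_ms)
print(f"frame = {frame_ms:.2f} ms, shift = {shift_ms:.2f} ms, "
      f"peak = {peak_ms:.2f} ms, contrast at shift = {contrast_at_shift:.1%}")
```

      Under this assumption, four frames correspond to one fifth of a stimulation cycle, at which point the contrast is about 35% of its maximum.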

      (2) Interpretation of suppression:

      The SSVEP paradigm alternates between 2 conditions, faces and objects, and has no baseline. In other words, responses to faces are measured relative to the baseline response to objects, so that any region that contains neurons that have a lower firing rate to faces than to objects is bound to show a lower response in the SSVEP signal. Therefore, because the experiment does not have a true baseline (e.g., a blank screen with no visual stimulation), this experimental design cannot distinguish between a lower firing rate to faces and suppression of the response to faces.

      The strongest evidence put forward for suppression is the response of non-visual neurons that was also reduced when patients looked at faces, but since these are non-visual neurons, it is unclear how to interpret the responses to faces.

      We understand this point, but how does the reviewer know that these are non-visual neurons? Because these neurons are located in the visual cortex, they are likely to be visual neurons that are not responsive to non-face objects. In any case, as the reviewer writes, we think it’s strong evidence for suppression.

      We thank all three reviewers for their positive evaluation of our paper and their constructive comments.

    1. Author response:

      The following is the authors’ response to the original reviews

      Reviewer #1 (Public review):

      Summary:

      Zhang et al. addressed the question of whether advantageous and disadvantageous inequality aversion can be vicariously learned and generalized. Using an adapted version of the ultimatum game (UG), in three phases, participants first gave their own preference (baseline phase), then interacted with a "teacher" to learn their preference (learning phase), and finally were tested again on their own (transfer phase). The key measure is whether participants exhibited similar choice preferences (i.e., rejection rate and fairness rating) influenced by the learning phase, by contrasting their transfer phase and baseline phase. Through a series of statistical modeling and computational modeling, the authors reported that both advantageous and disadvantageous inequality aversion can indeed be learned (Study 1), and even be generalised (Study 2).

      Strengths:

      This study is very interesting: it directly adapts the lab's previous work on the observational learning effect on disadvantageous inequality aversion to test both advantageous and disadvantageous inequality aversion. Social transmission of action, emotion, and attitude has started to be looked at recently, hence this research is timely. The use of computational modeling is mostly appropriate and motivated. Study 2, which examined the vicarious inequality aversion in conditions where feedback was never provided, is interesting and important to strengthen the reported effects. Both studies have proper justifications to determine the sample size.

      Weaknesses:

      Despite the strengths, a few conceptual aspects and analytical decisions have to be explained, justified, or clarified.

      INTRODUCTION/CONCEPTUALIZATION

      (1) Two terms seem to be interchangeable, which should not, in this work: vicarious/observational learning vs preference learning. For vicarious learning, individuals observe others' actions (and optionally also the corresponding consequence resulting directly from their own actions), whereas, for preference learning, individuals predict, or act on behalf of, the others' actions, and then receive feedback if that prediction is correct or not. For the current work, it seems that the experiment is more about preference learning and prediction, and less so about vicarious learning. The intro and set are heavily around vicarious learning, and later the use of vicarious learning and preference learning is rather mixed in the text. I think either tone down the focus on vicarious learning, or discuss how they are different. Some of the references here may be helpful: (Charpentier et al., Neuron, 2020; Olsson et al., Nature Reviews Neuroscience, 2020; Zhang & Glascher, Science Advances, 2020)

      We are appreciative of the Reviewer for raising this question and providing the references. In response to this comment we have elected to avoid, in most cases, use of the term ‘vicarious’ and instead focus the paper on learning of others’ preferences (without specific commitment to vicarious/observational learning per se). These changes are reflected throughout all sections of the revised manuscript, and in the revised title. We believe this simplified terminology has improved the clarity of our contribution.

      EXPERIMENTAL DESIGN

      (2) For each offer type, the experiment "added a uniformly distributed noise in the range of (-10 ,10)". I wonder what this looks like? With only integers such as 25:75, or even with decimal points? More importantly, is it possible to have either 70:30 or 90:10 option, after adding the noise, to have generated an 80:20 split shown to the participants? If so, for the analyses later, when participants saw the 80:20 split, which condition did this trial belong to? 70:30 or 90:10? And is such noise added only to the learning phase, or also to the baseline/transfer phases? This requires some clarification.

      We thank the Reviewer for pointing this out. The uniformly distributed noise was added to all three phases to make the proposers’ behavior more realistic. This added noise was rounded to integer numbers, constrained from -9 to 9, which means in both 70:30 and 90:10 offer types, an 80:20 split could not occur. We have made this feature of our design clear in the Method section Line 524 ~ 528:

      “In all task phases, we added uniformly distributed noise to each trial’s offer (ranging from -9 to 9, inclusive, rounding to the nearest integer) such that the random amount added (or subtracted) from the Proposer’s share was subtracted (or added) to the Receiver’s share. We adopted this manipulation to make the proposers’ behavior appear more realistic. The orders of offers participants experienced were fully randomized within each experiment phase.”
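      The noise manipulation described in this passage can be sketched as follows. This is an illustrative reconstruction rather than the task code: it assumes that the noise is drawn directly as a uniformly distributed integer between -9 and 9 (the Methods text could equally describe a rounded continuous draw), and the function name and the fixed seed are hypothetical.

```python
import random


def noisy_offer(proposer_share, rng):
    """Return (proposer, receiver) shares after adding integer noise in [-9, 9].

    Whatever is added to the Proposer's share is removed from the Receiver's
    share, so the two shares always sum to 100.
    """
    noise = rng.randint(-9, 9)  # inclusive on both ends
    proposer = proposer_share + noise
    return proposer, 100 - proposer


rng = random.Random(0)  # fixed seed only to make the sketch reproducible
offers_70 = [noisy_offer(70, rng) for _ in range(1000)]
offers_90 = [noisy_offer(90, rng) for _ in range(1000)]
```

      With noise bounded at ±9, a 70:30 offer type can only yield Proposer shares from 61 to 79 and a 90:10 offer type from 81 to 99, so an 80:20 split cannot arise from either.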

      (3) For the offer conditions (90:10, 70:30, 50:50, 30:70, 10:90) - are they randomized? If so, how is it done? Is it randomized within each participant, and/or also across participants (such that each participant experienced different trial sequences)? This is important, as the order especially for the learning phase can largely impact the preference learning of the participants.

      We agree with the Reviewer that the order in which offers are experienced could be very important. The order of the conditions was randomized independently for each participant (i.e., each participant experienced different trial sequences). We have made this point clear in the Methods (Line 527 ~ 528):

      “The orders of offers participants experienced were fully randomized within each experiment phase.”

      STATISTICAL ANALYSIS & COMPUTATIONAL MODELING

      (4) In Study 1 DI offer types (90:10, 70:30), the rejection rate for DI-AI averse looks consistently higher than that for DI averse (ie, the blue line is above the yellow line). Is this significant? If so, how come? Since this is a between-subject design, I would not anticipate such a result (especially for the baseline). Also, for the LME results (eg, Table S3), only interactions were reported but not the main results.

      We thank the Reviewer for pointing out this feature of the results. Prompted by this comment, we compared the baseline rejection rates between the two conditions for these two offer types, finding in Experiment 1 that rejection rates in the DI-AI-averse condition were significantly higher than in the DI-averse condition (DI-AI-averse vs. DI-averse; Offer 90:10, β = 0.13, p < 0.001, Offer 70:30, β = 0.09, p < 0.034). We agree with the Reviewer that there should, in principle, be no difference, because the experiences of participants in these two conditions are identical in the Baseline phase. However, we did not observe this difference in baseline preferences in Experiment 2 (DI-AI-averse vs. DI-averse; Offer 90:10, β = 0.07, p < 0.100, Offer 70:30, β = 0.05, p < 0.193). On the basis of the inconsistency of this effect across studies, we believe this is a spurious difference in preferences stemming from chance.

      Regarding the LME results, the reason why only interaction terms are reported is due to the specification of the model and the rationale for testing.

      Taking the model reported in Table S3 as an example (a logistic model which examines Baseline phase rejection rates as a function of offer level and condition), the between-subject conditions (DI-averse and DI-AI-averse) are represented by dummy-coded variables. Similarly, offer types were also dummy-coded, such that each of the five columns (90:10, 70:30, 50:50, 30:70, and 10:90) corresponded to a particular offer type. This model specification yields ten interaction terms (i.e., fixed effects) of interest; for example, the “DI-averse × Offer 90:10” term indicates baseline rejection rates for 90:10 offers in the DI-averse condition. Thus, to compare rejection rates across specific offer types, we estimate and report linear contrasts between these resultant terms. We have clarified the nature of these reported tests in our revised Results, for example, line 189-190: “(linear contrasts; e.g. 90:10 vs 10:90, all Ps<0.001, see Table S3 for logistic regression coefficients for rejection rates).”

      Also in response to this comment and a recommendation from Reviewer 2 (see below), we have revised our supplementary materials to make each model specification clearer, as in SI line 25:

      “RejectionRate ~ 0 + (Disl + Advl):(Offer10 + Offer30 + Offer50 + Offer70 + Offer90) + (1|Subject)”

      (5) I do not particularly find this analysis appealing: "we examined whether participants' changes in rejection rates between Transfer and Baseline, could be explained by the degree to which they vicariously learned, defined as the change in punishment rates between the first and last 5 trials of the Learning phase." Naturally, the participants' behavior in the first 5 trials in the learning phase will be similar to those in the baseline; and their behavior in the last 5 trials in the learning phase would echo those at the transfer phase. I think it would be stronger to link the preference learning results to the change between the baseline and transfer phase, eg, by looking at the difference between alpha (beta) at the end of the learning phase and the initial alpha (beta).

      Thanks for pointing this out. Also, considering the comments from Reviewer 2 concerning the interpretation of this analysis, we have elected to remove this result from our revision.

      (6) I wonder if data from the baseline and transfer phases can also be modeled, using a simple Fehr-Schmidt model. This way, the change in alpha/beta can also be examined between the baseline and transfer phase.

      We agree with the Reviewer that a simplified F-S model could be used, in principle, to characterize Baseline and Transfer phase behavior, but it is our view that the rejection rates provide readers with the clearest (and simplest) picture of how participants are responding to inequity. Put another way, we believe that the added complexity of using (and explaining) a new model to characterize simple, steady-state choice behavior (within these phases) would not be justified or add appreciable insights about participants’ behavior.

      (7) I quite liked Study 2 which tests the generalization effect, and I expected to see an adapted computational modeling to directly reflect this idea. Indeed, the authors wrote, "[...] given that this model [...] assumes the sort of generalization of preferences between offer types [...]". But where exactly did the preference learning model assume the generalization? In the methods, the modeling seems to be only about Study 1; did the authors advise their model to accommodate Study 2? The authors also ran simulation for the learning phase in Study 2 (Figure 6), and how did the preference update (if at all) for offers (90:10 and 10:90) where feedback was not given? Extending/Unpacking the computational modeling results for Study 2 will be very helpful for the paper.

      We are appreciative of the Reviewer’s positive impression of Experiment 2. Upon reflection, we realize that our original submission was not clear about the modeling done in Experiment 2, and we should clarify here that we did also fit the Preference Inference model to this dataset. As in Experiment 1, this model assumes that the participants represent the Teacher’s preference as a Fehr-Schmidt form utility function and infer the Teacher’s Envy and Guilt parameters through learning. The model indicates that, on the basis of experience with the Teacher’s preferences on moderately unfair offers (i.e., offer 70:30 and offer 30:70), participants can successfully infer these two parameters and, in turn, compute the Fehr-Schmidt utility to guide their decisions on the extremely unfair offers (i.e., offer 90:10 and offer 10:90).

      In response to this comment, we have made this clearer in our Results (Line 377-382):

      “Finally, following Experiment 1, we fit a series of computational models of Learning phase choice behavior, comparing the goodness-of-fit of the four best-fitting models from Experiment 1 (see Methods). As before, we found that the Preference Inference model provided the best fit of participants’ Learning Phase behavior (Figure S1a, Table S12). Given that this model is able to infer the Teacher’s underlying inequity-averse preferences (rather than learns offer-specific rejection preferences), it is unsurprising that this model best describes the generalization behavior observed in Experiment 2.”

      and in our revised Methods (Line 551-553)

      “We considered 6 computational models of Learning Phase choice behavior, which we fit to individual participants’ observed sequences of choices, in both Experiments 1 and 2, via Maximum Likelihood Estimation”
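      To illustrate why a model that infers Envy and Guilt parameters generalizes across offer types, the following minimal sketch computes a Fehr-Schmidt form utility for the Receiver. It is not the Preference Inference model itself: the deterministic "reject when utility is below zero" rule, the parameter values, and the function names are assumptions chosen for illustration, and the values are not estimates from the data.

```python
def fehr_schmidt_utility(own, other, envy, guilt):
    """Fehr-Schmidt form utility of an allocation for the agent receiving `own`.

    `envy` weights disadvantageous inequity (other > own);
    `guilt` weights advantageous inequity (own > other).
    """
    disadvantage = max(other - own, 0)
    advantage = max(own - other, 0)
    return own - envy * disadvantage - guilt * advantage


def would_reject(proposer_share, envy, guilt):
    """Assumed decision rule: reject when accepting is worse than the 0 payoff of rejecting."""
    receiver_share = 100 - proposer_share
    return fehr_schmidt_utility(receiver_share, proposer_share, envy, guilt) < 0


# Illustrative parameters only (hypothetical, not fitted values).
ENVY, GUILT = 1.0, 2.0

# Parameters that produce rejection of the moderately unfair offers...
moderate = {p: would_reject(p, ENVY, GUILT) for p in (70, 30)}
# ...also determine the response to extreme offers with no extra learning.
extreme = {p: would_reject(p, ENVY, GUILT) for p in (90, 10)}
# A Teacher with no aversion to advantageous inequity accepts 10:90.
no_guilt_extreme = would_reject(10, ENVY, 0.0)
```

      Because the same two parameters enter the utility of every offer, learning them from 70:30 and 30:70 offers fixes the predicted response to 90:10 and 10:90 offers, which is the sense in which such a model generalizes without offer-specific feedback.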

      Reviewer #2 (Public review):

      Summary:

      This study investigates whether individuals can learn to adopt egalitarian norms that incur a personal monetary cost, such as rejecting offers that benefit them more than the giver (advantageous inequitable offers). While these behaviors are uncommon, two experiments demonstrate that individuals can learn to reject such offers through vicarious learning - by observing and acting in line with a "teacher" who follows these norms. The authors use computational modelling to argue that learners adopt these norms through a sophisticated process, inferring the latent structure of the teacher's preferences, akin to theory of mind.

      Strengths:

      This paper is well-written and tackles a critical topic relevant to social norms, morality, and justice. The findings, which show that individuals can adopt just and fair norms even at a personal cost, are promising. The study is well-situated in the literature, with clever experimental design and a computational approach that may offer insights into latent cognitive processes. Findings have potential implications for policymakers.

      Weaknesses:

      Note: in the text below, the "teacher" will refer to the agent from which a participant presumably receives feedback during the learning phase.

      (1) Focus on Disadvantageous Inequity (DI): A significant portion of the paper focuses on responses to Disadvantageous Inequitable (DI) offers, which is confusing given the study's primary aim is to examine learning in response to Advantageous Inequitable (AI) offers. The inclusion of DI offers is not well-justified and distracts from the main focus. Furthermore, the experimental design seems, in principle, inadequate to test for the learning effects of DI offers. Because both teaching regimes considered were identical for DI offers the paradigm lacks a control condition to test for learning effects related to these offers. I can't see how an increase in rejection of DI offers (e.g., between baseline and generalization) can be interpreted as speaking to learning. There are various other potential reasons for an increase in rejection of DI offers even if individuals learn nothing from learning (e.g. if envy builds up during the experiment as one encounters more instances of disadvantageous fairness).

      We are appreciative of the Reviewer’s insight here and for the opportunity to clarify our experimental logic. We included DI offers in order to 1) expose participants to the full spectrum of offer types, and avoid focusing participants exclusively upon AI offers, which might result in a demand characteristic, and 2) afford exploration of how learning dynamics might differ in DI contexts (which was, to some extent, examined in our previous study; FeldmanHall, Otto, & Phelps, 2018) versus AI contexts. Furthermore, as this work builds critically on our previous study, we reasoned that replicating these original findings (in the DI context) would be important for demonstrating the generality of the learning effects in the DI context across experimental settings. We now remark on this point in our revised Introduction Line 129 ~132:

      “In addition, to mechanistically probe how punitive preferences are acquired in Adv-I and Dis-I contexts—in turn, assessing the replicability of our earlier study investigating punitive preference acquisition in the Dis context—we also characterize trial-by-trial acquisition of punitive behavior with computational models of choice.”

      (2) Statistical Analysis: The analysis of the learning effects of AI offers is not fully convincing. The authors analyse changes in rejection rates within each learning condition rather than directly comparing the two. Finding a significant effect in one condition but not the other does not demonstrate that the learning regime is driving the effect. A direct comparison between conditions is necessary for establishing that there is a causal role for the learning regime.

      We agree with the Reviewer and, upon reflection, believe that direct comparisons between conditions would be helpful to support the claim that the different learning conditions are responsible for the observed learning effects. In brief, these specific tests buttress the idea that exposure to AI-averse preferences results in increases in AI punishment rates in the Transfer phase (over and above the rates observed for participants who were only exposed to DI-averse preferences).

      Accordingly, our revision now reports statistics concerning the differences between conditions for AI offers in Experiment 1 (Line 198~ 207):

      “Importantly, when comparing these changes between the two learning conditions, we observed significant differences in rejection rates for Adv-I offers: compared to exposure to a Teacher who rejected only Dis-I offers, participants exposed to a Teacher who rejected both Dis-I and Adv-I offers were more likely to reject Adv-I offers and rated these offers as more unfair. This difference between conditions was evident in both 30:70 offers (Rejection rates: β(SE) = 0.10(0.04), p = 0.013; Fairness ratings: β(SE) = -0.86(0.17), p < 0.001) and 10:90 offers (Rejection rates: β(SE) = 0.15(0.04), p < 0.001, Fairness ratings: β(SE) = -1.04(0.17), p < 0.001). As a control, we also compared rejection rates and fairness rating changes between conditions in Dis-I offers (90:10 and 70:30) and Fair offers (i.e., 50:50) but observed no significant difference (all ps > 0.217), suggesting that observing an Adv-I-averse Teacher’s preferences did not influence participants’ behavior in response to Dis-I offers.”

      Line 222 ~ 230:

      “A mixed-effects logistic regression revealed a significantly larger (positive) effect of trial number on rejection rates of Adv-I offers for the Adv-Dis-I-Averse condition compared to the Dis-I-Averse condition. This relative rejection rate increase was evident both in 30:70 offers (Table S7; β(SE) = -0.77(0.24), p < 0.001) and in 10:90 offers (β(SE) = -1.10(0.33), p < 0.001). In contrast, comparing Dis-I and Fairness offers when the Teacher showed the same tendency to reject, we found no significant difference between the two conditions (90:10 splits: β(SE) = -0.48(0.21), p = 0.593; 70:30 splits: β(SE) = -0.01(0.14), p = 0.150; 50:50 splits: β(SE) = -0.00(0.21), p = 0.086). In other words, participants by and large appeared to adjust their rejection choices in accordance with the Teacher’s feedback in an incremental fashion.”

      And in Experiment 2 Line 333 ~ 345:

      “Similar to what we observed in Experiment 1 (Figure 4a), compared to the participants in the Dis-I-Averse Condition, participants in the Adv-I-Averse Condition increased their rates of rejection of extreme Adv-I offers (i.e., 10:90) in the Transfer Phase, relative to the Baseline phase (β(SE) = -0.12(0.04), p < 0.004; Table S9), suggesting that participants’ learned (and adopted) Adv-I-averse preferences generalized from one specific offer type (30:70) to an offer type for which they received no Teacher feedback (10:90). Examining extreme Dis-I offers, where the Teacher exhibited identical preferences across the two learning conditions, we found no difference in the changes of rejection rates from Baseline to Transfer phase between conditions (β(SE) = -0.05(0.04), p < 0.259). Mirroring the observed rejection rates (Figure 4b), participants’ fairness ratings for extreme Adv-I offers increased more from the Baseline to Transfer phase in the Adv-Dis-I-Averse Condition than in the Dis-I-Averse Condition (β(SE) = -0.97(0.18), p < 0.001), but, importantly, changes in fairness ratings for extreme Dis-I offers did not differ significantly between learning conditions (β(SE) = -0.06(0.18), p < 0.723).”

      Line 361 ~ 368:

      “Examining the time course of rejection rates in Adv-I contexts during the Learning phase (Figure 5) revealed that participants learned over time to punish mildly unfair 30:70 offers, and these punishment preferences generalized to more extreme offers (10:90). Specifically, compared to the Dis-I-Averse Condition, in the Adv-Dis-I-Averse condition we observed a significantly larger increasing trend in rejection rates for 10:90 (Adv-I) offers (Figure 5, β(SE) = -0.81(0.26), p < 0.002, mixed-effects logistic regression, see Table S10). Again, when comparing the rejection rate increase in the extreme Dis-I offers (90:10), we did not find a significant difference between conditions (β(SE) = -0.25(0.19), p < 0.707).”

      (3) Correlation Between Learning and Contagion Effects:

      The authors argue that correlations between learning effects (changes in rejection rates during the learning phase) and contagion effects (changes between the generalization and baseline phases) support the idea that individuals who are better at aligning their preferences with the teacher also give more consideration to the teacher's preferences later, during the generalization phase. This interpretation is not convincing. Such correlations could emerge even in the absence of learning, driven by temporal trends like increasing guilt or envy (or even by slow temporal fluctuations in these processes) on behalf of self or others. The reason is that the baseline phase is temporally closer to the beginning of the learning phase whereas the generalization phase is temporally closer to the end of the learning phase. Additionally, the interpretation of these effects seems flawed, as changes in rejection rates do not necessarily indicate closer alignment with the teacher's preferences. For example, if the teacher rejects an offer 75% of the time then a positive 5% learning effect may imply better matching the teacher if it reflects an increase in rejection rate from 65% to 70%, but it implies divergence from the teacher if it reflects an increase from 85% to 90%. For similar reasons, it is not clear that the contagion effects reflect how much a teacher's preferences are taken into account during generalization.

      This comment is very similar to a previous comment made by Reviewer 1, who also called into question the interpretability of these correlations. In response to both of these comments we have elected to remove these analyses from our revision.

      (4) Modeling Efforts: The modelling approach is underdeveloped. The identification of the "best model" lacks transparency, as no model-recovery results are provided, and fits for the losing models are not shown, leaving readers in the dark about where these models fail. Moreover, the reinforcement learning (RL) models used are overly simplistic, treating actions as independent when they are likely inversely related (for example, the feedback that the teacher would have rejected an offer provides feedback that rejection is "correct" but also that acceptance is "an error", and the latter is not incorporated into the modelling). It is unclear if and to what extent this limits current RL formulations. There are also potentially important missing details about the models. Can the authors justify/explain the reasoning behind including the variants they consider? What are the initial Q-values? If these are not free parameters what are their values?

      We are appreciative of the Reviewer for identifying these potentially unaddressed questions.

      The RL models we consider in the present study are naïve models which, in our previous study (FeldmanHall, Otto, & Phelps, 2018), we found to capture important aspects of learning. While simplistic, we believed these models serve as a reasonable baseline for evaluating more complex models, such as the Preference Inference model. We have made this point more explicit in our revised Introduction, Line 129 ~ 132:

      “In addition, to mechanistically probe how punitive preferences may be acquired in Adv-I and Dis-I contexts—in turn, assessing the replicability of our earlier study investigating punitive preference acquisition in the Dis-I context—we also characterize trial-by-trial acquisition of punitive behavior with computational models of choice.”

      Again, following from our previous modeling of observational learning (FeldmanHall et al., 2018), we believe that the feedback the Teacher provides here is ideally suited to the RL formalism. In particular, when the teacher indicates that the participant’s choice is what they would have preferred, the model receives a reward of ‘1’ (e.g., the participant rejects and the Teacher indicates they would have preferred rejection, resulting in a positive prediction error); otherwise, the model receives a reward of ‘0’ (e.g., the participant accepts and the Teacher indicates they would have preferred rejection, resulting in a negative prediction error), indicating that the participant did not choose in accordance with the Teacher’s preferences. Through an error-driven learning process, these models provide a naïve way of learning to act in accordance with the Teacher’s preferences.

      Regarding the requested model details: When treating the initial values as free parameters (model 5), we set Q(reject, offertype) as free values in [0,1] and Q(accept,offertype) as 0.5. This setting can capture participants' initial tendency to reject or accept offers from this offer type. When the initial values are fixed, for all offer types we set Q(reject, offertype) = Q(accept,offertype) = 0.5. In practice, when the initial values are fixed, setting them to 0.5 or 0 doesn’t make much difference. We have clarified these points in our revised Methods, Line 275 ~ 576:

      “We kept the initial values fixed in this model, that is Q<sub>0</sub>(reject,offertype) =0.5, (offertype ∈ 90:10, 70:30, 50:50, 30:70, 10:90)”

      And Line 582 ~ 584:

      “Formally, this model treats Q<sub>0</sub>(reject,offertype) =0.5, (offertype ∈ 90:10, 70:30, 50:50, 30:70, 10:90) as free parameters with values between 0 and 1.”
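
      As a rough illustration of the naïve RL scheme described in this response, the following Python sketch applies an error-driven update with a reward of 1 when the participant's choice matches the Teacher's indicated preference and 0 otherwise, starting from the initial values stated above. The function names, the dictionary layout, and the use of an inverse-temperature parameter in the choice rule are assumptions made for illustration and are not taken from the manuscript.

```python
import math

OFFER_TYPES = ["90:10", "70:30", "50:50", "30:70", "10:90"]


def init_q(initial_reject=None):
    """Build Q[offer][action].

    With initial_reject=None, all values are fixed at 0.5.
    Otherwise initial_reject maps each offer type to a value in [0, 1]
    (the free-initial-value variant), while Q(accept) stays at 0.5.
    """
    q = {}
    for offer in OFFER_TYPES:
        reject0 = 0.5 if initial_reject is None else initial_reject[offer]
        q[offer] = {"reject": reject0, "accept": 0.5}
    return q


def p_reject(q, offer, inv_temp):
    """Two-action softmax probability of rejecting (inv_temp is hypothetical)."""
    diff = q[offer]["reject"] - q[offer]["accept"]
    return 1.0 / (1.0 + math.exp(-inv_temp * diff))


def update(q, offer, action, teacher_prefers, lr):
    """Error-driven update of the chosen action only; returns the prediction error."""
    reward = 1.0 if action == teacher_prefers else 0.0
    prediction_error = reward - q[offer][action]
    q[offer][action] += lr * prediction_error
    return prediction_error
```

      Because only the chosen action for the experienced offer type is updated, learning in this sketch does not carry over to other offer types, which is the property the Reviewer's points (4) and (5) turn on.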

      (5) Conceptual Leap in Modeling Interpretation: The distinction between simple RL models and preference-inference models seems to hinge on the ability to generalize learning from one offer to another. Whereas in the RL models learning occurs independently for each offer (hence no cross-offer generalization), preference inference allows for generalization between different offers. However, the paper does not explore RL models that allow generalization based on the similarity of features of the offers (e.g., payment for the receiver, payment for the offer-giver, who benefits more). Such models are more parsimonious and could explain the results without invoking a theory of mind or any modelling of the teacher. In such model versions, a learner learns a functional form that allows it to predict the teacher's feedback based on said offer features (e.g., linear or quadratic form). Because feedback for an offer modulates the parameters of this function (feature weights), generalization occurs without necessarily evoking any sophisticated model of the other person. This leaves open the possibility that RL models could perform just as well or even show superiority over the preference learning model, casting doubt on the authors' conclusions. Of note: even the behaviourists knew that as Little Albert was taught to fear rats, this fear generalized to rabbits. This could occur simply because rabbits are somewhat similar to rats. But this doesn't mean Little Albert had a sophisticated model of animals he used to infer how they behave.

      We are appreciative of the Reviewer for their suggestion of an alternative explanation for the observed generalization effects. Our understanding of the suggestion, put simply, is that an RL model could capture the observed generalization effects if the model were to learn and update a functional form of the Teacher’s rejection preferences using an RL-like algorithm. This idea is conceptually similar to our account of preference learning, whereby the learner has a representation of the teacher’s preferences. In our experiment the offers lie in the range [0, 100], and the crux of this idea is why the participants should adopt a functional form (either v-shaped or quadratic) with its minimum at 50. This is important because, at the beginning of the learning phase, the rejection rates are already v-shaped with a minimum at 50; the participants do not need to adjust the minimum of this functional form. Thus, if we assume that the participants represent the teacher’s rejection rate as a v-shaped function with a minimum at [50,50], then this very likely implies that the participants have a representation that the teacher has a preference for fairness. Overall, we agree that with a suitable setup of the functional form, one could implement an RL model to capture the generalization effects without presupposing an internal “model” of the teacher’s preferences.

      However, there is another way of modeling the generalization effect: truly “model-free” similarity-based reinforcement learning. In this approach, we do not assume any particular functional form of the teacher’s preferences, but rather assume that experience acquired in one offer type can be generalized to offers that are close (i.e., similar) to the original offer. Accordingly, we implemented this idea using a simple RL model in which the action values for each offer type are updated by a learning rate that is scaled by the distance between that offer and the experienced offer (i.e., the offer that generated the prediction error). This learning rate is governed by a Gaussian distribution, similar to the case in Gaussian process regression (cf. Schulz, Speekenbrink, & Krause, 2018). The initial value of the ‘Reject’ action, for each offer, is set to a free parameter between 0 and 1, and the initial value for the ‘Accept’ action is set to 0.5. The results show that even though this model exhibits the trend of increasing rejection rates observed in the AI-DI punish condition, the initial preferences (i.e., starting point of learning) diverge markedly from the Learning phase behavior we observed in Experiment 1:

      Author response image 1.

      This demonstrated that the participant at least maintains a representation of the teacher’s preference at the beginning. That is, they have prior knowledge about the shape of this preference. We incorporated this property into the model, that is, we considered a new model that assumes v-shaped starting values for rejection with two parameters, alpha and beta, governing the slope of this v-shaped function (this starting value actually mimics the shape of the preference functions of the Fehr-Schmidt model). We found that this new model (which we term the “Model RL Sim Vstart”) provided a satisfactory qualitative fit of the Transfer phase learning curves in Experiment 1 (see below).

      Author response image 2.

      However, we didn’t adopt this model as the best model for the following reasons. First, this model yielded a larger AIC value (indicating worse quantitative fit) compared to our Preference Inference model in both Experiments 1 and 2, likely owing to its increased complexity (5 free parameters versus 4 in the Preference Inference model). Second, we believe that inclusion of this model in our revised submission would be more distracting than helpful on account of the added complexity of explaining and justifying these assumptions, together with its comparatively poor goodness of fit (relative to the Preference Inference model).
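
      The similarity-based update described in this response can be sketched as follows. This is a minimal illustration under stated assumptions rather than the fitted model: offers are coded numerically by the first number of each split, the Gaussian scaling uses a hypothetical `width` parameter, and the prediction error computed at the experienced offer is applied to every offer type (the response does not specify whether the error is shared or recomputed per offer).

```python
import math

# Hypothetical numeric coding of the five offer types (90:10 ... 10:90).
OFFERS = [90, 70, 50, 30, 10]


def gaussian_weight(distance, width):
    """Unnormalised Gaussian similarity: 1 at distance 0, decaying with distance."""
    return math.exp(-(distance ** 2) / (2.0 * width ** 2))


def similarity_update(q, experienced, action, reward, lr, width):
    """Spread one prediction error across offer types by similarity.

    q maps offer -> {"reject": value, "accept": value}.
    The effective learning rate for each offer is lr * gaussian_weight(...).
    """
    prediction_error = reward - q[experienced][action]
    for offer in OFFERS:
        weight = gaussian_weight(abs(offer - experienced), width)
        q[offer][action] += lr * weight * prediction_error
    return prediction_error
```

      With a small `width` the scheme approaches the independent-offer RL models, while a large `width` lets feedback on a 30:70 offer raise the rejection value of 10:90 offers as well.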

      (6) Limitations of the Preference-Inference Model: The preference-inference model struggles to capture key aspects of the data, such as the increase in rejection rates for 70:30 DI offers during the learning phase (e.g. Figure 3A, AI+DI blue group). This is puzzling.

      Thinking about this I realized the model makes quite strong unintuitive predictions that are not examined. For example, if a subject begins the learning phase rejecting the 70:30 offer more than 50% of the time (meaning the starting guilt parameter is higher than 1.5), then over the course of learning the tendency to reject will decrease to below 50% (the guilt parameter will be pulled down below 1.5). This is despite the fact the teacher rejects 75% of the offers. In other words, as learning continues learners will diverge from the teacher. On the other hand, if a participant begins learning with a tendency to accept this offer (guilt < 1.5) then during learning they can increase their rejection rate but never above 50%. Thus one can never fully converge on the teacher. I think this relates to the model's failure in accounting for the pattern mentioned above. I wonder if individuals actually abide by these strict predictions. In any case, these issues raise questions about the validity of the model as a representation of how individuals learn to align with a teacher's preferences (given that the model doesn't really allow for such an alignment).

      In response to this comment we explain our efforts to build a new model that might be able to conceptually resolve the issue identified by the Reviewer.

      The key intuition guiding the Preference inference model is a Bayesian account of learning which we aimed to further simplify. In this setting, a Bayesian learner maintains a representation of the teacher’s inequity aversion parameters and updates it according to the teacher’s (observed) behavior. Intuitively, the posterior distribution shifts to the likelihood of the teacher’s action. On this view, when the teacher rejects, for instance, an AI offer, the learner should assign a higher probability to larger values of the Guilt parameter, and in turn the learner should change their posterior estimate to better capture the teacher’s preferences.

      In the current study, we simplified this idea, implementing this sort of learning using incremental “delta rule” updating (e.g. Equation 8 of the main text). Then the key question is to define the “teaching signal”. Assuming that the teacher rejects an offer 70:30, based on Bayesian reasoning, the teacher’s envy parameter (α) is more likely to exceed 1.5 (computed as 30/(50-30), per equation 7) than to be smaller than 1.5. Thus, 1.5, which is then used in equation 8 to update α, can be thought of as a teaching signal. We simply assumed that if the initial estimate is already greater than 1.5, which means the prior is consistent with the likelihood, no updating would occur. This assumption raises the question of how to set the learning rate range. In principle, an envy parameter that is larger than 1.5 should be the target of learning (i.e., the teaching signal), and thus our model definition allows the learning rate to be greater than 1, incorporating this possibility.

      Our simplified preference inference model has already successfully captured some key aspects of the participants’ learning behavior. However, it may fail in the following case: assume that the participant has an initial estimate of 1.51 for the envy parameter (α). Let’s say this corresponds to a rejection rate of 60%. Then, no matter how many times the teacher rejects the 70:30 offer, the participant’s estimate of the envy parameter remains the same, but observing only one offer acceptance would decrease this estimate and, in turn, the model’s predicted rejection rate. We believe this is the anomalous behavior in 70:30 offers identified by the Reviewer, where the model does not appear able to recreate participants’ behavior.

      This issue touches the core of our model specification, that is, the choice of the teaching signal. As we chose 1.5 as the teaching signal (i.e., the bound on α implied whenever the teacher rejects or accepts a 70:30 offer), a very small deviation from 1.5 causes one direction of updating to fail. One way to mitigate this problem would be to choose a lower bound for α greater than 1.5, such that when the Teacher rejects a 70:30 offer, we assign a number greater than 1.5 (by ‘hard-coding’ this into the model via modification of equation 7). One sensible candidate value could be the midpoint between 1.5 and 10 (the maximum value of α per our model definition). Intuitively, the model under this setting could still pull up the value of α when it starts at 1.51 and the teacher rejects 70:30, thus alleviating (but not completely eliminating) the anomaly.

      We fitted this modified Preference Inference model to the data from Experiment 1 (see Author response image 3 below) and found that even though this model has a smaller AIC (and thus better quantitative fit than the original Preference Inference model), it still doesn’t fully capture the participants’ behavior for 70:30 offers.

      Author response image 3.

      Accordingly, rather than revising our model to include an unprincipled ‘kludge’ to account for this minor anomaly in the model behavior, we have opted to report our original model in our revision as we still believe it parsimoniously captures our intuitions about preference learning and provides a better fit to the observed behavior than the other RL models considered in the present study.
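
      The one-sided delta-rule behavior discussed in this response, including the anomaly at an initial estimate of 1.51, can be reproduced with a short sketch. It is a simplified reading of the verbal description and not the authors' equations 7 and 8: the function name, the `bound` and `reject_target` arguments, and the clamping to [0, 10] are assumptions introduced here for illustration.

```python
ALPHA_MAX = 10.0  # upper bound on the envy parameter stated in the response


def update_envy(alpha, teacher_rejects, lr, bound=1.5, reject_target=None):
    """One delta-rule step on the learner's estimate of the Teacher's envy parameter.

    bound: value implied by a 70:30 offer (1.5 in the response).
    reject_target: teaching signal used after a rejection. Defaults to `bound`
        (original model); a larger value such as the midpoint between `bound`
        and ALPHA_MAX corresponds to the modification discussed above.
    """
    target = bound if reject_target is None else reject_target
    if teacher_rejects and alpha < target:
        alpha += lr * (target - alpha)   # rejection: pull the estimate up
    elif not teacher_rejects and alpha > bound:
        alpha += lr * (bound - alpha)    # acceptance: pull the estimate down
    return min(max(alpha, 0.0), ALPHA_MAX)


# Original model: ten rejections leave an estimate of 1.51 untouched ...
stuck = 1.51
for _ in range(10):
    stuck = update_envy(stuck, teacher_rejects=True, lr=0.5)

# ... whereas a single acceptance lowers it.
dropped = update_envy(stuck, teacher_rejects=False, lr=0.5)

# Modified model: the midpoint target lets a rejection raise the estimate.
midpoint = (1.5 + ALPHA_MAX) / 2.0
raised = update_envy(1.51, teacher_rejects=True, lr=0.5, reject_target=midpoint)
```

      In this sketch the modified target removes the asymmetry only for estimates below the midpoint, which is consistent with the response's remark that the change alleviates but does not eliminate the anomaly.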

      Reviewer #1 (Recommendations for the authors):

      (1) I do not particularly prefer the acronyms AI and DI for disadvantageous inequity and advantageous inequity. Although they have been used in the literature, not every single paper uses them. More importantly, AI these days has such a strong meaning of artificial intelligence, so when I was reading this, I'd need to very actively inhibit this interpretation. I believe for the readability for a wider readership of eLife, I would advise not to use AI/DI here, but rather use the full terms.

      We thank the Reviewer for this suggestion. As the full terms are somewhat lengthy and appear frequently in the figures, we have elected to change the abbreviations for disadvantageous inequity and advantageous inequity to Dis-I and Adv-I, respectively, in the main text and the supplementary information. We still use AI/DI in the response letter to keep the terminology consistent.

      (2) Do "punishment rate" and "rejection rate" mean the same? If so, it would be helpful to stick with one single term, eg, rejection rate.

      We thank the Reviewer for this suggestion. As these terms have the same meaning, we have opted to use the term “rejection rate” throughout the main text.

      (3) For the linear mixed effect models, were other random effect structures also considered (eg, random slopes of experimental conditions)? It might be worth considering a few model specifications and selecting the best one to explain the data.

      Thanks for this comment. Following established best practices (Barr, Levy, Scheepers, & Tily, 2013) we have elected to use a maximal random effects structure, whereby all possible predictor variables in the fixed effects structure also appear in the random effects structure.

      (4) For equation (4), the softmax temperature is denoted as tau, but later in the text, it is called gamma. Please make it consistent.

      We are appreciative of the Reviewer’s attention to detail. We have corrected this error.

      Reviewer #2 (Recommendations for the authors):

      (1) Several Tables in SI are unclear. I wasn't clear if these report raw probabilities or coefficients of mixed models. For any mixed models, it would help to give the model specification (e.g., Walkins form) and explain how variables were coded.

      We are appreciative of the Reviewer’s attention to detail. We have clarified, in the captions accompanying our supplemental regression tables, that these coefficients represent log-odds. Regretfully we are unaware of the “Walkins form” the Reviewer references (even after extensive searching of the scientific literature). However, in our new revision we do include lme4 model syntax in our supplemental information, which we believe will be helpful for readers seeking to replicate our model specification.

      (2) In one of the models it was said that the guilt and envy parameters were bounded between 0-1 but this doesn't make sense and I think values outside this range were later reported.

      We are again appreciative of the Reviewer’s attention to detail. This was an error we have corrected— the actual range is [0,10].

      (3) It is unclear if the model parameters are recoverable.

      In response to this comment our revision now reports a basic parameter recovery analysis for the winning Preference Inference model. This is reported in our revised Methods:

      “Finally, to verify that the free parameters of the winning model (Preference Inference) are recoverable, we simulated 200 artificial subjects, based on the Learning Phase of Experiment 1, with free parameters randomly chosen (uniformly) from their defined ranges. We then employed the same model-fitting procedure as described above to estimate these parameter values. We found that all parameters of the model can be recovered (see Figure S2).”

      And scatter plots depicting these simulated (versus recovered) parameters are given in Figure S2 of our revised Supplementary Information.

      (4) I was confused about what Figure S2 shows. The text says this is about correlating contagious effects for different offers but the captions speak about learning effects. This is an important aspect which is unclear.

      We have removed this figure in response to both Reviewers’ comments about the limited insights that can be drawn on the basis of these correlations.

    1. Author response:

      The following is the authors’ response to the original reviews.

      Reviewer #1 (Public review):

      Summary:

      The aim of this paper is to develop a simple method to quantify fluctuations in the partitioning of cellular elements. In particular, they propose a flow-cytometry-based method coupled with a simple mathematical theory as an alternative to conventional imaging-based approaches.

      Strengths:

      The approach they develop is simple to understand and its use with flow-cytometry measurements is clearly explained. Understanding how the fluctuations in the cytoplasm partition vary for different kinds of cells is particularly interesting.

      Weaknesses:

      The theory only considers fluctuations due to cellular division events. This seems a large weakness because it is well known that fluctuations in cellular components are largely affected by various intrinsic and extrinsic sources of noise and only under particular conditions does partitioning noise become the dominant source of noise.

      We thank the Reviewer for her/his evaluation of our manuscript. The point raised is indeed a crucial one. In a cell division cycle, there are at least three distinct sources of noise that affect component numbers [1]:

      (1) Gene expression and degradation, which determine fluctuations in component numbers during cell growth.

      (2) Variability in cell division time, which depending on the underlying model may or may not be a function of protein level and gene expression.

      (3) Noise in the partitioning/inheritance of components between mother and daughter cells.

      Our approach specifically addresses the latter, with the goal of providing a quantitative measure of this noise source. For this reason, in the present work, we consider homogeneous cancer cell populations that could be considered to be stationary from a population point-of-view. By tracking the time evolution of the distribution of tagged components via live fluorescent markers, we aim at isolating partitioning noise effects. However, as noted by the Reviewer, other sources of noise are present, and depending on the considered system the relative contributions of the different sources may change. Thus, we agree that a quantification of the effect of the various noise sources on the accuracy of our measurements will improve the reliability of our method.

      In this respect, assuming independence between noise sources, we reasoned that variability in cell cycle length would affect the timing of population emergence but not the intrinsic properties of those populations (e.g., Gaussian variance). To test this hypothesis, we conducted a preliminary set of simulations in which cell division times were drawn from an Erlang distribution (mean = 18 h, k = 4). The results, showing the behavior of the mean and variance of the component distributions across generations, are presented in Supplementary Information - Figure 1. Under the assumption of independence between different noise sources, no significant effects were observed even for high asymmetries of the partitioning distribution.
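
      To make the independence argument concrete, the sketch below simulates a timer-like population in which division times follow an Erlang distribution (mean 18 h, k = 4) that does not depend on the cell's content. It is an illustrative toy and not the simulation code behind the Supplementary figures: the Gaussian partition fraction, the unit starting content, and all function names are assumptions.

```python
import random
import statistics


def erlang_time(rng, mean_hours=18.0, k=4):
    """Erlang(k) draw, i.e. a gamma variate with integer shape k and scale mean/k."""
    return rng.gammavariate(k, mean_hours / k)


def simulate_population(n_founders, t_end, partition_sd, seed=0):
    """Return {generation: [content, ...]} for all cells alive at time t_end.

    Each founder starts with content 1.0. At division a mother passes a
    fraction f to one daughter and 1 - f to the other, with f drawn from a
    Gaussian centred on 0.5 (hypothetical partitioning rule).
    """
    rng = random.Random(seed)
    pending = [(0.0, 0, 1.0) for _ in range(n_founders)]  # (birth, generation, content)
    by_generation = {}
    while pending:
        birth, generation, content = pending.pop()
        division = birth + erlang_time(rng)
        if division > t_end:
            # The cell has not divided yet at the observation time.
            by_generation.setdefault(generation, []).append(content)
            continue
        f = min(max(rng.gauss(0.5, partition_sd), 0.0), 1.0)
        pending.append((division, generation + 1, content * f))
        pending.append((division, generation + 1, content * (1.0 - f)))
    return by_generation


def generation_means(by_generation):
    """Mean content per generation, for comparison with 0.5 ** generation."""
    return {g: statistics.mean(v) for g, v in sorted(by_generation.items())}
```

      Because division timing is drawn independently of content in this toy, the variability in cycle length changes how many cells of each generation are present at a given time, but the mean content of generation g stays close to 0.5 to the power g.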

      Next, we quantified the accuracy of our measurements in the presence of cross-talk between the various noise sources. Indeed, cells may adopt different growth and division strategies, which can be grouped into three categories based on what triggers division:

      ● Sizer-like cells divide upon reaching a certain size;

      ● Timer-like cells divide after a fixed time (corresponding to the previously treated case with independent noise);

      ● Adder-like cells divide once their volume has increased by a finite amount.

      A detailed discussion of these strategies, including their mathematical formulation, can be found in [2]. Here we have assumed that cells follow a sizer-like model. In this way, we study a system in which cells with a higher number of components have shorter division times. Hence, older (newer) generations are emptied (populated) starting from higher values.

      As can be observed, higher levels of division asymmetry increase the fluctuations of the system relative to the analytically expected behavior, particularly in later generations.

      The result in Supplementary Information - Figure 3 demonstrates the robustness of our method, as the estimates remain within the pre-established experimental error margin. We have now discussed this aspect both in the main and in the Supplementary Information and thank the Reviewer for pointing it out.

      (1) Soltani, Mohammad, et al. "Intercellular variability in protein levels from stochastic expression and noisy cell cycle processes." PLoS computational biology 12.8 (2016): e1004972.

      (2) Mattia Miotto, Simone Scalise, Marco Leonetti, Giancarlo Ruocco, Giovanna Peruzzi, and Giorgio Gosti. A size-dependent division strategy accounts for leukemia cell size heterogeneity. Communications Physics, 7(1):248, 2024.

      Reviewer #2 (Public review):

      Summary:

      The authors present a combined experimental and theoretical workflow to study partitioning noise arising during cell division. Such quantifications usually require time-lapse experiments, which are limited in throughput. To bypass these limitations, the authors propose to use flow-cytometry measurements instead and analyse them using a theoretical model of partitioning noise. The problem considered by the authors is relevant and the idea to use statistical models in combination with flow cytometry to boost statistical power is elegant. The authors demonstrate their approach using experimental flow cytometry measurements and validate their results using time-lapse microscopy. However, while I appreciate the overall goal and motivation of this work, I was not entirely convinced by the strength of this contribution. The approach focuses on a quite specific case, where the dynamics of the labelled component depend purely on partitioning. As such it seems incompatible with studying the partitioning noise of endogenous components that exhibit production/turnover. The description of the methods was partly hard to follow and should be improved. In addition, I have several technical comments, which I hope will be helpful to the authors.

      We are grateful to the Reviewer for the comments. Indeed, both partitioning and production/turnover noise are in general fundamental processes. At present the only way to consider them together is through time-consuming and costly transfection/microscopy/tracking experiments. In this work, we aimed at developing a method to effectively pinpoint the first component, i.e. partitioning noise; thus, we opted to separate the two different noise sources.

      Below, we provided a point-by-point response that we hope will clarify all raised concerns.

      Comments:

      (1) In the theoretical model, copy numbers are considered to be conserved across generations. As a consequence, concentrations will decrease over generations due to dilution. While this consideration seems plausible for the considered experimental system, it seems incompatible with components that exhibit production and turnover dynamics. I am therefore wondering about the applicability/scope of the presented approach and to what extent it can be used to study partitioning noise for endogenous components. As presented, the approach seems to be limited to a fairly small class of experiments/situations.

      We see the Reviewer's point. Indeed, we are proposing a high-throughput and robust procedure to measure the partitioning/inheritance noise of cell components through flow cytometry time courses. By using live-cell staining of cellular compounds, we can track the effect of partitioning noise on fluorescence intensity distribution across successive generations. This specific procedure is purposely optimized to isolate partitioning noise from other sources and, as it is, cannot track endogenous components or dyes that require fixation. While this certainly poses limits to the proposed approach, there are numerous contexts in which our methodology could be used to explore the role of asymmetric inheritance. Among others, (i) investigating how specific organelles are differentially partitioned and how this influences cellular behavior could provide deeper insights into fundamental biological processes: asymmetric segregation of organelles is a key factor in cell differentiation, aging, and stress response. During cell division, organelles such as mitochondria, the endoplasmic reticulum, lysosomes, peroxisomes, and centrosomes can be unequally distributed between daughter cells, leading to functional differences that influence their fate. For instance, Katajisto et al. [1] proposed that asymmetric division of mitochondria in stem cells is associated with the retention of stemness traits in one daughter cell and differentiation in the other. As organisms age, stem cells accumulate damage, and to prevent exhaustion and compromised tissue function, cells may use asymmetric inheritance to segregate older or damaged subcellular components into one daughter cell. (ii) Asymmetric division has also been linked to therapeutic resistance in Cancer Stem Cells [2]. Although the functional consequences are not yet fully determined, the asymmetric inheritance of mitochondria is recognized as playing a pivotal role [3]. Another potential application of our methodology may be (iii) the inheritance of lysosomes, which, together with mitochondria, appears to play a crucial role in determining the fate of human blood stem cells [4]. Furthermore, similar to studies conducted on liquid tumors [5][6], our approach could be extended to investigate cell growth dynamics and the origins of cell size homeostasis in adherent cells [7][8][9]. The aforementioned case studies can be readily addressed using our approach, which in general is applicable whenever live-cell dyes can be used. We have added a discussion of the strengths and limitations of the method in the Discussion section of the revised version of the manuscript.

      (1) Katajisto, Pekka, et al. "Asymmetric apportioning of aged mitochondria between daughter cells is required for stemness." Science 348.6232 (2015): 340-343.

      (2) Hitomi, Masahiro, et al. "Asymmetric cell division promotes therapeutic resistance in glioblastoma stem cells." JCI insight 6.3 (2021): e130510.

      (3) García-Heredia, José Manuel, and Amancio Carnero. "Role of mitochondria in cancer stem cell resistance." Cells 9.7 (2020): 1693.

      (4) Loeffler, Dirk, et al. "Asymmetric organelle inheritance predicts human blood stem cell fate." Blood, The Journal of the American Society of Hematology 139.13 (2022): 2011-2023.

      (5) Miotto, Mattia, et al. "Determining cancer cells division strategy." arXiv preprint arXiv:2306.10905 (2023).

      (6) Miotto, Mattia, et al. "A size-dependent division strategy accounts for leukemia cell size heterogeneity." Communications Physics 7.1 (2024): 248.

      (7) Kussell, Edo, and Stanislas Leibler. "Phenotypic diversity, population growth, and information in fluctuating environments." Science 309.5743 (2005): 2075-2078.

      (8) McGranahan, Nicholas, and Charles Swanton. "Clonal heterogeneity and tumor evolution: past, present, and the future." Cell 168.4 (2017): 613-628.

      (9) De Martino, Andrea, Thomas Gueudré, and Mattia Miotto. "Exploration-exploitation tradeoffs dictate the optimal distributions of phenotypes for populations subject to fitness fluctuations." Physical Review E 99.1 (2019): 012417.

      (2) Similar to the previous comment, I am wondering what would happen in situations where the generations could not be as clearly identified as in the presented experimental system (e.g., due to variability in cell-cycle length/stage). In this case, it seems to be challenging to identify generations using a Gaussian Mixture Model. Can the authors comment on how to deal with such situations? In the abstract, the authors motivate their work by arguing that detecting cell divisions from microscopy is difficult, but doesn't their flow cytometry-based approach have a similar problem?

      The point raised is an important one, as it highlights the fundamental role of the gating strategy. The ability to identify the distribution of different generations using the Gaussian Mixture Model (GMM) strongly depends on the degree of overlap between distributions. The more the distributions overlap, the less capable we are of accurately separating them.

      The extent of overlap is influenced by the coefficients of variation (CV) of both the partitioning distribution function and the initial component distribution. Specifically, the component distribution at time t results from the convolution of the component distribution itself at time t−1 and the partitioning distribution function. Therefore, starting with a narrow initial component distribution allows for better separation of the generation peaks. The balance between partitioning asymmetry and the width of the initial component distribution is thus crucial.

      As shown in Supplementary Information - Figure 5, increasing the CV of either distribution reduces the ability to distinguish between different generations.
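
      The dependence of peak overlap on the two coefficients of variation can be illustrated with a minimal simulation sketch. It is not the authors' pipeline: the Gaussian shapes, the clipping bounds, the function names and the separation score (gap between the log2 means of two consecutive generations divided by the sum of their standard deviations) are all assumptions made here for illustration.

```python
import math
import random
import statistics

def divide_population(values, cv_partition, rng):
    """Return daughter intensities after every cell divides once."""
    daughters = []
    for x in values:
        # Fraction inherited by one daughter: mean 0.5, spread set by the CV.
        # Gaussian shape and clipping bounds are illustrative assumptions.
        f = rng.gauss(0.5, cv_partition * 0.5)
        f = min(max(f, 0.01), 0.99)
        daughters.append(x * f)
        daughters.append(x * (1.0 - f))
    return daughters

def peak_separation(cv_initial, cv_partition, n_cells=2000, seed=0):
    """Separation of generations 0 and 1 on a log2 intensity axis."""
    rng = random.Random(seed)
    # Initial (sorted) population with mean intensity 1 and the given CV.
    gen0 = [max(rng.gauss(1.0, cv_initial), 1e-3) for _ in range(n_cells)]
    gen1 = divide_population(gen0, cv_partition, rng)
    log0 = [math.log2(x) for x in gen0]
    log1 = [math.log2(x) for x in gen1]
    gap = statistics.mean(log0) - statistics.mean(log1)
    return gap / (statistics.stdev(log0) + statistics.stdev(log1))

narrow = peak_separation(cv_initial=0.1, cv_partition=0.1)
wide_partition = peak_separation(cv_initial=0.1, cv_partition=0.4)
wide_initial = peak_separation(cv_initial=0.4, cv_partition=0.1)
```

      In this toy setting, raising either CV lowers the separation score, which mirrors the qualitative statement above. The numerical values carry no experimental meaning.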

      However, the variance of the initial distribution cannot be reduced arbitrarily. While selecting a narrow distribution facilitates a better reconstruction of the distributions, it simultaneously limits the number of cells available for the experiment. Therefore, for components exhibiting a high level of asymmetry, further narrowing of the initial distribution becomes experimentally impractical.

      In such cases, an approach previously tested on liquid tumors [1] involves applying the Gaussian Mixture Model (GMM) in two dimensions by co-staining another cellular component with lower division asymmetry.

      Regarding time-lapse fluorescence microscopy, the main challenge lies not in disentangling the interplay of different noise sources, but rather in obtaining sufficient statistical power from experimental data. While microscopy provides detailed insights into the division process and component partitioning, its low throughput limits large-scale statistical analyses. Current segmentation algorithms still perform poorly in crowded environments and with complex cell shapes, requiring a substantial portion of the image analysis pipeline to be performed manually, a process that is time-consuming and difficult to scale. In contrast, our cytometry-based approach bypasses this analysis bottleneck, as it enables a direct population-wide measurement of the system's evolution. We have added a detailed discussion of this argument in the Supplementary Material of the manuscript and added a clarification of the role of the gating strategy in the main text.

      (1) Peruzzi, Giovanna, et al. "Asymmetric binomial statistics explains organelle partitioning variance in cancer cell proliferation." Communications Physics 4.1 (2021): 188.

      (3) I could not find any formal definition of division asymmetry. Since this is the most important quantity of this paper, it should be defined clearly.

      We thank the Reviewer for the note. By division asymmetry we mean a quantity that reflects how similar two daughter cells are likely to be in terms of inherited components after a division. We opted to measure it via the coefficient of variation (the square root of the variance divided by the mean) of the partitioning fraction distribution. We have added this definition to the revised version of the manuscript.
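
      As a compact restatement of this definition (the symbols f, \mu_f and \sigma_f are notation introduced here for illustration and may differ from the manuscript's), let f denote the fraction of a component inherited by one daughter cell. The asymmetry measure is then

```latex
\mathrm{CV}_f \;=\; \frac{\sigma_f}{\mu_f} \;=\; \frac{\sqrt{\mathrm{Var}(f)}}{\mathbb{E}[f]}
```

      If every division split the component exactly in half, Var(f) would vanish and CV_f would be 0; larger values indicate more unequal inheritance.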

      (4) The description of the model is unclear/imprecise in several parts. For instance, it seems to me that the index "i" does not really refer to a cell in the population, but rather a subpopulation of cells that has undergone a certain number of divisions. Furthermore, why is the argument of Equation 11 suddenly the fraction f as opposed to the component number? I strongly recommend carefully rewriting and streamlining the model description and clearly defining all quantities and how they relate to each other.

      We have carefully amended the text to avoid double naming of variables and to clarify each step of the computation. In equation 11 the variable f refers to the fluorescence intensity, but the notation will be changed to increase clarity.

      (5) Similarly, I was not able to follow the logic of Section D. I recommend carefully rewriting this section to make the rationale, logic, and conclusions clear to the reader.

      We have updated the manuscript clarifying the scope of section D and its results. In brief, Section A presents a general model to derive the variance of the partitioning distribution from flow cytometry time-course data without making any assumptions about the shape of the distribution itself. In Section D, our goal is to interpret the origin of asymmetry and propose a possible form for the partitioning distribution. Since the dyes used bind non-specifically to cytoplasmic amines, the tagged proteins are expected to be uniformly distributed throughout the cytoplasm and present in large numbers. Given these assumptions the least complex model for division follows the binomial distribution, with a parameter that measures the bias in the process. Therefore, we performed a similar computation to that in Section A, which allows us to estimate not only the variance but also the degree of biased asymmetry. Finally, we fitted the data to this new model and proposed an experimental interpretation of the results.
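
      The biased binomial picture described above can be sketched as follows. This is an illustration under stated assumptions, not the model of Section D itself: the names `n_molecules` and `p_bias` are hypothetical, each molecule is assumed to go independently to the first daughter with probability `p_bias`, and both daughters of every division are pooled. Under these assumptions the pooled fraction has mean 1/2 and variance p(1-p)/N + (p - 1/2)^2, so the bias term dominates when N is large.

```python
import random
import statistics

def partition_fractions(n_molecules, p_bias, n_divisions, seed=0):
    """Simulate inherited fractions for both daughters of each division."""
    rng = random.Random(seed)
    fractions = []
    for _ in range(n_divisions):
        # Each molecule independently ends up in the first daughter
        # with probability p_bias (binomial partitioning).
        n_first = sum(1 for _ in range(n_molecules) if rng.random() < p_bias)
        f = n_first / n_molecules
        fractions.extend([f, 1.0 - f])
    return fractions

def expected_cv(n_molecules, p_bias):
    """CV of the pooled fraction: binomial noise plus the bias term."""
    variance = p_bias * (1.0 - p_bias) / n_molecules + (p_bias - 0.5) ** 2
    return variance ** 0.5 / 0.5

fractions = partition_fractions(n_molecules=200, p_bias=0.6, n_divisions=2000)
simulated_cv = statistics.pstdev(fractions) / statistics.mean(fractions)
predicted_cv = expected_cv(n_molecules=200, p_bias=0.6)
```

      With an unbiased split (`p_bias = 0.5`) the second term disappears and the CV shrinks as 1/sqrt(N), which is why a large number of tagged molecules makes any residual asymmetry attributable to the bias in this sketch.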

      (6) Much theoretical work has been done recently to couple cell-cycle variability to intracellular dynamics. While the authors neglect the latter for simplicity, it would be important to further discuss these approaches and why their simplified model is suitable for their particular experiments.

      We agree with the Reviewer and have added a discussion on this topic in the Introduction and Discussion sections of the main text.

      (7) In the discussion the authors note that the microscopy-based estimates may lead to an overestimation of the fluctuations due to limited statistics. I could not follow that reasoning. Due to the gating in the flow cytometry measurements, I could imagine that the resulting populations are more stringently selected as compared to microscopy. Could that also be an explanation? More generally, it would be interesting to see how robust the results are in terms of different gating diameters.

      The Reviewer is right on the importance of the sorting procedure. As already discussed in a previous point, the gating strategy we employed plays a fundamental role: it reduces the overlap of fluorescence distributions as generations progress, enables the selection of an initial distribution distinct from the fluorescence background, allowing for longer tracking of proliferation, and synchronizes the initial population. The narrower the initial distribution, the more separated the peaks of different generations will be. However, this also results in a smaller number of cells available for the experiment, requiring a careful balance between precision and experimental feasibility. A similar procedure, although it would certainly limit the estimation error, would be impracticable in the case of microscopy. Indeed, the primary limitation and source of error there is the number of recorded events. Our pipeline allowed us to track on the order of hundreds of division events, but the analysis time scales non-linearly with the number of events. Significantly increasing the dataset would have been extremely time-consuming. Restricting the analysis to cells with similar fluorescence, although sound in principle, would have reduced the statistics to a level where the sampling error would dominate the measurement. Moreover, different experiments would have been hardly comparable, since different fluorescences could map to equally sized cells. In light of these factors, we expect a higher CV for the microscopy measurements than for the flow cytometry ones. In the plots below, we show the behaviour of the mean and the standard deviation of N numbers sampled from a Gaussian distribution N(0,1) as a function of the sampling number N. The higher N is, the closer the sampled distribution will be to the true one. The region in the hundreds of samples is still very noisy, but to do much better we would have to reach the order of thousands.

      We have added a discussion on these aspects in the revised version of the manuscript, with a deeper description of the importance of the sorting procedure in the Supplementary Material.

      Author response image 1.

      Standard deviation and mean value of a distribution of points sampled from a Gaussian distribution with mean 0 and standard deviation 1, versus the number of samples, N. Increasing N leads to a closer approximation of the expected values. Highlighted in orange is the Microscopy Working Region (Microscopy WR), which corresponds to the number of samples we are able to reach with microscopy experiments. Highlighted in yellow is the region we would have to reach to lower the estimation error, which would however be very expensive in terms of analysis time.
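
      The sampling argument behind this figure can be reproduced with a short sketch. It does not regenerate the authors' plot; the function names and the number of repeats are choices made here. For Gaussian samples, the scatter of the estimated mean is expected to shrink roughly as 1/sqrt(N) and that of the estimated standard deviation roughly as 1/sqrt(2N).

```python
import random
import statistics

def sample_estimates(n_samples, rng):
    """Mean and standard deviation of n_samples draws from N(0, 1)."""
    draws = [rng.gauss(0.0, 1.0) for _ in range(n_samples)]
    return statistics.mean(draws), statistics.stdev(draws)

def spread_of_estimates(n_samples, n_repeats=200, seed=0):
    """Scatter of the two estimates across repeated experiments."""
    rng = random.Random(seed)
    results = [sample_estimates(n_samples, rng) for _ in range(n_repeats)]
    means = [m for m, _ in results]
    stds = [s for _, s in results]
    # Standard deviation of each estimator over the repeats.
    return statistics.stdev(means), statistics.stdev(stds)
```

      Calling `spread_of_estimates(100)` and `spread_of_estimates(1000)` gives the scatter in the "hundreds" and "thousands" regimes discussed above; a tenfold increase in N reduces both scatters by roughly a factor of three.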

      (7) It would be helpful to show flow cytometry plots including the identified subpopulations for all cell lines, currently, they are shown only for HCT116 cells. More generally, very little raw data is shown.

      We have provided the requested plots for the other cell lines together with additional raw data coming from simulations in the Supplementary Material.

      (8) The title of the manuscript could be tailored more to the considered problem. At the moment it is very generic.

      We see the Reviewer's point. The proposed title aims at conveying the wide applicability of the presented approach, which ultimately allows for the assessment of the fluctuations in the levels of cellular components at division. This in turn reflects the asymmetry of the division.

      Reviewer #1 (Recommendations for the authors):

      (1) I am quite concerned about the fact that the theory only considers fluctuations due to cellular division events since intrinsic and extrinsic noise sources are often dominant. I suggest that the authors simulate a full model of cell growth and division (that accounts for fluctuations in gene expression, cell-cycle dynamics, and cell division) to generate a controlled synthetic dataset and then use this as input to their method to understand how robust their results are to the influence of noise sources other than partitioning.

      We thank the Reviewer for the suggestion. Following this advice, we performed two sets of simulations in which we took into account the effect of the other noise sources. A detailed description of the results and the methods has been added to the Supplementary Material, and the topic is also addressed in the main text. A cell proliferation cycle is affected by different sources of variability: (i) production and degradation processes of molecules; (ii) variability in the length of the cell cycle; (iii) partitioning noise, i.e. the asymmetric inheritance of components between the two daughter cells. However, the experimental approach and the model have been formulated to specifically address the effects of partitioning noise. Indeed, since we are dealing with components tagged via live fluorescent markers, production of new fluorophores is impossible and can therefore be discarded. Degradation, in contrast, is a global effect that influences the behavior of the mean of the distribution in a time-dependent manner. However, in the experimental data in Figure 1 of the main text, no significant depletion of fluorescence is observed, or at least it is hidden by the experimental fluctuations of the measurement. Fluctuations in cell cycle length require a more careful evaluation. We conducted two sets of simulations. In the first, we assumed independence between fluctuations in cell cycle length and partitioning noise.

      The division time of each cell was drawn from an Erlang distribution (mean = 18, k = 4) and the results, showing the behavior of the mean and variance of the component distributions across generations, are presented in Supplementary Information - Figure 1. Under the assumption of independence between different noise sources, no significant effects were observed even for high asymmetries of the partitioning distribution. The second set of simulations considered a situation in which the cell’s components and division time are coupled. We assumed a sizer-like division strategy in which bigger cells have a shorter division time; the results of these simulations are shown in Supplementary Information - Figure 2.
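
      For the first set of simulations, drawing division times from an Erlang distribution with the stated mean and shape can be sketched as below. This is only an illustration of the sampling step, not the authors' simulation code; the function name is hypothetical and the time unit is left as in the text. An Erlang distribution is a gamma distribution with integer shape k, so its scale equals mean/k and its coefficient of variation is 1/sqrt(k), i.e. 0.5 for k = 4.

```python
import random
import statistics

def erlang_division_times(n_cells, mean_time=18.0, shape_k=4, seed=0):
    """Draw independent division times from an Erlang(k) distribution."""
    rng = random.Random(seed)
    scale = mean_time / shape_k  # gamma scale so that shape * scale = mean
    # random.gammavariate(alpha, beta) takes the shape and the scale.
    return [rng.gammavariate(shape_k, scale) for _ in range(n_cells)]

times = erlang_division_times(20000)
mean_time = statistics.mean(times)
cv_time = statistics.stdev(times) / mean_time
```

      In the independent-noise scenario, each simulated cell would receive one such division time regardless of how much of the component it inherited; coupling the two, as in the sizer-like scenario, would require making the drawn time depend on cell content.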

      As can be observed, higher levels of division asymmetry increase the fluctuations of the system relative to the analytically expected behavior, particularly in later generations.

      The result in Supplementary Information - Figure 3 demonstrates the robustness of our method, as the estimates remain within the pre-established experimental error margin. A detailed description of this topic has been added to the Supplementary Information and to the main text.

      (2) I find the use of the Cauchy distribution somewhat odd since this does not have a finite mean or a variance and I suspect it is unlikely this mimics a naturally measurable distribution in their experiments. This should either be justified biologically or else replaced by a more realistic distribution.

      Following the Reviewer’s suggestion, we have changed the distribution to a Gaussian one.

      (3) There is a large body of literature on gene expression models that incorporate a large amount of detail including cell-cycle dynamics and cell division which are relevant to their discussion but not referenced. I suggest they read the following and see how to incorporate at least some of them in their discussion:

      Frequency domain analysis of fluctuations of mRNA and protein copy numbers within a cell lineage: theory and experimental validation., Physical Review X, 11.2 (2021): 021032.

      Exact solution of stochastic gene expression models with bursting, cell cycle and replication dynamics., Physical Review E, 101.3 (2020): 032403.

      Coupling gene expression dynamics to cell size dynamics and cell cycle events: Exact and approximate solutions of the extended telegraph model., Iscience, 26.1 (2023).

      Models of protein production along the cell cycle: An investigation of possible sources of noise., Plos one, 15.1 (2020): e0226016.

      Sources, propagation and consequences of stochasticity in cellular growth., Nature communications, 9(1), 4528

      Intrinsic and extrinsic noise of gene expression in lineage trees., Scientific Reports, 9.1 (2019): 474.

      We thank the Reviewer for the provided articles. We have expanded both the Introduction and the Discussion to comment on them, also in response to the second Reviewer's comments.

      Reviewer #2 (Recommendations for the authors):

      (1) Even when it is used only during simulation for the sake of illustration, the Cauchy distribution is a somewhat unfortunate choice as its moments do not exist and hence, the authors' approach would not apply. I would recommend using another distribution instead.

      Following the Reviewer’s suggestion, we have changed the distribution to a Gaussian one.

      (2) "cells population" should be "cell population".

      We have amended this mistake in the text.

    1. Author Response

      The following is the authors’ response to the previous reviews.

      Reviewer #1:

      Concerns Public Review:

      1) The framing of 'infinite possible types of conflict' feels like a strawman. While they might be true across stimuli (which may motivate a feature-based account of control), the authors explore the interpolation between two stimuli. Instead, this work provides confirmatory evidence that task difficulty is represented parametrically (e.g., consistent with literatures like n-back, multiple object tracking, and random dot motion). This parametric encoding is standard in feature-based attention, and it's not clear what the cognitive map framing is contributing.

      Suggestion:

      1) 'infinite combinations'. I'm frankly confused by the authors' response. I don't feel like the framing has changed very much, besides a few minor replacements. Previous work in MSIT (e.g., by the author Zhongzheng Fu) has looked at whether conflict levels are represented similarly across conflict types using multivariate analyses. In the paper mentioned by Ritz & Shenhav (2023), the authors looked at whether conflict levels are represented similarly across conflict types using multivariate analyses. It's not clear what this paper contributes theoretically beyond the connections to cognitive maps, which feel like an interpretative framework rather than a testable hypothesis (i.e., these previous papers could have framed their work as cognitive maps).

      Response: We acknowledge the limitations inherent in our experimental design, which prevents us from conducting a strict test of the cognitive space view. In our previous revision, we took steps to soften our conclusions and emphasize these limitations. However, we still believe that our study offers valuable and novel insights into the cognitive space, and the tests we conducted are not merely strawman arguments.

      Specifically, our study aimed to investigate the fundamental principles of the cognitive space view, as we stated in our manuscript that “the representations of different abstract information are organized continuously and the representational geometry in the cognitive space is determined by the similarity among the represented information (Bellmund et al., 2018)”. While previous research has applied multivariate analyses to understand cognitive control representation, no prior studies had directly tested the two key hypotheses associated with cognitive space: (1) that cognitive control representation across conflict types is continuous, and (2) that the similarity among representations of different conflict types is determined by their external similarity.

      Our study makes a unique contribution by directly testing these properties through a parametric manipulation of different conflict types. This approach differs significantly from previous studies in two ways. First, our parametric manipulation involves more than two levels of conflict similarity, enabling us to directly test the two critical hypotheses mentioned above. Unlike studies such as Fu et al. (2022) and others that have treated different conflict types categorically, we introduced a gradient change in conflict similarity. This differentiation allowed us to employ representational similarity analysis (RSA) over the conflict similarity, which goes beyond mere decoding as utilized in prior work (see more explanation below for the difference between Fu et al., 2022 and our study [1]).

      Second, our parametric manipulation of conflict types differs from previous studies that have manipulated task difficulty, and the modulation of multivariate pattern similarity observed in our study could not be attributed to task difficulty. Previous research, including Ritz & Shenhav (2023) (see the explanation below [2]), has primarily shown that task difficulty modulates univariate brain activation. A recent work by Wen & Egner (2023) reported a gradual change in the multivariate pattern of brain activations across a wide range of frontoparietal areas, supporting the reviewer’s idea that “task difficulty is represented parametrically”. However, we do not believe that our results reflect the task difficulty representation. For instance, in our study, the spatial Stroop-only and Simon-only conditions exhibited similar levels of difficulty, as indicated by their relatively comparable congruency effects (Fig. S1). Despite this similarity in difficulty, we found that the representational similarity between these two conditions was the lowest (see revised Fig. S4, the most off-diagonal value). This observation aligns more closely with our hypothesis that these two conditions are most dissimilar in terms of their conflict types.

      [1] Fu et al. (2022) offers important insights into the geometry of cognitive space for conflict processing. They demonstrated that Simon and flanker conflicts could be distinguished by a decoder that leverages the representational geometry within a multidimensional space. However, their model of cognitive space primarily relies on categorical definitions of conflict types (i.e., Simon versus flanker), rather than exploring a parametric manipulation of these conflict types. The categorical manipulations make it difficult to quantify conceptual similarity between conflict types and hence limit the ability to test whether neural representations of conflict capture conceptual similarity. To the best of our knowledge, no previous studies have manipulated the conflict types parametrically. This gap highlights a broader challenge within cognitive science: effectively manipulating and measuring similarity levels for conflicts, as well as other high-level cognitive processes, which are inherently abstract. We therefore believe our parametric manipulation of conflict types, despite its inevitable limitations, is an important contribution to the literature.

      We have incorporated the above statements into our revised manuscript: Methodological implications. Previous studies with mixed conflicts have applied mainly categorical manipulations of conflict types, such as the multi-source interference task (Fu et al., 2022) and color Stroop-Simon task (Liu et al., 2010). The categorical manipulations make it difficult to quantify conceptual similarity between conflict types and hence limit the ability to test whether neural representations of conflict capture conceptual similarity. To the best of our knowledge, no previous studies have manipulated the conflict types parametrically. This gap highlights a broader challenge within cognitive science: effectively manipulating and measuring similarity levels for conflicts, as well as other high-level cognitive processes, which are inherently abstract. The use of an experimental paradigm that permits parametric manipulation of conflict similarity provides a way to systematically investigate the organization of cognitive control, as well as its influence on adaptive behaviors.

      [2] The work by Ritz & Shenhav (2023) indeed applied multivariate analyses, but they did not test the representational similarity across different levels of task difficulty in the way that we investigated different levels of conflict types, nor did they manipulate conflict types as our study did. They first estimated univariate brain activations that were parametrically scaled by task difficulty (e.g., target coherence), yielding one map of parameter estimates (i.e., encoding subspace) for each of the target coherence and distractor congruence. The multivoxel patterns from the above maps were correlated to test whether the target coherence and distractor congruence share similar neural encoding. It is noteworthy that the encoding of task difficulty in their study is estimated at the univariate level, like the univariate parametric modulation analysis in our study. The representational similarity across target coherence and distractor congruence was a second-order test and did not reflect the similarity across different difficulty levels. We have, however, found another study (Wen & Egner, 2023) that directly tested the representational similarity across different levels of task difficulty; they observed a higher representational similarity between conditions with similar difficulty levels within a wide range of brain regions.

      Reference:

      Wen, T., & Egner, T. (2023). Context-independent scaling of neural responses to task difficulty in the multiple-demand network. Cerebral Cortex, 33(10), 6013-6027. https://doi.org/10.1093/cercor/bhac479

      Fu, Z., Beam, D., Chung, J. M., Reed, C. M., Mamelak, A. N., Adolphs, R., & Rutishauser, U. (2022). The geometry of domain-general performance monitoring in the human medial frontal cortex. Science (New York, N.Y.), 376(6593), eabm9922. https://doi.org/10.1126/science.abm9922

      Ritz, H., & Shenhav, A. (2023). Orthogonal neural encoding of targets and distractors supports multivariate cognitive control. https://doi.org/10.1101/2022.12.01.518771

      Another issue is suggesting mixtures between two types of conflict may be many independent sources of conflict. Again, this feels like the strawman. There's a difference between infinite combinations of stimuli on the one hand, and levels of feature on the other hand. The issue of infinite stimuli is why people have proposed feature-based accounts, which are often parametric, eg color, size, orientation, spatial frequency. Mixing two forms of conflict is interesting, but the task limitations (i.e., highly correlated features) prevent an analysis of whether these are truly mixed (or eg reflect variations on just one of the conflict types). Without being able to compare a mixture between types vs levels of only one type, it's not clear what you can draw from these results re: how these are combined (and not clear how it reconciles the debate between general and specific).

      Response: As the reviewer pointed out, a feature (or a parameterization) is an efficient way to encode potentially infinite stimuli. This is the same idea as our hypothesis: different conflict types are represented in a cognitive space akin to concrete features such as a color spectrum. This concept can be illustrated in the figure below.

      Author response image 1.

      We would like to clarify that in our study we have manipulated five levels of conflict types, but they all originated from two fundamental sources: vertical spatial Stroop and horizontal Simon conflicts. We agree that the mixture of these two sources does not inherently generate additional conflict sources. However, this mixture does influence the similarity among different conflict conditions, which provides essential variability that is crucial for testing the core hypotheses (i.e., continuity and similarity modulation, see the response above) of the cognitive space view. This clarification is crucial as the reviewer’s impression might have been influenced by our introduction, where we repeatedly emphasized multiple sources of conflicts. Our aim in the introduction was to outline a broader conceptual framework, which might not directly reflect the specific design of our current study. Recognizing the possibility of misinterpretation, we have adjusted our introduction and discussion to place less emphasis on the variety of possible conflict sources. For example, we have removed the expression “The large variety of conflict sources implies that there may be innumerable number of conflict conditions” from the introduction. As we have addressed in the previous response, the observed conflict similarity effect could not be attributed merely to task difficulty. Similarly, the mixture of spatial Stroop and Simon conflicts should not be attributed to one conflict source only; doing so would oversimplify it to an issue of task difficulty, as it would imply that our manipulation of conflict types merely represented varying levels of a single conflict, akin to manipulating task difficulty with everything else being equal.

      Importantly, the mixed conditions differ from variations along a single conflict source in that they also incorporate components of the other conflict source, thereby introducing differences beyond those that would be found among variations of a single conflict source. Additional evidence also challenges the single-dimension assumption. In our previous revisions, we compared model fits between the Cognitive-Space model and the Stroop-/Simon-only models, and results showed that the Cognitive-Space model (BIC = 5377093) outperformed the Stroop-Only (BIC = 5377122) and Simon-Only (BIC = 5377096) models. This suggests that mixed conflicts might not be solely reflective of either Stroop or Simon sources, although we did not include these results due to concerns raised by reviewers about the validity of such comparisons, given the high anticorrelation between the two dimensions. Furthermore, Fu et al. (2022) demonstrated that the mixture of Simon and Flanker conflicts (the sf condition) is represented as the vector sum of the Flanker and Simon dimensions within their space model, indicating a compositional nature. Similarly, our mixed conditions are combinations of Stroop and Simon conflicts, and it is plausible that these mixtures represent a fusion of both Stroop and Simon components, rather than just one. Thus, we disagree that the mixture of conflicts is a strawman. In response to this concern, we have included a statement in our limitation section: “Another limitation is that in our design, the spatial Stroop and Simon effects are highly anticorrelated. This constraint may make the five conflict types represented in a unidimensional space (e.g., a circle) embedded in a 2D space. This limitation also means we cannot conclusively rule out the possibility of a real unidimensional space driven solely by spatial Stroop or Simon conflicts. However, this appears unlikely, as it would imply that our manipulation of conflict types merely represented varying levels of a single conflict, akin to manipulating task difficulty with everything else being equal. If task difficulty were the primary variable, we would expect to see greater representational similarity between task conditions of similar difficulty, such as the Stroop and Simon conditions, which demonstrate comparable congruency effects (see Fig. S1). Contrary to this, our findings reveal that the Stroop-only and Simon-only conditions exhibit the lowest representational similarity (Fig. S4). Furthermore, Fu et al. (2022) have shown that the representation of mixtures of Simon and Flanker conflicts was compositional, rather than reflecting a single dimension, which also applies to our case.”
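
      The size of the model-comparison differences quoted in this response can be made explicit with a few lines of arithmetic. The sketch below only subtracts the three BIC values reported above; the dictionary keys and the function name are labels chosen here, and nothing is refitted.

```python
# BIC values as quoted in the response above (lower is better).
bic = {
    "Cognitive-Space": 5377093,
    "Stroop-Only": 5377122,
    "Simon-Only": 5377096,
}

def delta_bic(scores):
    """Difference of each model's BIC from the best (lowest) one."""
    best = min(scores.values())
    return {name: value - best for name, value in scores.items()}

differences = delta_bic(bic)
```

      The Cognitive-Space model is ahead of the Stroop-Only model by 29 BIC units but ahead of the Simon-Only model by only 3, so the margin over the Simon-Only model is small on its own. That fits the authors' choice to present this comparison as suggestive rather than decisive.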

      My recommendation would be to dramatically rewrite to reduce the framing of this providing critical evidence in favor of cognitive maps, and being more overt about the limitations of this task. However, the authors are not required to make further revisions in eLife's new model, and it's not clear how my scores would change if they made those revisions (ie the conceptual limitations would remain, the claims would just now match the more limited scope).

      Response: Given the above rationale and the adjustments we have made to the manuscript, we believe that we have thoroughly acknowledged and articulated the limitations of our study. Therefore, we have decided against a complete rewrite of the manuscript.

      Public Review:

      2) The representations within DLPFC appear to treat 100% Stroop and (to a lesser extent) 100% Simon differently than mixed trials. Within mixed trials, the RDMs within this region don't strongly match the predictions of the conflict similarity model. It appears that there may be a more complex relationship encoded in this region.

      Suggestion:

      2) RSMs in the key region of interest. I don't really understand the authors' response here either, e.g., 'It is essential to clarify that our conclusions were based on the significant similarity modulation effect identified in our statistical analysis using the cosine similarity model, where we did not distinguish between the within-Stroop condition and the other four within-conflict conditions (Fig. 7A, now Fig. 8A). This means that the representation of conflict type was not biased by the seemingly disparities in the values shown here'. In Figure 1C, it does look like they are testing this model.

      It seems like a stronger validation would test just the mixture trials (i.e., ignoring Simon-only and stroop-only). However, simon/stroop-only conditions being qualitatively different does beg the question of whether these are being represented parametrically vs categorically.

      Response: We apologize for the confusion caused by our previous response. To clarify, our conclusions have been drawn based on the robust conflict similarity effect.

      The conflict similarity regressor is defined by higher values in the diagonal cells (representing within-conflict similarity) than the off-diagonal cells (indicating between-conflict similarity), as illustrated in Fig. 1C and Fig. 8A (now Fig. 4B). It is important to note that this regressor may not be particularly sensitive to the variations within the diagonal cells. Our previous response aimed to emphasize that the inconsistencies observed along the diagonal do not contradict our core hypothesis regarding the conflict similarity effect.
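
      To make the structure of such a regressor concrete, here is a minimal sketch of a cosine-similarity matrix over five conflict types. The angles are an assumption made for illustration (five directions evenly spaced between a horizontal Simon axis at 0° and a vertical spatial Stroop axis at 90°), and the function name is hypothetical; the actual regressor is the one shown in the authors' figures.

```python
import math

# Hypothetical directions of the five conflict types, in degrees:
# 0 = Simon-only (horizontal), 90 = spatial Stroop-only (vertical),
# with three mixed conditions in between (assumed evenly spaced).
ANGLES_DEG = [0.0, 22.5, 45.0, 67.5, 90.0]

def cosine_similarity_matrix(angles_deg):
    """Cosine similarity between unit vectors at the given angles."""
    vectors = [
        (math.cos(math.radians(a)), math.sin(math.radians(a)))
        for a in angles_deg
    ]
    return [
        [ux * vx + uy * vy for (vx, vy) in vectors]
        for (ux, uy) in vectors
    ]

similarity = cosine_similarity_matrix(ANGLES_DEG)
```

      In this sketch every diagonal entry equals 1, the off-diagonal entries fall with angular distance, and the Simon-only versus Stroop-only entry is the smallest (cos 90° = 0). Because all diagonal cells share the same value, a regressor of this form cannot pick up differences among the within-conflict cells, which is the insensitivity noted above.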

      We recognized that the visualization in Fig. S4, based on the raw RSM (i.e., Pearson correlation), may have been influenced by regressors in our model other than the conflict similarity effect. To reflect pattern similarity with confounding factors controlled for, we have visualized the RSM by including only the fixed effect of the conflict similarity and the residual while excluding all other factors. As shown in the revised Figure S4, the difference between the within-Stroop and other diagonal cells was greatly reduced. Instead, it revealed a clear pattern in which the diagonal values were higher than the off-diagonal values in the incongruent condition, aligning with our hypothesis regarding the conflict similarity modulator. Although some visual distinctions persist within the five diagonal cells (e.g., in the incongruent condition, the Stroop, Simon, and StMSmM conditions appear slightly lower than StHSmL and StLSmM conditions), follow-up one-way ANOVAs among these five diagonal conditions showed no significant differences. This held true for both incongruent and congruent conditions, with Fs < 1. Thus, we conclude that there is no strong evidence supporting the notion that Simon- and spatial Stroop-only conditions are systematically different from other conflict types. As a result, we decided not to exclude these two conflict types from analysis.

      Author response image 2.

      The stronger conflict type similarity effect in incongruent versus congruent conditions. Shown are the summary representational similarity matrices for the right 8C region in incongruent (left) and congruent (right) conditions, respectively. Each cell represents the averaged Pearson correlation (after regressing out all factors except the conflict similarity) of cells with the same conflict type and congruency in the 1400×1400 matrix. Note that the seeming disparities in the values of within-conflict cells (i.e., the diagonal) did not reach significance for either incongruent or congruent trials, Fs < 1.

      Public Review:

      3) To orthogonalize their variables, the authors need to employ a complex linear mixed effects analysis, with a potential influence of implementation details (e.g., high-level interactions and inflated degrees of freedom).

      Suggestion:

      3) The DF for a mixed model should not be the number of observations minus the number of fixed effects. The gold standard is to use the Satterthwaite correction (e.g., in Matlab, fixedEffects(lme,'DFMethod','satterthwaite')), or number of subjects - number of fixed effects (i.e., you want to generalize to new subjects, not just new samples from the same subjects). Honestly, running a 4-way interaction is probably using more degrees of freedom than are appropriate given the number of subjects.

      Response: We concur with the reviewer’s comment that our previous estimation of degrees of freedom (DFs) was inaccurate. Following your suggestion, we have now applied the “Satterthwaite” approach to approximate the DFs for all our linear mixed effect model analyses. This adjustment has led to the correction of both DFs and p values. In the Methods section, we have mentioned this revision.

      “We adjusted the t and p values with the degrees of freedom calculated through the Satterthwaite approximation method (Satterthwaite, 1946). Of note, this approach was applied to all the mixed-effect model analyses in this study.”

      The application of this method has indeed resulted in a reduction of our statistical significance. However, our overall conclusions remained robust. Instead of the highly stringent threshold used in our previous version (Bonferroni corrected p < .0001), we have now adopted a relatively more lenient threshold of Bonferroni correction at p < 0.05, which is commonly employed in the literature. Furthermore, it is worth noting that the follow-up criteria 2 and 3 are inherently second-order analyses. Criterion 2 involves examining the interaction effect (conflict similarity effect difference between incongruent and congruent conditions), and criterion 3 involves individual correlation analyses. Due to their second-order nature, these criteria inherently have lower statistical power compared to criterion 1 (Blake & Gangestad, 2020). We thus have applied a more lenient but still typically acceptable false discovery rate (FDR) correction to criteria 2 and 3. This adjustment helps maintain the rigor of our analysis while considering the inherent differences in statistical power across the various criteria. We have mentioned this revision in our manuscript:

      “We next tested whether these regions were related to cognitive control by comparing the strength of conflict similarity effect between incongruent and congruent conditions (criterion 2) and correlating the strength to behavioral similarity modulation effect (criterion 3). Given these two criteria pertain to second-order analyses (interaction or individual analyses) and thus might have lower statistical power (Blake & Gangestad, 2020), we applied a more lenient threshold using false discovery rate (FDR) correction (Benjamini & Hochberg, 1995) on the above-mentioned regions.”
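      To make the difference between the two thresholds concrete, the sketch below contrasts a Bonferroni threshold with the Benjamini-Hochberg step-up procedure. It is a generic illustration rather than our analysis code: the function names are hypothetical and the p values are invented for the example.

```python
def bonferroni(p_values, alpha=0.05):
    """Reject a hypothesis only if its p value is at most alpha / m."""
    m = len(p_values)
    return [p <= alpha / m for p in p_values]


def benjamini_hochberg(p_values, alpha=0.05):
    """Benjamini-Hochberg step-up procedure controlling the false discovery rate."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    # Find the largest rank k such that p_(k) <= (k / m) * alpha.
    largest_k = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / m * alpha:
            largest_k = rank
    # Reject every hypothesis ranked at or below that k.
    rejected = [False] * m
    for idx in order[:largest_k]:
        rejected[idx] = True
    return rejected


# Hypothetical p values for illustration only (not results from this study).
example_p = [0.001, 0.012, 0.021, 0.035, 0.30]

print(bonferroni(example_p))          # [True, False, False, False, False]
print(benjamini_hochberg(example_p))  # [True, True, True, True, False]
```

      With these invented values, Bonferroni retains one test and FDR retains four, which illustrates why FDR is the more lenient choice for the lower-powered second-order criteria.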

      With these adjustments, we consistently identified similar brain regions as observed in our previous version. Specifically, we found that only the right 8C region met the three criteria in the conflict similarity analysis. In addition, the regions meeting the criteria for the orientation effect included the FEF and IP2 in left hemisphere, and V1, V2, POS1, and PF in the right hemisphere. We have thoroughly revised the description of our results, updated the figures and tables in both the revised manuscript and supplementary material to accurately reflect these outcomes.

      Reference:

      Blake, K. R., & Gangestad, S. (2020). On Attenuated Interactions, Measurement Error, and Statistical Power: Guidelines for Social and Personality Psychologists. Pers Soc Psychol Bull, 46(12), 1702-1711. https://doi.org/10.1177/0146167220913363

      Minor:

      1. Figure 8 should come much earlier (e.g, incorporated into Figure 1), and there should be consistent terms for 'cognitive map' and 'conflict similarity'.

      Response: We appreciate this suggestion. Considering that Figure 7 (“The cross-subject RSA model and the rationale”) also describes the models, we have merged Figures 7 and 8 and moved the new figure ahead, before we report the RSA results. It now appears as the new Figure 4 (see below). We did not incorporate them into Figure 1 since Figure 1 is already too crowded.

      Author response image 3.

      Fig. 4. Rationale of the cross-subject RSA model and the schematic of key RSMs. (A) The RSM is calculated as the Pearson’s correlation between each pair of conditions across the 35 subjects. For 17 subjects, the stimuli were displayed on the top-left and bottom-right quadrants, and they were asked to respond with left hand to the upward arrow and right hand to the downward arrow. For the other 18 subjects, the stimuli were displayed on the top-right and bottom-left quadrants, and they were asked to respond with left hand to the downward arrow and right hand to the upward arrow. Within each subject, the conflict type and orientation regressors were perfectly covaried. For instance, the same conflict type will always be on the same orientation. To de-correlate conflict type and orientation effects, we conducted the RSA across subjects from different groups. For example, the bottom-right panel highlights the example conditions that are orthogonal to each other on the orientation, response, and Simon distractor, whereas their conflict type, target and spatial Stroop distractor are the same. The dashed boxes show the possible target locations for different conditions. (B) and (C) show the orthogonality between conflict similarity and orientation RSMs. The within-subject RSMs (e.g., Group1-Group1) for conflict similarity and orientation are all the same, but the cross-group correlations (e.g., Group2-Group1) are different. Therefore, we can separate the contribution of these two effects when including them as different regressors in the same linear regression model. (D) and (E) show the two alternative models. Like the cosine model (B), within-group trial pairs resemble between-group trial pairs in these two models. The domain-specific model is an identity matrix. The domain-general model is estimated from the absolute difference of behavioral congruency effect, but scaled to 0 (lowest similarity) – 1 (highest similarity) to aid comparison. The plotted matrices in B-E include only one subject each from Group 1 and Group 2. Numbers 1-5 indicate the conflict type conditions, for spatial Stroop, StHSmL, StMSmM, StLSmH, and Simon, respectively. The thin lines separate four different sub-conditions, i.e., target arrow (up, down) × congruency (incongruent, congruent), within each conflict type.

      In our manuscript, the term “cognitive map/space” was used when explaining the results in a theoretical perspective, whereas the “conflict similarity” was used to describe the regressor within the RSA. These terms serve distinct purposes in our study and cannot be interchangeably substituted. Therefore, we have retained them in their current format. However, we recognize that the initial introduction of the “Cognitive-Space model” may have appeared somewhat abrupt. To address this, we have included a brief explanatory note: “The model described above employs the cosine similarity measure to define conflict similarity and will be referred to as the Cognitive-Space model.”

    2. Author Response

      The following is the authors’ response to the previous reviews.

      Thank you and the reviewers for further providing constructive comments and suggestions on our manuscript. On behalf of all the co-authors, I have enclosed a revised version of the above referenced paper. Below, I have merged similar public reviews and recommendations (if applicable) from each reviewer and provided point-by-point responses.

      Reviewer #1:

      People can perform a wide variety of different tasks, and a long-standing question in cognitive neuroscience is how the properties of different tasks are represented in the brain. The authors develop an interesting task that mixes two different sources of difficulty, and find that the brain appears to represent this mixture on a continuum, in the prefrontal areas involved in resolving task difficulty. While these results are interesting and in several ways compelling, they overlap with previous findings and rely on novel statistical analyses that may require further validation.

      Strengths

      1. The authors present an interesting and novel task for combining the contributions of stimulus-stimulus and stimulus-response conflict. While this mixture has been measured in the multi-source interference task (MSIT), this task provides a more graded mixture between these two sources of difficulty.

      2. The authors do a good job triangulating regions that encode conflict similarity, looking for the conjunction across several different measures of conflict encoding. These conflict measures use several best-practice approaches towards estimating representational similarity.

      3. The authors quantify several salient alternative hypotheses and systematically distinguish their core results from these alternatives.

      4. The question that the authors tackle is important to cognitive control, and they make a solid contribution.

      The authors have addressed several of my concerns. I appreciate the authors implementing best practices in their neuroimaging stats.

      I think that the concerns that remain in my public review reflect the inherent limitations of the current work. The authors have done a good job working with the dataset they've collected.

      Response: We would like to thank the reviewer for the positive evaluation of our manuscript and the constructive comments and suggestions. In response to your suggestions and concerns, we have removed the Stroop/Simon-only and the Stroop+Simon models, revised our conclusion and modified the misleading phrases.

      We have provided detailed responses to your comments below.

      1. The evidence from this previous work for mixtures between different conflict sources makes the framing of 'infinite possible types of conflict' feel like a strawman. The authors cite classic work (e.g., Kornblum et al., 1990) that develops a typology for conflict which is far from infinite. I think few people would argue that every possible source and level of difficulty will have to be learned separately. This work provides confirmatory evidence that task difficulty is represented parametrically (e.g., consistent with the n-back, MOT, and random dot motion literature).

      notes for my public concerns.

      In their response, the authors say:

      'If each combination of the Stroop-Simon combination is regarded as a conflict condition, there would be infinite combinations, and it is our major goal to investigate how these infinite conflict conditions are represented effectively in a space with finite dimensions.'

      I do think that this is a strawman. The paper doesn't make a strong case that this position ('infinite combinations') is widely held in the field. There is previous work (e.g., n-back, multiple object tracking, MSIT, dot motion) that has already shown parametric encoding of task difficulty. This paper provides confirmatory evidence, using an interesting new task, that demands are parametric, but does not provide a major theoretical advance.

      Response: We agree that the previous expression may have seemed somewhat exaggerated. While the number is not “infinite”, recent research indeed suggests that cognitive control shows domain-specificity across various “domains”, including conflict types (Egner, 2008), sensory modalities (Yang et al., 2017), task-irrelevant stimuli (Spapé & Hommel, 2008), and task sets (Hazeltine et al., 2011), to name a few.

      These findings collectively support the notion that cognitive control is context-specific (Braem et al., 2014). That is, cognitive control can be tuned and associated with different (and potentially large numbers of) contexts. Recently, Kikumoto and Mayr (2020) demonstrated that combinations of stimulus, rule and response in the same task formed separable, conjunctive representations. They further showed that these conjunctive representations facilitate performance. This is in line with the idea that each stimulus-location combination in the present task may be represented separately in a domain-specific manner. Moreover, domain-general task representation can also become domain-specific with learning, which further increases the number of domain-specific conjunctive representations (Mill & Cole, 2023). In line with the domain-specific account of cognitive control, we referred to the “infinite combinations” in our previous response to emphasize the extreme case of domain-specificity. However, recognizing that the term “infinite” may lead to ambiguity, we have replaced it with phrases such as “a large number of” and “hugely varied” in our revised manuscript.

      We appreciate the reviewer for highlighting the potential connection of our work to existing literature that showed the parametric encoding of task difficulty (e.g., Dagher et al., 1999; Ritz & Shenhav, 2023). For instance, in Ritz and Shenhav’s (2023) study, they parametrically manipulated target difficulty based on consistent ratios of dot color, and found that the difficulty was encoded in the caudal part of dorsal anterior cingulate cortex. Analogously, in our study, the “difficulty” pertains to the behavioral congruency effect that we modulated within the spatial Stroop and Simon dimensions. Notably, we did identify univariate effects in the right dmPFC and IPS associated with the difficulty in the Simon dimension. This parametric effect may lend support to our cognitive space hypothesis, although we exercised caution in interpreting their significance due to the absence of a clear brain-behavioral relevance in these regions. We have added the connection of our work to prior literature in the discussion. The parametric encoding of conflict also mirrors prior research showing the parametric encoding of task demands (Dagher et al., 1999; Ritz & Shenhav, 2023).

      However, our analyses extend beyond solely testing the parametric encoding of difficulty. Instead, we focused on the multivariate representation of different conflict types, which we believe is independent from the univariate parametric encoding. Unlike the univariate encoding that relies on the strength within one dimension, the multivariate representation of conflict types incorporates both the spatial Stroop and Simon dimensions. Furthermore, we found that similar difficulty levels did not yield similar conflict representation, as indicated by the low similarity between the spatial Stroop and Simon conditions, despite both showing a similar level of congruency effect (Fig. S1). Additionally, we also observed an interaction between conflict similarity and difficulty (i.e., congruency, Fig. 4B/D), such that the conflict similarity effect was more pronounced when conflict was present. Therefore, we believe that our findings make a contribution to the literature beyond the difficulty effect.

      Reference:

      Egner, T. (2008). Multiple conflict-driven control mechanisms in the human brain. Trends in Cognitive Sciences, 12(10), 374-380. https://doi.org/10.1016/j.tics.2008.07.001

      Yang, G., Nan, W., Zheng, Y., Wu, H., Li, Q., & Liu, X. (2017). Distinct cognitive control mechanisms as revealed by modality-specific conflict adaptation effects. Journal of Experimental Psychology: Human Perception and Performance, 43(4), 807-818. https://doi.org/10.1037/xhp0000351

      Spapé MM, Hommel B (2008). He said, she said: episodic retrieval induces conflict adaptation in an auditory Stroop task. Psychonomic Bulletin Review,15(6):1117-21. https://doi.org/10.3758/PBR.15.6.1117

      Hazeltine E, Lightman E, Schwarb H, Schumacher EH (2011). The boundaries of sequential modulations: evidence for set-level control. Journal of Experimental Psychology: Human Perception & Performance. 2011 Dec;37(6):1898-914. https://doi.org/10.1037/a0024662

      Braem, S., Abrahamse, E. L., Duthoo, W., & Notebaert, W. (2014). What determines the specificity of conflict adaptation? A review, critical analysis, and proposed synthesis. Frontiers in Psychology, 5, 1134. https://doi.org/10.3389/fpsyg.2014.01134

      Kikumoto A, Mayr U. (2020). Conjunctive representations that integrate stimuli, responses, and rules are critical for action selection. Proceedings of the National Academy of Sciences, 117(19):10603-10608. https://doi.org/10.1073/pnas.1922166117.

      Mill, R. D., & Cole, M. W. (2023). Neural representation dynamics reveal computational principles of cognitive task learning. bioRxiv. https://doi.org/10.1101/2023.06.27.546751

      Dagher, A., Owen, A. M., Boecker, H., & Brooks, D. J. (1999). Mapping the network for planning: a correlational PET activation study with the Tower of London task. Brain, 122 ( Pt 10), 1973-1987. https://doi.org/10.1093/brain/122.10.1973

      Ritz, H., & Shenhav, A. (2023). Orthogonal neural encoding of targets and distractors supports multivariate cognitive control. https://doi.org/10.1101/2022.12.01.518771

      1. (Public Reviews) The degree of Stroop vs Simon conflict is perfectly negatively correlated across conditions. This limits their interpretation of an integrated cognitive space, as they cannot separately measure Stroop and Simon effects. The authors' control analyses have limited ability to overcome this task limitation. While these results are consistent with parametric encoding, they cannot adjudicate between combined vs separated representations.

      (Recommendations) I think that it is still an issue that the task's two features (stroop and simon conflict) are perfectly correlated. This fundamentally limits their ability to measure the similarity in these features. The authors provide several control analyses, but I think these are limited.

      Response: We need to acknowledge that the spatial Stroop and Simon components in the five conflict conditions were not “perfectly” correlated, with r = –0.89. This leaves some room for the preliminary model comparison to adjudicate between these models. However, it’s essential to note that conclusions based on these results must be tempered. In line with the reviewer’s observation, we agree that the high correlation between the two conflict sources posed a potential limitation on our ability to independently investigate the contribution of spatial Stroop and Simon conflicts. Therefore, in addition to the limitation we have previously acknowledged, we have now further revised our conclusion and adjusted our expressions accordingly.
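      The size of this correlation can be checked with a short calculation. The sketch below assumes that the spatial Stroop and Simon components of the five conflict types are the cosine and sine of evenly spaced angles from 0° to 90°; that parameterization is an assumption made for this illustration, and the helper `pearson_r` is hypothetical.

```python
import math


def pearson_r(x, y):
    """Pearson correlation between two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


# Assumption: conflict types sit at evenly spaced angles on a quarter circle.
angles_deg = [0.0, 22.5, 45.0, 67.5, 90.0]
stroop_component = [math.cos(math.radians(a)) for a in angles_deg]
simon_component = [math.sin(math.radians(a)) for a in angles_deg]

r = pearson_r(stroop_component, simon_component)
print(round(r, 2))  # -0.89
```

      Under that assumption the correlation is strongly negative but not exactly –1, because the two components trade off along an arc rather than along a straight line.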

      Specifically, we now regard the parametric encoding of cognitive control not as direct evidence of the cognitive space view but as preliminary evidence that led us to propose this hypothesis, which requires further testing. Notably, we have also modified the title from “Conflicts are represented in a cognitive space to reconcile domain-general and domain-specific cognitive control” to “Conflicts are parametrically encoded: initial evidence for a cognitive space view to reconcile the debate of domain-general and domain-specific cognitive control”. Also, we revised the conclusion as: In sum, we showed that cognitive control can be parametrically encoded in the right dlPFC and guides cognitive control to adjust goal-directed behavior. This finding suggests that different cognitive control states may be encoded in an abstract cognitive space, which reconciles the long-standing debate between the domain-general and domain-specific views of cognitive control and provides a parsimonious and more broadly applicable framework for understanding how our brains efficiently and flexibly represent multiple task settings.

      (Recommendations) The authors perform control analyses that test Stroop-only and Simon-only models. However, these analyses use a totally different similarity metric, that's based on set intersection rather than geometry. This metric had limited justification or explanation, and it's not clear whether these models fit worse because of the similarity metric. Even here, the Simon-only model fit better than the Stroop+Simon model. The dimensionality analyses may reflect the 1d manipulation by the authors (i.e., perfectly correlated Stroop and Simon effects).

      Response: The Jaccard measure is the most suitable method we can conceive of for assessing the similarity between two conflicts when establishing the Stroop-only and Simon-only models, achieved by projecting them onto the vertical or horizontal axes, respectively (Author response image 1A). This approach offers two advantages. First, the Jaccard similarity combines both similarity (as reflected by the numerator) and distance (reflected by the difference between denominator and numerator) without bias towards either. Second, the Jaccard similarity in our design is equivalent to the cosine similarity because the denominator in the cosine similarity is identical to the denominator in the Jaccard similarity (both are the radius of the circle, Author response image 1B).

      Author response image 1.

      Definition of Jaccard similarity. A) Two conflicts (1 and 2) are projected onto the spatial Stroop/Simon axis in the Stroop/Simon-only model, respectively. The Jaccard similarity for Stroop-only and Simon-only model are and respectively. Letters a-d are the projected vectors from the two conflicts to the two axes. Blue and red colors indicate the conflict conditions. Shorter vectors are the intersection and longer vectors are the union. B) According to the cosine similarity model, the similarity is defined as , where e is the projected vector from conflict 1 to conflict 2, and g is the vector of conflict 1. The Jaccard similarity for this case is defined by , where f is the projector vector from conflict 2 to itself. Because f = g in our design, the Jaccard similarity is equivalent to the cosine similarity.
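      The relationship described in this caption can be illustrated numerically. The sketch below is an assumption-laden reading of the definition, not the original analysis code: it treats the conflict types as unit vectors at given angles, takes the Jaccard ratio as the shorter projection divided by the longer one, and uses hypothetical function names throughout.

```python
import math


def jaccard_1d(a, b):
    """Jaccard-style ratio of two projection lengths: shorter (intersection) over longer (union)."""
    shorter, longer = sorted((abs(a), abs(b)))
    if longer < 1e-12:
        return float("nan")  # both projections vanish, so the ratio is undefined
    return shorter / longer


def stroop_only_similarity(theta1_deg, theta2_deg):
    """Project both conflicts onto the spatial Stroop axis, then take the Jaccard ratio."""
    return jaccard_1d(math.cos(math.radians(theta1_deg)),
                      math.cos(math.radians(theta2_deg)))


def simon_only_similarity(theta1_deg, theta2_deg):
    """Project both conflicts onto the Simon axis, then take the Jaccard ratio."""
    return jaccard_1d(math.sin(math.radians(theta1_deg)),
                      math.sin(math.radians(theta2_deg)))


def cosine_similarity(theta1_deg, theta2_deg):
    """Cosine of the angle between two conflict vectors."""
    return math.cos(math.radians(theta1_deg - theta2_deg))


def jaccard_along_conflict2(theta1_deg, theta2_deg, radius=1.0):
    """Jaccard ratio when both conflicts are projected onto the direction of conflict 2."""
    e = radius * math.cos(math.radians(theta1_deg - theta2_deg))  # conflict 1 onto conflict 2
    f = radius                                                    # conflict 2 onto itself
    return e / f


print(jaccard_along_conflict2(22.5, 45.0), cosine_similarity(22.5, 45.0))
print(stroop_only_similarity(22.5, 45.0), simon_only_similarity(22.5, 45.0))
```

      When both vectors share the same length, projecting onto the direction of conflict 2 gives the same value as the cosine similarity, whereas projecting onto a single axis generally does not.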

      Therefore, we believe that the model comparisons between cosine similarity model and the Stroop/Simon-Only models were equitable. However, we acknowledge the reviewer’s and other reviewers’ concerns about the correlation between spatial Stroop and Simon conflicts, which reduces the space to one dimension (1d) and limits our ability to distinguish between the Stroop-only and Simon-only models, as well as between Stroop+Simon and cosine similarity models. While these distinctions are undoubtedly important for understanding the geometry of the cognitive space, we recognize that they go beyond the major objective of this study, that is, to differentiate the cosine similarity model from domain-general/specific models. Therefore, we have chosen to exclude the Stroop-only, Simon-only and Stroop+Simon models in our revised manuscript.

      Something that raised additional concerns are the RSMs in the key region of interest (Fig S5). The pure stroop task appears to be represented very differently from all of the conditions that include simon conflict.

      Together, I think these limitations reflect the structure of the task and research goals, not the statistical approach (which has been meaningfully improved).

      Response: We appreciate the reviewer for pointing this out. It is essential to clarify that our conclusions were based on the significant similarity modulation effect identified in our statistical analysis using the cosine similarity model, where we did not distinguish between the within-Stroop condition and the other four within-conflict conditions (Fig. 7A, now Fig. 8A). This means that the representation of conflict type was not biased by the seeming disparities in the values shown here. Moreover, to specifically test the differences between the within-Stroop condition and the other within-conflict conditions, we conducted a mixed-effect model analysis only including trial pairs from the same conflict type. In this analysis, the primary predictor was the cross-condition difference (0 for within-Stroop condition and 1 for other within-conflict conditions). The results showed no significant cross-condition difference in either the incongruent (t = 1.22, p = .23) or the congruent (t = 1.06, p = .29) trials. Thus, we believe the evidence for different similarities is inconclusive in our data and decided not to interpret this numerical difference. We have added this note in the revised figure caption for Figure S5.

      Author response image 2.

      Fig. S5. The stronger conflict type similarity effect in incongruent versus congruent conditions. (A) Summary representational similarity matrices for the right 8C region in incongruent (left) and congruent (right) conditions, respectively. Each cell represents the averaged Pearson correlation of cells with the same conflict type and congruency in the 1400×1400 matrix. Note that the seeming disparities in the values of Stroop and other within-conflict cells (i.e., the diagonal) did not reach significance for either incongruent (t = 1.22, p = .23) or congruent (t = 1.06, p = .29) trials. (B) Scatter plot showing the averaged neural similarity (Pearson correlation) as a function of conflict type similarity in both conditions. The values in both A and B are calculated from raw Pearson correlation values, in contrast to the z-scored values in Fig. 4D.

      Minor:

      • In the analysis of similarity_orientation, the df is very large (~14000). Here, and throughout, the df should be reflective of the population of subjects (ie be less than the sample size).

      Response: The large degrees of freedom (df) in our analysis stem from the fact that we utilized a mixed-effect linear model, incorporating all data points (a total of 400×35=14000). In mixed-effect models, the df is determined by subtracting the number of fixed effects (in our case, 7) from the total number of observations. Notably, we are in line with prior studies that have reported the df in this manner (e.g., Iravani et al., 2021; Schmidt & Weissman, 2015; Natraj et al., 2022).

      Reference:

      Iravani B, Schaefer M, Wilson DA, Arshamian A, Lundström JN. The human olfactory bulb processes odor valence representation and cues motor avoidance behavior. Proc Natl Acad Sci U S A. 2021 Oct 19;118(42):e2101209118. https://doi.org/10.1073/pnas.2101209118.

      Schmidt, J.R., Weissman, D.H. Congruency sequence effects and previous response times: conflict adaptation or temporal learning?. Psychological Research 80, 590–607 (2016). https://doi.org/10.1007/s00426-015-0681-x.

      Natraj, N., Silversmith, D. B., Chang, E. F., & Ganguly, K. (2022). Compartmentalized dynamics within a common multi-area mesoscale manifold represent a repertoire of human hand movements. Neuron, 110(1), 154-174. https://doi.org/10.1016/j.neuron.2021.10.002.
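      The gap between the two ways of counting df discussed in this exchange can be made explicit with the numbers given above. This is plain arithmetic for illustration; the variable and function names are hypothetical, and the subject-level count follows the reviewer's suggestion of number of subjects minus number of fixed effects.

```python
# Numbers taken from the response above.
N_SUBJECTS = 35
DATA_POINTS_PER_SUBJECT = 400
N_FIXED_EFFECTS = 7


def residual_df(n_units, n_fixed_effects):
    """Degrees of freedom as number of units minus number of fixed effects."""
    return n_units - n_fixed_effects


n_observations = N_SUBJECTS * DATA_POINTS_PER_SUBJECT

# df when every data point counts as an independent unit.
observation_level_df = residual_df(n_observations, N_FIXED_EFFECTS)
# df when the subject is the unit of generalization.
subject_level_df = residual_df(N_SUBJECTS, N_FIXED_EFFECTS)

print(observation_level_df)  # 13993
print(subject_level_df)      # 28
```

      The observation-level count treats repeated measurements from the same subject as independent, which is the source of the reviewer's concern.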

      • it would improve the readability if there was more didactic justification for why analyses are done a certain way (eg justifying the jaccard metric). This will help less technically-savvy readers.

      Response: We appreciate the reviewer’s suggestion. However, considering the Stroop/Simon-only models in our design may not be a valid approach for distinguishing the contributions of the Stroop/Simon components, we have decided not to include the Jaccard metrics in our revised manuscript.

      Besides, to improve the readability, we have moved Figure S4 to the main text (labeled as Figure 7), and added the domain-general/domain-specific schematics in Figure 8.

      Author response image 3.

      Figure 8. Schematic of key RSMs. (A) and (B) show the orthogonality between conflict similarity and orientation RSMs. The within-subject RSMs (e.g., Group1-Group1) for conflict similarity and orientation are all the same, but the cross-group correlations (e.g., Group2-Group1) are different. Therefore, we can separate the contribution of these two effects when including them as different regressors in the same linear regression model. (C) and (D) show the two alternative models. Like the cosine model (A), within-group trial pairs resemble between-group trial pairs in these two models. The domain-specific model is an identity matrix. The domain-general model is estimated from the absolute difference of behavioral congruency effect, but scaled to 0 (lowest similarity) – 1 (highest similarity) to aid comparison. The plotted matrices here include only one subject each from Group 1 and Group 2. Numbers 1-5 indicate the conflict type conditions, for spatial Stroop, StHSmL, StMSmM, StLSmH, and Simon, respectively. The thin lines separate four different sub-conditions, i.e., target arrow (up, down) × congruency (incongruent, congruent), within each conflict type.

      Reviewer #2:

      This study examines the construct of "cognitive spaces" as they relate to neural coding schemes present in response conflict tasks. The authors use a novel experimental design in which different types of response conflict (spatial Stroop, Simon) are parametrically manipulated. These conflict types are hypothesized to be encoded jointly, within an abstract "cognitive space", in which distances between task conditions depend only on the similarity of conflict types (i.e., where conditions with similar relative proportions of spatial-Stroop versus Simon conflicts are represented with similar activity patterns). Authors contrast such a representational scheme for conflict with several other conceptually distinct schemes, including a domain-general, domain-specific, and two task-specific schemes. The authors conduct a behavioral and fMRI study to test which of these coding schemes is used by prefrontal cortex. Replicating the authors' prior work, this study demonstrates that sequential behavioral adjustments (the congruency sequence effect) are modulated as a function of the similarity between conflict types. In fMRI data, univariate analyses identified activation in left prefrontal and dorsomedial frontal cortex that was modulated by the amount of Stroop or Simon conflict present, and representational similarity analyses (RSA) that identified coding of conflict similarity, as predicted under the cognitive space model, in right lateral prefrontal cortex.

      This study tackles an important question regarding how distinct types of conflict might be encoded in the brain within a computationally efficient representational format. The ideas postulated by the authors are interesting ones and the statistical methods are generally rigorous.

      Response: We would like to express our sincere appreciation for the reviewer’s positive evaluation of our manuscript and the constructive comments and suggestions. In response to your suggestions and concerns, we excluded the StroopOnly, SimonOnly and Stroop+Simon models, and added the schematic of domain-general/specific model RSMs. We have provided detailed responses to your comments below.

      The evidence supporting the authors' claims, however, is limited by confounds in the experimental design and by a lack of clarity in reporting the testing of alternative hypotheses within the methods and results.

      1. Model comparison

      The authors commendably performed a model comparison within their study, in which they formalized alternative hypotheses to their cognitive space hypothesis. We greatly appreciate the motivation for this idea and think that it strengthened the manuscript. Nevertheless, some details of this model comparison were difficult for us to understand, which in turn has limited our understanding of the strength of the findings.

      The text indicates the domain-general model was computed by taking the difference in congruency effects per conflict condition. Does this refer to the "absolute difference" between congruency effects? In the rest of this review, we assume that the absolute difference was indeed used, as using a signed difference would not make sense in this setting. Nevertheless, it may help readers to add this information to the text.

      Response: We apologize for any confusion. The “difference” here indeed refers to the “absolute difference” between congruency effects. We have now clarified this by adding the word “absolute” accordingly.

      "Therefore, we defined the domain-general matrix as the absolute difference in their congruency effects indexed by the group-averaged RT in Experiment 2."

      Regarding the Stroop-Only and Simon-Only models, the motivation for using the Jaccard metric was unclear. From our reading, it seems that all of the other models --- the cognitive space model, the domain-general model, and the domain-specific model --- effectively use a Euclidean distance metric. (Although the cognitive space model is parameterized with cosine similarities, these similarity values are proportional to Euclidean distances because the points all lie on a circle. And, although the domain-general model is parameterized with absolute differences, the absolute difference is equivalent to Euclidean distance in 1D.) Given these considerations, the use of Jaccard seems to differ from the other models, in terms of parameterization, and thus potentially also in terms of underlying assumptions. Could authors help us understand why this distance metric was used instead of Euclidean distance? Additionally, if Jaccard must be used because this metric seems to be non-standard in the use of RSA, it would likely be helpful for many readers to give a little more explanation about how it was calculated.
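
      The geometric relationship mentioned in this comment can be checked numerically. For points on a circle of radius r, the squared Euclidean distance between two points equals 2r²(1 − cos Δθ), so cosine similarity is a decreasing linear function of squared distance (a monotonic relationship rather than strict proportionality). The sketch below is only an illustration: the unit radius and the five evenly spaced polar angles are assumptions, not values taken from the manuscript.

```python
import math

# Assumed layout: five conflict types at evenly spaced polar angles on a unit circle.
angles_deg = [0.0, 22.5, 45.0, 67.5, 90.0]
points = [(math.cos(math.radians(a)), math.sin(math.radians(a))) for a in angles_deg]


def cosine_similarity(p, q):
    """Cosine of the angle between two 2-D vectors."""
    dot = p[0] * q[0] + p[1] * q[1]
    return dot / (math.hypot(*p) * math.hypot(*q))


def squared_distance(p, q):
    """Squared Euclidean distance between two 2-D points."""
    return (p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2


# Largest deviation between d^2 and 2 * (1 - cos) over all condition pairs.
max_gap = 0.0
for p in points:
    for q in points:
        predicted = 2.0 * (1.0 - cosine_similarity(p, q))
        max_gap = max(max_gap, abs(squared_distance(p, q) - predicted))

print(f"largest deviation: {max_gap:.2e}")
```

      Under these assumptions the deviation is at the level of floating-point error, which is why a model parameterized with cosine similarity and one parameterized with Euclidean distance order the condition pairs identically.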

      Response: We believe that the Jaccard similarity measure is consistent with the Cosine similarity measure. The Jaccard similarity is calculated as the intersection divided by the union. To define the similarity of two conflicts in the Stroop-only and Simon-only models, we first project them onto the vertical or horizontal axes, respectively (as shown in Author response image 1A). The Jaccard similarity in our design is equivalent to the cosine similarity because the denominator in the Jaccard similarity is identical to the denominator in the cosine similarity (both are the radius of the circle, Author response image 1B).

      However, it is important to note that a cosine similarity cannot be defined when conflicts are projected onto spatial Stroop or Simon axis simultaneously. Therefore, we used the Jaccard similarity in the previous version of our manuscript.

      Author response image 4.

      Definition of Jaccard similarity. A) Two conflicts (1 and 2) are projected onto the spatial Stroop/Simon axis in the Stroop/Simon-only model, respectively. In each model, the Jaccard similarity is the ratio of the shorter projected vector (the intersection) to the longer projected vector (the union). Letters a-d are the projected vectors from the two conflicts to the two axes. Blue and red colors indicate the conflict conditions. B) According to the cosine similarity model, the similarity is defined as |e|/|g|, where e is the projected vector from conflict 1 to conflict 2, and g is the vector of conflict 1. The Jaccard similarity for this case is defined by |e|/|f|, where f is the projected vector from conflict 2 to itself. Because f = g in our design, the Jaccard similarity is equivalent to the cosine similarity.

      However, we agree with the concern, raised by this and other reviewers, that the correlation between spatial Stroop and Simon conflicts makes it difficult to distinguish the Stroop+Simon model from the cosine similarity model. While distinguishing them is essential for understanding the detailed geometry of the cognitive space, it is beyond our main purpose, which is to distinguish the cosine similarity model from the domain-general/specific models. Therefore, we have chosen to exclude the Stroop-only, Simon-only and Stroop+Simon models from our revised manuscript.

      When considering parameterizing the Stroop-Only and Simon-Only models with Euclidean distances, one concern we had is that the joint inclusion of these models might render the cognitive space model unidentifiable due to collinearity (i.e., the sum of the Stroop-Only and Simon-Only models could be collinear with the cognitive space model). Could the authors determine whether this is the case? This issue seems to be important, as the presence of such collinearity would suggest to us that the design is incapable of discriminating those hypotheses as parameterized.

      Response: We acknowledge that our design does not allow for a complete differentiation between the parallel encoding (StroopOnly+SimonOnly) model and the cognitive space model, given their high correlation (r = 0.85). However, it is important to note that the StroopOnly+SimonOnly model introduces more free parameters, making the model fitting poorer than the cognitive space model.

      Additionally, the cognitive space model also shows high correlations with the StroopOnly and SimonOnly models (both rs = 0.66). It is crucial to emphasize that our study’s primary goal does not involve testing the parallel encoding hypothesis (through the StroopOnly+SimonOnly model). As a result, we have chosen to remove the model comparison results with the StroopOnly, SimonOnly and StroopOnly+SimonOnly models. By contrast, the cognitive space model shows lower correlations with the purely domain-general (r = −0.16) and domain-specific (r = 0.46) models.
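
      Correlations between candidate model matrices, such as those reported above, can be obtained by vectorizing each model RSM and correlating the resulting vectors. The following is a minimal sketch on a toy 5×5 condition-level matrix; the polar angles and the decision to include the diagonal (same-condition pairs, which occur off the diagonal in a trial-level RSM) are illustrative assumptions. It is not expected to reproduce the reported values, which come from the full 1400×1400 design.

```python
import math


def pearson(x, y):
    """Pearson correlation between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def lower_triangle(matrix):
    """Vectorize the lower triangle of a square matrix, diagonal included."""
    return [matrix[i][j] for i in range(len(matrix)) for j in range(i + 1)]


# Assumed polar angles (degrees) for the five conflict types.
angles = [math.radians(a) for a in (0.0, 22.5, 45.0, 67.5, 90.0)]
n = len(angles)

# Cognitive space (cosine) model: similarity is the cosine of the angular difference.
cosine_model = [[math.cos(angles[i] - angles[j]) for j in range(n)] for i in range(n)]

# Domain-specific model: identity matrix (only same-type pairs are similar).
domain_specific = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]

r = pearson(lower_triangle(cosine_model), lower_triangle(domain_specific))
print(f"toy correlation between cosine and domain-specific models: {r:.2f}")
```

      A high correlation between two model vectors signals that the design has limited power to tell those hypotheses apart, which is the collinearity concern raised by the reviewer.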

      1. Issue of uniquely identifying conflict coding

      We certainly appreciate the efforts that authors have taken to address potential confounders for encoding of conflict in their original submission. We broach this question not because we wish authors to conduct additional control analyses, but because this issue seems to be central to the thesis of the manuscript and we would value reading the authors' thoughts on this issue in the discussion.

      To summarize our concerns, conflict seems to be a difficult variable to isolate within aggregate neural activity, at least relative to other variables typically studied in cognitive control, such as task-set or rule coding. This is because it seems reasonable to expect that many more nuisance factors covary with conflict -- such as univariate activation, level of cortical recruitment, performance measures, arousal --- than in comparison with, for example, a well-designed rule manipulation. Controlling for some of these factors post-hoc through regression is commendable (as authors have done here), but such a method will likely be incomplete and can provide no guarantees on the false positive rate.

      Relatedly, the neural correlates of conflict coding in fMRI and other aggregate measures of neural activity are likely of heterogeneous provenance, potentially including rate coding (Fu et al., 2022), temporal coding (Smith et al., 2019), modulation of coding of other more concrete variables (Ebitz et al., 2020, 10.1101/2020.03.14.991745; see also discussion and reviews of Tang et al., 2016, 10.7554/eLife.12352), or neuromodulatory effects (e.g., Aston-Jones & Cohen, 2005). Some of these origins would seem to be consistent with "explicit" coding of conflict (conflict as a representation), but others would seem to be more consistent with epiphenomenal coding of conflict (i.e., conflict as an emergent process). Again, these concerns could apply to many variables as measured via fMRI, but at the same time, they seem to be more pernicious in the case of conflict. So, if authors consider these issues to be germane, perhaps they could explicitly state in the discussion whether adopting their cognitive space perspective implies a particular stance on these issues, how they interpret their results with respect to these issues, and if relevant, qualify their conclusions with uncertainty on these issues.

      Response: We appreciate the reviewer’s insightful comments regarding the representation and process of conflict.

      First, we agree that the conflict is not simply a pure feature like a stimulus but often arises from the interaction (e.g., dimension overlap) between two or more aspects. For example, in the manual Stroop, conflict emerges from the inconsistent semantic information between color naming and word reading. Similarly, other higher-order cognitive processes such as task-set also underlie the relationship between concrete aspects. For instance, in a face/house categorization task, the task-set is the association between face/house and the responses. When studying these higher-order processes, it is often impossible to completely isolate them from bottom-up features. Therefore, methods like the representational similarity analysis and regression models are among the limited tools available to attempt to dissociate these concrete factors from conflict representation. While not perfect, this approach has been suggested and utilized in practice (Freund et al., 2021).

      Second, we agree that conflict can be both a representation and an emerging process. These two perspectives are not necessarily contradictory. According to David Marr’s influential three-level theory (Marr, 1982), representation is the algorithm of the process to achieve a goal based on the input. Therefore, a representation can refer to not only a static stimulus (e.g., the visual representation of an image), but also a dynamic process. Building on this perspective, we posit that the representation of cognitive control consists of an array of dynamic representations embedded within the overall process. A similar idea has been proposed that the abstract task profiles can be progressively constructed as a representation in our brain (Kikumoto & Mayr, 2020).

      We have incorporated this discussion into the manuscript:

      "Recently an interesting debate has arisen concerning whether cognitive control should be considered as a process or a representation (Freund, Etzel, et al., 2021). Traditionally, cognitive control has been predominantly viewed as a process. However, the study of its representation has gained more and more attention. While it may not be as straightforward as the visual representation (e.g., creating a mental image from a real image in the visual area), cognitive control can have its own form of representation. An influential theory, Marr’s (1982) three-level model proposed that representation serves as the algorithm of the process to achieve a goal based on the input. In other words, representation can encompass a dynamic process rather than being limited to static stimuli. Building on this perspective, we posit that the representation of cognitive control consists of an array of dynamic representations embedded within the overall process. A similar idea has been proposed that the representation of task profiles can be progressively constructed with time in the brain (Kikumoto & Mayr, 2020)."

      Reference:

      Freund, M. C., Etzel, J. A., & Braver, T. S. (2021). Neural Coding of Cognitive Control: The Representational Similarity Analysis Approach. Trends in Cognitive Sciences, 25(7), 622-638. https://doi.org/10.1016/j.tics.2021.03.011

      Marr, D. C. (1982). Vision: A computational investigation into human representation and information processing. New York: W.H. Freeman.

      Kikumoto A, Mayr U. (2020). Conjunctive representations that integrate stimuli, responses, and rules are critical for action selection. Proceedings of the National Academy of Sciences, 117(19):10603-10608. https://doi.org/10.1073/pnas.1922166117.

      1. Interpretation of measured geometry in 8C

      We appreciate the inclusion of the measured similarity matrices of area 8C, the key area the results focus on, to the supplemental, as this allows for a relatively model-agnostic look at a portion of the data. Interestingly, the measured similarity matrix seems to mismatch the cognitive space model in a potentially substantive way. Although the model predicts that the "pure" Stroop and Simon conditions will have maximal self-similarity (i.e., the Stroop-Stroop and Simon-Simon cells on the diagonal), these correlations actually seem to be the lowest, by what appears to be a substantial margin (particularly the Stroop-Stroop similarities). What should readers make of this apparent mismatch? Perhaps authors could offer their interpretation on how this mismatch could fit with their conclusions.

      Response: We thank the reviewer for bringing this to our attention. It is essential to clarify that our conclusions were based on the significant similarity modulation effect observed in our statistical analysis using the cosine similarity model, where we did not distinguish between the within-Stroop condition and the other four within-conflict conditions (Fig. 7A). This means that the representation of conflict type was not biased by the apparent disparities in the values shown here. Moreover, to specifically address the potential differences between the within-Stroop condition and the other within-conflict conditions, we fitted a mixed-effect model. In this analysis, the primary predictor was the cross-condition difference (0 for the within-Stroop condition and 1 for the other within-conflict conditions). The results showed no significant cross-condition difference in either the incongruent trials (t = 1.22, p = .23) or the congruent trials (t = 1.06, p = .29). Thus, we believe the evidence for different similarities is inconclusive in our data and decided not to interpret this numerical difference.

      We have added this note in the revised figure caption for Figure S5.

      Author response image 5.

      Fig. S5. The stronger conflict type similarity effect in incongruent versus congruent conditions. (A) Summary representational similarity matrices for the right 8C region in incongruent (left) and congruent (right) conditions, respectively. Each cell represents the averaged Pearson correlation of cells with the same conflict type and congruency in the 1400×1400 matrix. Note that the apparent disparities between the values of the Stroop and other within-conflict cells (i.e., the diagonal) did not reach significance for either incongruent (t = 1.22, p = .23) or congruent (t = 1.06, p = .29) trials. (B) Scatter plot showing the averaged neural similarity (Pearson correlation) as a function of conflict type similarity in both conditions. The values in both A and B are calculated from raw Pearson correlation values, in contrast to the z-scored values in Fig. 4D.

      1. It would likely improve clarity if all of the competing models were displayed as summarized RSA matrices in a single figure, similar to (or perhaps combined with) Figure 7.

      Response: We appreciate the reviewer’s suggestion. We now have incorporated the domain-general and domain-specific models into the Figure 7 (now Figure 8).

      Author response image 6.

      Figure 8. Schematic of key RSMs. (A) and (B) show the orthogonality between conflict similarity and orientation RSMs. The within-subject RSMs (e.g., Group1-Group1) for conflict similarity and orientation are all the same, but the cross-group correlations (e.g., Group2-Group1) are different. Therefore, we can separate the contribution of these two effects when including them as different regressors in the same linear regression model. (C) and (D) show the two alternative models. Like the cosine model (A), within-group trial pairs resemble between-group trial pairs in these two models. The domain-specific model is an identity matrix. The domain-general model is estimated from the absolute difference of behavioral congruency effect, but scaled to 0(lowest similarity)-1(highest similarity) to aid comparison. The plotted matrices here include only one subject each from Group 1 and Group 2. Numbers 1-5 indicate the conflict type conditions, for spatial Stroop, StHSmL, StMSmM, StLSmH, and Simon, respectively. The thin lines separate four different sub-conditions, i.e., target arrow (up, down) × congruency (incongruent, congruent), within each conflict type.
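
      The two alternative models described in this caption can be sketched at the condition level as follows. This is only an illustration: the congruency effects are invented numbers, and the rescaling (1 minus the absolute difference divided by the largest absolute difference) is one assumed way to map values onto the 0 (lowest similarity) to 1 (highest similarity) range.

```python
# Hypothetical group-averaged congruency effects (ms), one per conflict type.
# These numbers are invented for illustration, not taken from the study.
congruency_effects = [60.0, 52.0, 45.0, 40.0, 30.0]


def domain_general_rsm(effects):
    """Similarity matrix built from absolute differences in congruency effect."""
    n = len(effects)
    diffs = [[abs(effects[i] - effects[j]) for j in range(n)] for i in range(n)]
    max_diff = max(max(row) for row in diffs)
    if max_diff == 0:
        # All conditions identical: every pair is maximally similar.
        return [[1.0] * n for _ in range(n)]
    # Assumed rescaling: 1 = no difference (highest similarity),
    # 0 = largest difference (lowest similarity).
    return [[1.0 - d / max_diff for d in row] for row in diffs]


def domain_specific_rsm(n):
    """Identity matrix: only trial pairs of the same conflict type are similar."""
    return [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]


general = domain_general_rsm(congruency_effects)
specific = domain_specific_rsm(len(congruency_effects))

for row in general:
    print(" ".join(f"{value:.2f}" for value in row))
```

      In the actual analysis these condition-level values would be expanded to every trial pair of the full RSM; that expansion is omitted here.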

      1. Because this model comparison is key to the main inferences in the study, it might also be helpful for most readers to move all of these RSA model matrices to the main text, instead of in the supplemental.

      Response: We thank the reviewer for this suggestion. We have moved the Fig. S4 to the main text, labeled as the new Figure 7.

      1. It may be worthwhile to check how robust the observed brain-behavior association (Fig 4C) is to the exclusion of the two datapoints with the lowest neural representation strength measure, as these points look like they have high leverage.

      Response: We recalculated the Pearson correlation after excluding the two points and found that the result was largely unchanged (r = 0.50, p = .003, compared to the original r = 0.52, p = .001).
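
      A robustness check of this kind amounts to recomputing the correlation after dropping the observations with the lowest values on one axis. The sketch below uses simulated data; the sample size, effect size, and random seed are assumptions chosen for illustration and do not correspond to the study data.

```python
import math
import random


def pearson(x, y):
    """Pearson correlation between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


random.seed(0)
n_subjects = 35  # assumed sample size

# Simulated stand-ins for neural representation strength (x)
# and the behavioral similarity modulation effect (y).
x = [random.gauss(0.0, 1.0) for _ in range(n_subjects)]
y = [0.5 * xi + random.gauss(0.0, 1.0) for xi in x]

r_all = pearson(x, y)

# Exclude the two subjects with the lowest x values (potential high-leverage points).
keep = sorted(range(n_subjects), key=lambda i: x[i])[2:]
r_trimmed = pearson([x[i] for i in keep], [y[i] for i in keep])

print(f"all subjects: r = {r_all:.2f}; two lowest excluded: r = {r_trimmed:.2f}")
```

      If the two coefficients are close, the association does not hinge on the excluded points.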

      Additionally, we found that the two axes were mistakenly swapped in Fig 4C and have corrected this error in the revised manuscript. This correction does not affect the correlation results.

      Author response image 7.

      Fig. 4. The conflict type effect. (A) Brain regions surviving the Bonferroni correction (p < 0.0001) across the regions (criterion 1). Labeled regions are those meeting criterion 2. (B) Different encoding of conflict type in the incongruent versus congruent conditions. * Bonferroni corrected p < .05. (C) The brain-behavior correlation of the right 8C (criterion 3). The x-axis shows the beta coefficient of the conflict type effect from the RSA, and the y-axis shows the beta coefficient obtained from the behavioral linear model using the conflict similarity to predict the CSE in Experiment 2. (D) Illustration of the different encoding strength of conflict type similarity in incongruent versus congruent conditions of right 8C. The y-axis is derived from the z-scored Pearson correlation coefficient, consistent with the RSA methodology. See Fig. S4B for a plot with the raw Pearson correlation measurement. l = left; r = right.

      Reviewer #3:

      Yang and colleagues investigated whether information on two task-irrelevant features that induce response conflict is represented in a common cognitive space. To test this, the authors used a task that combines the spatial Stroop conflict and the Simon effect. This task reliably produces a beautiful graded congruency sequence effect (CSE), where the cost of congruency is reduced after incongruent trials. The authors collected fMRI data and applied univariate, multivariate, and connectivity analyses to identify brain regions that represent the graded similarity of conflict types, the congruency of responses, and the visual features that induce conflicts. They further directly assessed the dimensionality of the represented conflict space.

      The authors identified the right dlPFC (right 8C), which shows 1) stronger encoding of graded similarity of conflicts in incongruent trials and 2) a positive correlation between the strength of conflict similarity type and the CSE on behavior. The dlPFC has been shown to be important for cognitive control tasks. As the dlPFC did not show a univariate parametric modulation based on the higher or lower component of one type of conflict (e.g., having more spatial Stroop conflict or less Simon conflict), it implies that dissimilarity of conflicts is represented by a linear increase or decrease of neural responses. Therefore, the similarity of conflict is represented in multivariate neural responses that combine two sources of conflict.

      The strength of the current approach lies in the clear effect of parametric modulation of conflict similarity across different conflict types. The authors employed a clever cross-subject RSA that counterbalanced and isolated the targeted effect of conflict similarity, decorrelating orientation similarity of stimulus positions that would otherwise be correlated with conflict similarity. A pattern of neural response seems to exist that maps different types of conflict, where each type is defined by the parametric gradation of the yoked spatial Stroop conflict and the Simon conflict on a similarity scale. The similarity of patterns increases in incongruent trials and is correlated with CSE modulation of behavior.

      The main significance of the paper lies in the evidence supporting the use of an organized "cognitive space" to represent conflict information as a general control strategy. The authors thoroughly test this idea using multiple approaches and provide convincing support for their findings. However, the universality of this cognitive strategy remains an open question.

      (Public Reviews) Taken together, this study presents an exciting possibility that information requiring high levels of cognitive control could be flexibly mapped into cognitive map-like representations that both benefit and bias our behavior. Further characterization of the representational geometry and generalization of the current results look promising ways to understand representations for cognitive control.

      Response: We would like to thank the reviewer for the positive evaluation of our manuscript and for providing constructive comments. In response to your suggestions, we have acknowledged the potential limitations of the design and the cross-subject RSA approach, and incorporated the open questions into the discussion. Please find our detailed responses below.

      The task presented in the study involved two sources of conflict information through a single salient visual input, which might have encouraged the utilization of a common space.

      Response: We agree that the unified visual input in our design may have facilitated the utilization of a common space. However, we believe that unified stimuli are not necessary for constructing the common space. To further test the potential interaction between the concrete stimulus setting and the cognitive space representation, future research will need to use varied stimuli. We have left this as an open question in the discussion:

      Can we effectively map any sources of conflict with completely different stimuli into a single space?

      The similarity space was analyzed at the level of between-individuals (i.e., cross-subject RSA) to mitigate potential confounds in the design, such as congruency and the orientation of stimulus positions. This approach makes it challenging to establish a direct link between the quality of conflict space representation and the patterns of behavioral adaptations within individuals.

      Response: By setting the variables as random effects at the subject level, we have extracted the individual effects that incorporate both the group-level fixed effects and individual-level random effects. We believe this approach yields results that are as reliable as, if not more reliable than, effects calculated from individual data only. First, the linear mixed-effect (LME) model includes all the individual data, which form the basis for estimating the random effects. Therefore, the individual effects derived from this approach inherently reflect the individual-specific effects. To support this notion, we have included a simulation script (accessible in the online file “simulation_LME.mlx” at https://osf.io/rcq8w) to demonstrate the strong consistency between the two approaches (see Author response image 8). In this simulation, we generated random data (Y) for 35 subjects, each containing 20 repeated measurements across 5 conditions. To streamline the simulation, we only included one predictor (X), which was treated as both a fixed and a random effect at the subject level. We applied two methods to calculate the individual beta coefficients. The first involved extracting individual beta coefficients from the LME model by summing the fixed effect with the subject-specific random effect. The second method entailed conducting a regression analysis using data from each subject to obtain the slope. We tested their consistency by calculating the Pearson correlation between the derived beta coefficients. This simulation was repeated 100 times.

      Author response image 8.

      The consistent individual beta coefficients between the mixed effect model and the individual regression analysis. A) The distribution of the Pearson correlation between the two methods across the 100 simulations. B) An example from the simulation showing the highly correlated results from the two methods. Each data point indicates a subject (n=35).
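
      The logic of this comparison can be illustrated with a simplified, standard-library sketch. It is not the authors' MATLAB script, and it does not fit an LME model: as a stand-in, each subject's ordinary least squares (OLS) slope is shrunk toward the group slope with an empirical-Bayes weight, which approximates the sum of fixed and random effects in a random-slope model. The generative values (fixed slope, slope SD, noise SD), the seed, and the single repetition are assumptions.

```python
import math
import random

random.seed(1)
N_SUBJECTS, N_OBS = 35, 100  # 20 repetitions x 5 conditions per subject
FIXED_SLOPE, SLOPE_SD, NOISE_SD = 0.5, 0.3, 1.0  # assumed generative values


def pearson(x, y):
    """Pearson correlation between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


ols_slopes, ols_vars = [], []
for _ in range(N_SUBJECTS):
    true_slope = FIXED_SLOPE + random.gauss(0.0, SLOPE_SD)
    x = [random.gauss(0.0, 1.0) for _ in range(N_OBS)]
    y = [true_slope * xi + random.gauss(0.0, NOISE_SD) for xi in x]
    mx, my = sum(x) / N_OBS, sum(y) / N_OBS
    sxx = sum((xi - mx) ** 2 for xi in x)
    slope = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / sxx
    intercept = my - slope * mx
    resid_var = sum((yi - intercept - slope * xi) ** 2
                    for xi, yi in zip(x, y)) / (N_OBS - 2)
    ols_slopes.append(slope)
    ols_vars.append(resid_var / sxx)  # sampling variance of the OLS slope

# Method-of-moments estimate of the between-subject slope variance.
mean_slope = sum(ols_slopes) / N_SUBJECTS
var_slopes = sum((b - mean_slope) ** 2 for b in ols_slopes) / (N_SUBJECTS - 1)
tau2 = max(var_slopes - sum(ols_vars) / N_SUBJECTS, 0.0)

# Partial pooling: precision-weighted group slope, then shrink each subject toward it.
weights = [1.0 / (tau2 + v) for v in ols_vars]
group_slope = sum(w * b for w, b in zip(weights, ols_slopes)) / sum(weights)
pooled_slopes = [group_slope + tau2 / (tau2 + v) * (b - group_slope)
                 for b, v in zip(ols_slopes, ols_vars)]

r = pearson(ols_slopes, pooled_slopes)
print(f"correlation between OLS and partially pooled slopes: r = {r:.3f}")
```

      With 100 observations per subject the shrinkage weights are close to 1 and nearly equal across subjects, so the partially pooled slopes preserve the ordering of the per-subject OLS slopes. This is the property the authors' simulation is reported to demonstrate with a full LME fit.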

      Second, the potential difference between the two methods is that the LME model also takes the group-level variance into account, such as the dissociable variances of conflict similarity and orientation across subject groups. This enabled us to extract relatively cleaner conflict similarity effects for each subject, which we believe can be better linked to the individual behavioral adaptations. Moreover, we extracted the behavioral adaptation scores (i.e., the similarity modulation effect on CSE) using a similar LME approach. Conducting the behavioral analysis solely on individual data would have been less reliable, given the limited amount of data per subject (~32 points per subject). This also motivated us to maintain consistency by extracting individual neural effects using LME models.

      Furthermore, it remains unclear at which cognitive stages during response selection such a unified space is recruited. Can we effectively map any sources of conflict into a single scale? Is this unified space adaptively adjusted within the same brain region? Additionally, does the amount of conflict solely define the dimensions of this unified space across many conflict-inducing tasks? These questions remain open for future studies to address.

      Response: We appreciate the reviewer’s constructive open questions. We respond to each of them based on our current understanding.

      1) It remains unclear at which cognitive stages during response selection such a unified space is recruited.

      We anticipate that the cognitive space is recruited to guide the transference of behavioral CSE at two critical stages. The first stage involves the evaluation of control demands, where the representational distance/similarity between previous and current trials influences the adjustment of cognitive control. The second stage pertains to control execution, where the switch from one control state to another follows a path within the cognitive space. It is worth noting that future studies aiming to address this question may benefit from methodologies with higher temporal resolutions, such as EEG and MEG, to provide more precise insights into the temporal dynamics of cognitive space recruitment.

      2) Can we effectively map any sources of conflict into a single scale?

      It is possible that various sources of conflict can be mapped onto the same space based on their similarity, even if finding such an operationally defined similarity may be challenging. However, our results may offer two approaches to infer the similarity between two conflicts. One is to examine their congruency sequence effect (CSE), with a stronger CSE suggesting greater similarity. The other is to test their representational similarity within the dorsolateral prefrontal cortex.

      3) Is this unified space adaptively adjusted within the same brain region?

      We do not have an answer to this question. We showed that the cognitive space does not change with time (Note. S3). What is adjusted is the control demand needed to resolve the conflict conditions, which change quickly from trial to trial. It remains an interesting question, though, whether the cognitive space may be altered, for example, when the mental state changes significantly. If so, we can further test whether the change of cognitive space also occurs within the right dlPFC.

      4) Additionally, does the amount of conflict solely define the dimensions of this unified space across many conflict-inducing tasks?

      Our understanding of this comment is that the amount of conflict refers to the number of conflict sources. Based on our current findings, the dimensions of the space are indeed defined by how many different conflict sources are included. However, this would require the different conflict sources to be orthogonal. If some sources share some aspects, the cognitive space may collapse to a lower dimension. We have incorporated the first question into the discussion:

      Moreover, we anticipate that the representation of cognitive space is most prominently involved at two critical stages to guide the transference of behavioral CSE. The first stage involves the evaluation of control demands, where the representational distance/similarity between previous and current trials influences the adjustment of cognitive control. The second stage pertains to control execution, where the switch from one control state to another follows a path within the cognitive space. However, we were unable to fully distinguish between these two stages due to the low temporal resolution of fMRI signals in our study. Future research seeking to delve deeper into this question may benefit from methodologies with higher temporal resolutions, such as EEG and MEG.

      We have included the other questions into the manuscript as open questions, calling for future research.

      Several interesting questions remain to be answered. For example, is the dimension of the unified space across conflict-inducing tasks solely determined by the number of conflict sources? Can we effectively map any sources of conflict with completely different stimuli into a single space? Is the cognitive space geometry modulated by the mental state? If yes, what brain regions mediate the change of cognitive space?

      Minor comments:

      • The original comment about out-of-sample predictions to examine the continuity of the space was a suggestion for testing neural representations, not behavior (I apologize for the lack of clarity). Given the low dimensionality of the conflict space shown by the participation ratio, we expect that linear separability exists only among specific combinations of conditions. For example, the pair of conflicts 1 and 5 together is not linearly separable from conflicts 2 and 3. But combined with other results, this is already implied.

      Response: We apologize for the misunderstanding. In fact, performing a prediction analysis using the extensive RSM in our study does present certain challenges, primarily due to its substantial size (1400x1400) and the intricate nature of the mixed-effect linear model. In our efforts to simplify the prediction process by excluding random effects, we did observe a correlation between the predicted and original values, albeit a relatively small Pearson correlation coefficient of r = 0.024, p < .001. This small correlation can be attributed to two key factors. First, the exclusion of data points impacts not only the conflict similarity regressor but also other regressors within the model, thereby diminishing the predictive power. Second, the large number of data points in the model heightens the risk of overfitting, subsequently reducing the model’s capacity for generalization and increasing the likelihood of unreliable predictions. Given these potential problems, we have opted not to include this prediction in the revised manuscript.

    1. Author response:

      The following is the authors’ response to the previous reviews

      Editor's note:

      Thank you for taking the time and effort to improve this study. After re-review, the two reviewers agree that the connection between fatty acids and sperm motility is still ambiguous. Thus, I recommend further toning down this conclusion consistently in the title and the text, as pointed out by the reviewers, before making a final Version of Record.

      We sincerely appreciate the considerable time and effort you and the reviewers devoted to evaluating our manuscript. We have revised the title and text to express the relationship between fatty acids and sperm motility more cautiously and consistently. With these revisions, we would like to proceed with publishing the manuscript as the Version of Record (VoR). Thank you very much for your guidance in improving our study.

      Public Reviews:

      Reviewer #1 (Public review):

      Summary:

      In this revised report, Yamanaka and colleagues investigate a proposed mechanism by which testosterone modulates seminal plasma metabolites in mice. Based on limited evidence in previous versions of the report, the authors softened the claim that oleic acid derived from seminal vesicle epithelium strongly affects linear progressive motility in isolated cauda epididymal sperm in vitro. However, the report still contains somewhat ambiguous references to the strength of the relationship between fatty acids and sperm motility.

      Strengths:

      Often, reported epididymal sperm from mice have lower percent progressive motility compared with sperm retrieved from the uterus or by comparison with human ejaculated sperm. The findings in this report may improve in vitro conditions to overcome this problem, as well as add important physiological context to the role of reproductive tract glandular secretions in modulating sperm behaviors. The strongest observations are related to the sensitivity of seminal vesicle epithelial cells to testosterone. The revisions include the addition of methodological detail, modified language to reflect the nuance of some of the measurements, as well as re-performed experiments with more appropriate control groups. The findings are likely to be of general interest to the field by providing context for follow-on studies regarding the relationship between fatty acid beta oxidation and sperm motility pattern.

      Weaknesses:

      The connection between media fatty acids and sperm motility pattern remains inconclusive.

      We would like to express our sincere gratitude to the reviewers for reviewing the manuscript and for their helpful comments, which were instrumental in improving the manuscript.

      Reviewer #2 (Public review):

      Using a combination of in vivo studies with testosterone-inhibited and aged mice with lower testosterone levels as well as isolated mouse and human seminal vesicle epithelial cells, the authors show that testosterone induces an increase in glucose uptake. They find that testosterone induces a difference in gene expression with a focus on metabolic enzymes. Specifically, they identify increased expression of enzymes regulating cholesterol and fatty acid synthesis, leading to increased production of 18:1 oleic acid. The revised version strengthens the role of ACLY as the main regulator of seminal vesicle epithelial cell metabolic programming. The authors propose that fatty acids are secreted by seminal vesicle epithelial cells and are taken up by sperm, positively affecting sperm function. A lipid mixture mimicking the lipids secreted by seminal vesicle epithelial cells, however, only has a small and mostly non-significant effect on sperm motility, suggesting the authors were not able to pinpoint the seminal vesicle fluid component that positively affects sperm function.

      We greatly appreciate the reviewer’s thoughtful comments and time spent reviewing this manuscript. The relationship between lipids such as fatty acids and sperm motility remains unclear in the current dataset. Therefore, before finalizing the manuscript, we revised the title and text, as suggested by the reviewers, to express this conclusion more cautiously and consistently.

      Recommendations for the authors:

      Reviewer #1 (Recommendations for the authors):

      Some additional comments are provided below to aid the authors in improving the quality of the work:

      Major Comments:

      (1) In the newly added supplemental figure 5, the authors note that the percentage data were arcsine transformed prior to statistical analysis without providing any other justification. This seems strange, especially for such a small sample size. It seems more appropriate for the authors to use a nonparametric test. Forcing symmetry without knowing what the shape of the true distribution is makes the ANOVA hard to interpret. Additionally, why use pairwise comparisons rather than comparing each group to the control (LM 0%)? Also, note that the graphs are not individually labeled to distinguish them in the legend (A, B, C, etc.). Ultimately, the treatment differences don't seem that meaningful, even if the authors were able to observe statistical significance with the somewhat over-manipulated method of analysis.

      Ultimately, the conclusion of this experiment (Supplemental figure 5) remains unchanged, but we agree that the relationship between fatty acids and sperm motility remains unclear. Therefore, before finalizing the manuscript, we revised the title and text as pointed out by the reviewers to express this conclusion more cautiously and consistently throughout the manuscript.

      The arcsine transformation is commonly used for percentage data [Zar, J.H. 2010. Biostatistical analysis; McDonald, J.H. 2014. Handbook of biological statistics]. If the values are low or high, such as 0 to 30% or 70 to 100%, omitting the arcsine transformation results in a large deviation from normality. However, even when the transformation is applied, the assumptions of normality and homogeneity of variance, which are prerequisites for parametric statistical methods, are not necessarily satisfied.
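
      As a minimal illustration of the transformation discussed above, the following Python sketch applies the arcsine square-root transform to percentages. The function name `arcsine_sqrt` and the example values are hypothetical and are not taken from the manuscript's data or analysis scripts.

```python
import math


def arcsine_sqrt(percent):
    """Arcsine square-root transform of a percentage in [0, 100].

    Returns the angle in radians, between 0 and pi/2.
    """
    if not 0.0 <= percent <= 100.0:
        raise ValueError("percent must lie in [0, 100]")
    return math.asin(math.sqrt(percent / 100.0))


# Hypothetical percentages, for illustration only.
example_percentages = [5.0, 20.0, 50.0, 80.0, 95.0]
transformed = [arcsine_sqrt(p) for p in example_percentages]
```

      The transform stretches values near 0% and 100% relative to the middle of the range, which is why it is often applied before parametric tests on proportions.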

      Given the small sample size and the possibility of non-normal data, we performed Shapiro–Wilk tests for each group (n = 6) and found no departure from normality (all p > 0.1). Q–Q plots and Levene’s test (p > 0.1) likewise supported the assumptions of ANOVA. Following the reviewer’s recommendation, we repeated the analysis with a Kruskal–Wallis test followed by Dunn’s post-hoc comparisons (Bonferroni corrected). Both approaches led to the same conclusions, with non-parametric p-values equal to or smaller than the parametric ones. In the revised manuscript, we report ANOVA as the primary analysis; the author response image below provides the non-parametric results and effect sizes with 95% confidence intervals for transparency.

      Author response image 1.

      Results of reanalysis of supplementary Figure 5 using nonparametric tests and effect sizes with 95% confidence intervals. Upper part: Differences between groups were assessed by the Kruskal–Wallis test, followed by Dunn’s post-hoc comparisons (Bonferroni corrected) for multiple comparisons. Different letters represent significantly different groups. Lower part: Effect sizes with 95% confidence intervals. For example, Cliff's Δ = -1 (95% CI ~ -0.6) in VSL's “LM 0 vs LM 1” means that LM 1% values exceed LM 0% values in all pairs.
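
      To make the interpretation of Cliff's Δ in the legend concrete, here is a minimal Python sketch of the statistic itself (the confidence interval is not computed). The group values are hypothetical placeholders, not the study's measurements, and the sign convention assumes the first group is LM 0% and the second is LM 1%.

```python
def cliffs_delta(xs, ys):
    """Cliff's delta: (#pairs with x > y - #pairs with x < y) / (len(xs) * len(ys))."""
    if not xs or not ys:
        raise ValueError("both groups must be non-empty")
    greater = sum(1 for x in xs for y in ys if x > y)
    less = sum(1 for x in xs for y in ys if x < y)
    return (greater - less) / (len(xs) * len(ys))


# Hypothetical values (n = 6 per group), for illustration only.
lm0 = [10.0, 11.0, 12.0, 13.0, 14.0, 15.0]
lm1 = [16.0, 17.0, 18.0, 19.0, 20.0, 21.0]

# Every LM 1% value exceeds every LM 0% value, so delta is -1.
delta = cliffs_delta(lm0, lm1)
```

      A value of -1 or +1 indicates complete separation of the two groups, while 0 indicates that neither group tends to exceed the other.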

      (2) I appreciate that the authors toned down the interpretation of the effects of seminal plasma metabolites on sperm motility with a cautionary statement on Lines 397-405 and Line 259. However, they send mixed signals with the title of the report: "Testosterone-Induced Metabolic Changes in Seminal Vesicle Epithelial cells Alter Plasma Components to Enhance Sperm Motility", and on line 265 when they say "ACLY expression is upregulated by testosterone and is essential for the metabolic shift of seminal vesicle epithelial cells that mediates sperm linear motility".

      The wording has been softened overall. The title has been changed to “Testosterone-Induced Metabolic Changes in Seminal Vesicle Epithelium Modify Seminal Plasma Components with Potential to Improve Sperm Motility”. In the results (lines 265-266), we have stated that “ACLY expression is upregulated by testosterone and is essential for the metabolic shift that is associated with increased linear motility” without implying a causal relationship.

      Minor Comments:

      (1) Typo on line 31: "understanding the male fertility mechanisms and may perspective for the development of potential biomarkers of male fertility and advance in the treatment of male infertility."

      We have made the following corrections. “These findings suggest that testosterone-dependent lipid remodeling may contribute to sperm straight-line motility, and further functional verification is required.”

      (2) Line 193: the statement is confusing "Therefore, we analyzed mitochondrial metabolism using a flux analyzer, predicting that more glucose is metabolized, pyruvate is metabolized from phosphoenolpyruvic acid through glycolysis in response to testosterone, and is further metabolized in the mitochondria." For example, 'Metabolized through glycolysis' is an ambiguous way to describe the pyruvate kinase reaction. Additionally, phosphoenolpyruvate has three acid ionizable groups, two of which have pKa's well below physiological pH, so phosphoenolpyruvate is the correct intermediate rather than phosphoenolpyruvic acid. The authors make similar mistakes with other organic acids such as citric acid.

      Rewritten as “We therefore examined cellular energy metabolism with a flux analyzer, anticipating that testosterone would elevate glycolytic flux, thereby producing more pyruvate from phosphoenolpyruvate. Because extracellular pyruvate levels simultaneously declined, we inferred that the cells had an increased pyruvate demand and, at that time, hypothesized that the excess pyruvate would enter the mitochondria to support enhanced oxidative metabolism.” (lines 193-198)

      The organic acids are now referenced in their appropriate forms (e.g., citrate, phosphoenolpyruvate).

      (3) Line 271: "Acly" should be all capitalized to "ACLY". The report mixes capitalization throughout and could be more consistent.

      We appreciate the reviewer’s attention to nomenclature and have standardized the manuscript accordingly. Protein names are written in roman type in all capital letters; mouse gene symbols are written in italics with only the first letter capitalized.

      Reviewer #2 (Recommendations for the authors):

      Major comments:

      (1) 'Once capacitation is complete, sperm cannot maintain that state for a long time'. The publications cited by the author do not support that statement and this reviewer also does not agree. Lower fertilization efficiency from in vitro capacitated epididymal sperm does not have to mean capacitation is reversed; it can simply mean in vitro capacitation conditions do not accurately mimic capacitation in vivo.

      We thank the reviewer for pointing this out and would like to clarify our position. Our statement does not suggest a "reversal" of active capacitation. Rather, it reflects the well-documented fact that capacitation is a transient process. Sperm that undergo capacitation too early cannot maintain that state for long enough to retain their ability to fertilize at the moment and location of fertilization in vivo.

      (2) How do the authors explain the discrepancy between the results shown in Fig. S1E, the increase in sperm motility upon mixing of sperm with SVF, and the results reported in Li et al 2025? Mentioning decapacitating factors without further explanation is insufficient.

      We appreciate the reviewer's feedback pointing out the need for a clearer explanation.

      Seminal plasma is inherently binary, containing both decapacitation factors that delay or inhibit capacitation and nutrient substrates that promote sperm motility.

      In vivo, it is believed that the coating of sperm by decapacitation factors is removed by uterine fluid and albumin as it passes through the female reproductive tract [PMID: 22827391, PMID: 24274412]. In contrast, standard fertilization culture media lack a clearance pathway, so decapacitating factors are retained throughout the culture period. As a result, the cleavage rate after in vitro fertilization using sperm exposed to seminal vesicle fluid decreased dramatically.

      Lipids, such as fatty acids, increased sperm motility without directly inducing markers of fertilization. These results suggest that the enhancement of motility by lipids is functionally distinct from the capacitation-inhibiting function of seminal plasma proteins. The data from this study are consistent with the biphasic model. Specifically, decapacitation factors temporarily stabilize the sperm membrane, preventing early capacitation. Meanwhile, lipids enhance sperm motility, enabling them to rapidly pass through the hostile uterine environment.

      (3) This reviewer does not see the merit in including a lipid mixture motility experiment compared to using OA alone. The increase in motility is still small and far from comparable to the motility increase with seminal vesicle fluid. In this reviewer's opinion the experiment is still inconclusive and should not be highlighted in the manuscript title.

      The wording has been softened overall. The title has been changed to “Testosterone-Induced Metabolic Changes in Seminal Vesicle Epithelium Modify Seminal Plasma Components with Potential to Improve Sperm Motility”. (Please see also Reviewer 1's main comment 1)

      Minor comments:

      (1) 'This change includes a large amplitude of flagella' does not make sense. Please correct.

      The following corrections have been made. “This change is characterized by large-amplitude flagellar beating.” (lines 44-45)

    1. Author response:

      The following is the authors’ response to the previous reviews.

      To the Senior Editor and the Reviewing Editor:

      We sincerely appreciate the valuable comments provided by the reviewers, the reviewing editor, and the senior editor. Based on our last response and revision, we are confused by the two limitations noted in the eLife assessment. 

      (1) benchmarking against comparable methods is limited.

      In our last revision, we added the comparison experiments with TNDM, as the reviewers requested. Additionally, it is crucial to emphasize that our evaluation of decoding capabilities of behaviorally relevant signals has been benchmarked against the performance of the ANN on raw signals, which, as Reviewer #1 previously noted, nearly represents the upper limit of performance. Consequently, we believe that our benchmarking methods are sufficiently strong.

      (2) some observations may be a byproduct of their method, and may not constitute new scientific observations.

      We believe that our experimental results are sufficient to demonstrate that our conclusions are not byproducts of d-VAE based on three reasons:

      (1) The d-VAE, as a latent variable model, adheres to the population doctrine, which posits that latent variables are responsible for generating the activities of individual neurons. The goal of such models is to maximize the explanation of the raw signals. At the signal level, the only criterion we can rely on is neural reconstruction performance, in which we have achieved unparalleled results. Thus, it is inappropriate to focus on the mixing process during the model's inference stage while overlooking the crucial de-mixing process during the generation stage and dismissing the significance of our neural reconstruction results. For more details, please refer to the first point in our response to Q4 from Reviewer #4.

      (2) The criterion that irrelevant signals should contain minimal information can effectively demonstrate that our conclusions are not by-products of d-VAE. Unfortunately, the reviewers seem to have overlooked this criterion. For more details, please refer to the third point in our response to Q4 from Reviewer #4.

      (3) Our synthetic experimental results also substantiate that our conclusions are not byproducts of d-VAE. However, it appears the reviewers did not give these results adequate consideration. For more details, please refer to the fourth point in our response to Q4 from Reviewer #4.

      Furthermore, our work presents not just "a useful method" but a comprehensive framework. Our study proposes, for the first time, a framework for defining, extracting, and validating behaviorally relevant signals. In our current revision, to clearly distinguish between d-VAE and other methods, we have formalized the extraction of behaviorally relevant signals into a mathematical optimization problem. To our knowledge, current methods have not explicitly proposed extracting behaviorally relevant signals, nor have they identified and addressed the key challenges of extracting relevant signals. Similarly, existing research has not yet defined and validated behaviorally relevant signals. For more details, please refer to our response to Q1 from Reviewer #4.

      Based on these considerations, we respectfully request that you reconsider the eLife assessment of our work. We greatly appreciate your time and attention to this matter.

      The main revisions made to the manuscript are as follows:

      (1) We have formalized the extraction of behaviorally relevant signals into a mathematical optimization problem, enabling a clearer distinction between d-VAE and other models.

      (2) We have moderated the assertion about linear readout to highlight its conjectural nature and have broadened the discussion regarding this conclusion. 

      (3) We have elaborated on the model details of d-VAE and have removed the identifiability claim.

      To Reviewer #1

      Q1: “As reviewer 3 also points out, I would, however, caution to interpret this as evidence for linear read-out of the motor system - your model performs a non-linear transformation, and while this is indeed linearly decodable, the motor system would need to do something similar first to achieve the same. In fact to me it seems to show the opposite, that behaviour-related information may not be generally accessible to linear decoders (including to down-stream brain areas).”

      Thank you for your comments. It's important to note that the conclusions we draw are speculative and not definitive. We use terms like "suggest" to reflect this uncertainty. To further emphasize the conjectural nature of our conclusions, we have deliberately moderated our tone.

      The question of whether behaviorally-relevant signals can be accessed by linear decoders or downstream brain regions hinges on the debate over whether the brain employs a strategy of filtering before decoding. If the brain employs such a strategy, the brain can probably access these signals. In our opinion, it is likely that the brain utilizes this strategy.

      Given the existence of behaviorally relevant signals, it is reasonable to assume that the brain has intrinsic mechanisms to differentiate between relevant and irrelevant signals. There is growing evidence suggesting that the brain utilizes various mechanisms, such as attention and specialized filtering, to suppress irrelevant signals and enhance relevant signals [1-3]. Therefore, it is plausible that the brain filters before decoding, thereby effectively accessing behaviorally relevant signals.

      Thank you for your valuable feedback.

      (1) Sreenivasan, Sameet, and Ila Fiete. "Grid cells generate an analog error-correcting code for singularly precise neural computation." Nature neuroscience 14.10 (2011): 1330-1337.

      (2) Schneider, David M., Janani Sundararajan, and Richard Mooney. "A cortical filter that learns to suppress the acoustic consequences of movement." Nature 561.7723 (2018): 391-395.

      (3) Nakajima, Miho, L. Ian Schmitt, and Michael M. Halassa. "Prefrontal cortex regulates sensory filtering through a basal ganglia-to-thalamus pathway." Neuron 103.3 (2019): 445-458.

      Q2: “As in my initial review, I would also caution against making strong claims about identifiability although this work and TNDM seem to show that in practise such methods work quite well. CEBRA, in contrast, offers some theoretical guarantees, but it is not a generative model, so would not allow the type of analysis done in this paper. In your model there is a parameter \alpha to balance between neural and behaviour reconstruction. This seems very similar to TNDM and has to be optimised - if this is correct, then there is manual intervention required to identify a good model.”

      Thank you for your comments. 

      Considering your concerns about our identifiability claims and the fact that identifiability is not directly relevant to the core of our paper, we have removed content related to identifiability.

      Firstly, our model is based on the pi-VAE, which also has theoretical guarantees. However, it is important to note that all such theoretical guarantees (including pi-VAE and CEBRA) are based on certain assumptions that cannot be validated as the true distribution of latent variables remains unknown.

      Secondly, it is important to clarify that the identifiability of latent variables does not impact the conclusions of this paper, nor does this paper make specific conclusions about the model's latent variables. Identifiability means that distinct latent variables correspond to distinct observations. If multiple latent variables can generate the same observation, it becomes impossible to determine which one is correct given the observation, which leads to the issue of nonidentifiability. Notably, our analysis focuses on the generated signals, not the latent variables themselves, and thus the identifiability of these variables does not affect our findings. 

      Our approach, dedicated to extracting these signals, distinctly differs from methods such as TNDM, which focuses on extracting behaviorally relevant latent dynamics. To clearly set apart d-VAE from other models, we have framed the extraction of behaviorally relevant signals as the following mathematical optimization problem:

      $$\min_{\mathbf{x}_r} \; E(\mathbf{x}_r, \mathbf{x}) + R(\mathbf{x}_r)$$

      where $\mathbf{x}_r$ denotes the generated behaviorally-relevant signals, $\mathbf{x}$ denotes the raw noisy signals, $E(\cdot,\cdot)$ denotes the reconstruction loss, and $R(\cdot)$ denotes the regularization loss. It is important to note that while both d-VAE and TNDM employ reconstruction loss, relying solely on this term is insufficient for determining the optimal degree of similarity between the generated and raw noisy signals. The key to accurately extracting behaviorally relevant signals lies in leveraging prior knowledge about these signals to determine the optimal similarity degree, encapsulated by $R(\mathbf{x}_r)$. Other studies have not explicitly proposed extracting behaviorally-relevant signals, nor have they identified and addressed the key challenges involved in extracting relevant signals. Consequently, our approach is distinct from other methods.

      Thank you for your valuable feedback.

      Q3: “Somewhat related, I also found that the now comprehensive comparison with related models shows that using decoding performance (R2) as a metric for model comparison may be problematic: the R2 values reported in Figure 2 (e.g. the MC_RTT dataset) should be compared to the values reported in the neural latent benchmark, which represent well-tuned models (e.g. AutoLFADS). The numbers (difficult to see, a table with numbers in the appendix would be useful, see: https://eval.ai/web/challenges/challenge-page/1256/leaderboard) seem lower than what can be obtained with models without latent space disentanglement. While this does not necessarily invalidate the conclusions drawn here, it shows that decoding performance can depend on a variety of model choices, and may not be ideal to discriminate between models. I'm also surprised by the low neural R2 for LFADS (I assume this is condition-averaged) - LFADS tends to perform very well on this metric.”

      Thank you for your comments. The dataset we utilized is not from the same day as the neural latent benchmark dataset. Notably, there is considerable variation in the length of trials within the RTT paradigm, and the dataset lacks explicit trial information, rendering trial-averaging unsuitable. Furthermore, behaviorally relevant signals are not static averages devoid of variability; even behavioral data exhibits variability. We computed the neural R2 using individual trials rather than condition-averaged responses. 
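
      For readers unfamiliar with the distinction, the following Python sketch shows one way an R2 score could be computed on individual trials and then averaged, rather than computed on condition-averaged responses. This is only an illustration under stated assumptions: the coefficient-of-determination formula is the standard one, but the helper names, the averaging across trials, and the example values are hypothetical and are not taken from our analysis code.

```python
def r_squared(observed, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    if len(observed) != len(predicted) or not observed:
        raise ValueError("inputs must be non-empty and of equal length")
    mean_obs = sum(observed) / len(observed)
    ss_res = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    ss_tot = sum((o - mean_obs) ** 2 for o in observed)
    if ss_tot == 0.0:
        raise ValueError("observed signal has zero variance")
    return 1.0 - ss_res / ss_tot


def mean_single_trial_r2(observed_trials, predicted_trials):
    """Average the R2 computed separately on each trial (trials may differ in length)."""
    scores = [r_squared(o, p) for o, p in zip(observed_trials, predicted_trials)]
    return sum(scores) / len(scores)


# Hypothetical single-neuron signals for two trials of different lengths.
observed_trials = [[1.0, 2.0, 3.0], [0.0, 2.0, 4.0, 6.0]]
predicted_trials = [[1.0, 2.0, 3.0], [3.0, 3.0, 3.0, 3.0]]
score = mean_single_trial_r2(observed_trials, predicted_trials)
```

      Because each trial is scored on its own samples, trial-to-trial variability counts against the model, which generally yields lower values than scoring condition-averaged responses.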

      Thank you for your valuable feedback.

      Q4: “One statement I still cannot follow is how the prior of the variational distribution is modelled. You say you depart from the usual Gaussian prior, but equation 7 seems to suggest there is a normal prior. Are the parameters of this distribution learned? As I pointed out earlier, I however suspect this may not matter much as you give the prior a very low weight. I also still am not sure how you generate a sample from the variational distribution, do you just draw one for each pass?”

      Thank you for your questions.

      The conditional distribution of prior latent variables $p_m(\mathbf{z}|\mathbf{y})$ is a Gaussian distribution, but the distribution of prior latent variables $p(\mathbf{z})$ is a Gaussian mixture distribution. The distribution of prior latent variables $p(\mathbf{z})$ is:

      $$p(\mathbf{z}) = \int p_m(\mathbf{z}|\mathbf{y})\, p(\mathbf{y})\, d\mathbf{y} = \frac{1}{N}\sum_{i=1}^{N} p_m(\mathbf{z}|\mathbf{y}^{(i)})$$

      where $p(\mathbf{y}) = \frac{1}{N}\sum_{i=1}^{N} \delta(\mathbf{y} - \mathbf{y}^{(i)})$ denotes the empirical distribution of behavioral variables $\mathbf{y}$, $N$ denotes the number of samples, $\mathbf{y}^{(i)}$ denotes the $i$th sample, $\delta(\cdot)$ denotes the Dirac delta function, and $p_m(\mathbf{z}|\mathbf{y})$ denotes the conditional distribution of prior latent variables given the behavioral variables, parameterized by network $m$. Based on the above equation, we can see that $p(\mathbf{z})$ is not a Gaussian distribution; it is a Gaussian mixture model with $N$ components, which is theoretically a universal approximator of continuous probability densities.
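
      The structure of this mixture prior can be sketched in a few lines of Python. This is a toy, one-dimensional illustration only: `toy_cond_mean` and `toy_cond_std` are hypothetical stand-ins for the outputs of the prior network, and the behavioral samples are made up.

```python
import math


def gaussian_pdf(z, mean, std):
    """Density of a univariate Gaussian N(mean, std^2) at z."""
    return math.exp(-0.5 * ((z - mean) / std) ** 2) / (std * math.sqrt(2.0 * math.pi))


def mixture_prior_density(z, behavior_samples, cond_mean, cond_std):
    """p(z) = (1/N) * sum_i N(z; cond_mean(y_i), cond_std(y_i)^2)."""
    if not behavior_samples:
        raise ValueError("need at least one behavioral sample")
    total = sum(gaussian_pdf(z, cond_mean(y), cond_std(y)) for y in behavior_samples)
    return total / len(behavior_samples)


def toy_cond_mean(y):
    # Hypothetical stand-in for the prior network's mean output.
    return 2.0 * y


def toy_cond_std(y):
    # Hypothetical stand-in for the prior network's standard deviation output.
    return 0.5


# Two made-up behavioral samples give a two-component, bimodal prior.
behavior = [-1.0, 1.0]
density_at_mode = mixture_prior_density(2.0, behavior, toy_cond_mean, toy_cond_std)
density_at_zero = mixture_prior_density(0.0, behavior, toy_cond_mean, toy_cond_std)
```

      Even though each conditional component is Gaussian, the resulting prior has a dip at zero between two modes, so the marginal is clearly not Gaussian.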

      Learning this prior is important, as illustrated by our latent variable visualizations, which are not Gaussian. Hypothesis tests on both the latent variables and the behavioral variables (Lilliefors test and Kolmogorov-Smirnov test) show that neither conforms to a Gaussian distribution. Consequently, imposing a constraint on the latent variables towards N(0,1) is expected to affect performance adversely.

      Regarding sampling, during training process, we draw only one sample from the approximate posterior distribution . It is worth noting that drawing multiple samples or one sample for each pass does not affect the experimental results. After training, we can generate a sample from the prior by providing input behavioral data 𝒚(𝒊) and then generating corresponding samples via and . To extract behaviorally-relevant signals from raw signals, we use and .

      Thank you for your valuable feedback.

      Q5: “(1) I found the figures good and useful, but the text is, in places, not easy to follow. I think the manuscript could be shortened somewhat, and in some places more concise focussed explanations would improve readability.

      (2) I would not call the encoding "complex non-linear" - non-linear is a clear term, but complex can mean many things (e.g. is a quadratic function complex?) ”

      Thank you for your recommendation. We have revised the manuscript for enhanced clarity.  We call the encoding “complex nonlinear” because neurons encode information with varying degrees of nonlinearity, as illustrated in Fig. 3b, f, and Fig. S3b.

      Thank you for your valuable feedback.

      To Reviewer #2

      Q1: “I still remain unconvinced that the core findings of the paper are "unexpected". In the response to my previous Specific Comment #1, they say "We use the term 'unexpected' due to the disparity between our findings and the prior understanding concerning neural encoding and decoding." However, they provide no citations or grounding for why they make those claims. What prior understanding makes it unexpected that encoding is more complex than decoding given the entropy, sparseness, and high dimensionality of neural signals (the "encoding") compared to the smoothness and low dimensionality of typical behavioural signals (the "decoding")?” 

      Thank you for your comments. We believe that both the complexity of neural encoding and the simplicity of neural decoding in motor cortex are unexpected.

      The Complexity of Neural Encoding: As noted in the Introduction, neurons with small R2 values were traditionally considered noise and consequently disregarded, as detailed in references [1-3]. However, after filtering out irrelevant signals, we discovered that these neurons actually contain substantial amounts of behavioral information, previously unrecognized. Similarly, in population-level analyses, neural signals composed of small principal components (PCs) are often dismissed as noise, with analyses typically utilizing only between 6 and 18 PCs [4-10]. Yet, the discarded PC signals nonlinearly encode significant amounts of information, with practically useful dimensions found to range between 30 and 40—far exceeding the usual number analyzed. These findings underscore the complexity of neural encoding and are unexpected.

      The Simplicity of Neural Decoding: In the motor cortex, nonlinear decoding of raw signals has been shown to significantly outperform linear decoding, as evidenced in references [11,12]. Interestingly, after separating behaviorally relevant and irrelevant signals, we observed that the linear decoding performance of behaviorally relevant signals is nearly equivalent to that of nonlinear decoding—a phenomenon previously undocumented in the motor cortex. This discovery is also unexpected.

      Thank you for your valuable feedback.

      (1) Georgopoulos, Apostolos P., Andrew B. Schwartz, and Ronald E. Kettner. "Neuronal population coding of movement direction." Science 233.4771 (1986): 1416-1419.

      (2) Hochberg, Leigh R., et al. "Reach and grasp by people with tetraplegia using a neurally controlled robotic arm." Nature 485.7398 (2012): 372-375. 

      (3) Inoue, Yoh, et al. "Decoding arm speed during reaching." Nature communications 9.1 (2018): 5243.

      (4) Churchland, Mark M., et al. "Neural population dynamics during reaching." Nature 487.7405 (2012): 51-56.

      (5) Kaufman, Matthew T., et al. "Cortical activity in the null space: permitting preparation without movement." Nature neuroscience 17.3 (2014): 440-448.

      (6) Elsayed, Gamaleldin F., et al. "Reorganization between preparatory and movement population responses in motor cortex." Nature communications 7.1 (2016): 13239.

      (7) Sadtler, Patrick T., et al. "Neural constraints on learning." Nature 512.7515 (2014): 423-426.

      (8) Golub, Matthew D., et al. "Learning by neural reassociation." Nature neuroscience 21.4 (2018): 607-616.

      (9) Gallego, Juan A., et al. "Cortical population activity within a preserved neural manifold underlies multiple motor behaviors." Nature communications 9.1 (2018): 4233.

      (10) Gallego, Juan A., et al. "Long-term stability of cortical population dynamics underlying consistent behavior." Nature neuroscience 23.2 (2020): 260-270.

      (11) Glaser, Joshua I., et al. "Machine learning for neural decoding." Eneuro 7.4 (2020).

      (12) Willsey, Matthew S., et al. "Real-time brain-machine interface in non-human primates achieves high-velocity prosthetic finger movements using a shallow feedforward neural network decoder." Nature Communications 13.1 (2022): 6899.

      Q2: “I still take issue with the premise that signals in the brain are "irrelevant" simply because they do not correlate with a fixed temporal lag with a particular behavioural feature hand-chosen by the experimenter. In the response to my previous review, the authors say "we employ terms like 'behaviorally-relevant' and 'behaviorally-irrelevant' only regarding behavioral variables of interest measured within a given task, such as arm kinematics during a motor control task.". This is just a restatement of their definition, not a response to my concern, and does not address my concern that the method requires a fixed temporal lag and continual decoding/encoding. My example of reward signals remains. There is a huge body of literature dating back to the 70s on the linear relationships between neural and activity and arm kinematics; in a sense, the authors have chosen the "variable of interest" that proves their point. This all ties back to the previous comment: this is mostly expected, not unexpected, when relating apparently-stochastic, discrete action potential events to smoothly varying limb kinematics.”

      Thank you for your comments. 

      Regarding the experimenter's specification of behavioral variables of interest, we followed common practice in existing studies [1, 2]. Regarding the use of fixed temporal lags, we followed the same practice as papers related to the dataset we use, which assume fixed temporal lags [3-5]. Furthermore, many studies in the motor cortex similarly use fixed temporal lags [6-8].

      Concerning the issue of rewards, in the paper you mentioned [9], the impact of rewards occurs after the reaching phase. It's important to note that in our experiments, we analyze only the reaching phase, without any post-movement phase. 

      If the impact of rewards can be stably reflected in the signals in the reaching phase of the subsequent trial, and if the reward-induced signals do not interfere with decoding—since these signals are harmless for decoding and beneficial for reconstruction—our model is likely to capture these signals. If the signals induced by rewards during the reaching phase are randomly unstable, our model will likely be unable to capture them.

      If the goal is to extract post-movement neural activity from both rewarded and unrewarded trials, and if the neural patterns differ between these conditions, one could replace the d-VAE's regression loss, used for continuous kinematics decoding, with a classification loss tailored to distinguish between rewarded and unrewarded conditions.

      To clarify the definition, we have revised it in the manuscript. Specifically, before giving the formal definition, we briefly introduce the relevant and irrelevant signals. Behaviorally irrelevant signals refer to those not directly associated with the behavioral variables of interest and may include noise or signals from variables of no interest. In contrast, behaviorally relevant signals refer to those directly related to the behavioral variables of interest. For instance, rewards in the post-movement phase are not directly related to behavioral variables (kinematics) in the reaching movement phase.

      It is important to note that our definition of behaviorally relevant signals covers not only decoding capability but also specific requirements at the signal level. It rests on two key requirements:

      (1) they should closely resemble the raw signals, to preserve the underlying neuronal properties, without becoming so similar that they include irrelevant signals (encoding requirement); and (2) they should contain as much behavioral information as possible (decoding requirement). Signals that meet both requirements are considered effective behaviorally relevant signals. In our study, we assume raw signals are additively composed of behaviorally relevant and irrelevant signals. We define irrelevant signals as those remaining after subtracting relevant signals from raw signals. Therefore, we believe our definition is clearly articulated.

      Thank you for your valuable feedback.

      (1) Sani, Omid G., et al. "Modeling behaviorally relevant neural dynamics enabled by preferential subspace identification." Nature Neuroscience 24.1 (2021): 140-149.

      (2) Buetfering, Christina, et al. "Behaviorally relevant decision coding in primary somatosensory cortex neurons." Nature neuroscience 25.9 (2022): 1225-1236.

      (3) Wang, Fang, et al. "Quantized attention-gated kernel reinforcement learning for brain-machine interface decoding." IEEE transactions on neural networks and learning systems 28.4 (2015): 873-886.

      (4) Dyer, Eva L., et al. "A cryptography-based approach for movement decoding." Nature biomedical engineering 1.12 (2017): 967-976.

      (5) Ahmadi, Nur, Timothy G. Constandinou, and Christos-Savvas Bouganis. "Robust and accurate decoding of hand kinematics from entire spiking activity using deep learning." Journal of Neural Engineering 18.2 (2021): 026011.

      (6) Churchland, Mark M., et al. "Neural population dynamics during reaching." Nature 487.7405 (2012): 51-56.

      (7) Kaufman, Matthew T., et al. "Cortical activity in the null space: permitting preparation without movement." Nature neuroscience 17.3 (2014): 440-448.

      (8) Elsayed, Gamaleldin F., et al. "Reorganization between preparatory and movement population responses in motor cortex." Nature communications 7.1 (2016): 13239.

      (9) Ramkumar, Pavan, et al. "Premotor and motor cortices encode reward." PloS one 11.8 (2016): e0160851.

      Q3: “The authors seem to have missed the spirit of my critique: to say "linear readout is performed in motor cortex" is an over-interpretation of what their model can show.”

      Thank you for your comments. It's important to note that the conclusions we draw are speculative and not definitive. We use terms like "suggest" to reflect this uncertainty. To further emphasize the conjectural nature of our conclusions, we have deliberately moderated our tone.

      The question of whether behaviorally-relevant signals can be accessed by downstream brain regions hinges on the debate over whether the brain employs a strategy of filtering before decoding. If the brain employs such a strategy, the brain can probably access these signals. In our view, it is likely that the brain utilizes this strategy.

      Given the existence of behaviorally relevant signals, it is reasonable to assume that the brain has intrinsic mechanisms to differentiate between relevant and irrelevant signals. There is growing evidence suggesting that the brain utilizes various mechanisms, such as attention and specialized filtering, to suppress irrelevant signals and enhance relevant signals [1-3]. Therefore, it is plausible that the brain filters before decoding, thereby effectively accessing behaviorally relevant signals.

      Regarding the question of whether the brain employs linear readout, given the limitations of current observational methods and our incomplete understanding of brain mechanisms, it is challenging to ascertain whether the brain employs a linear readout. In many cortical areas, linear decoders have proven to be sufficiently accurate. Consequently, numerous studies [4, 5, 6], including the one you referenced [4], directly employ linear decoders to extract information and formulate conclusions based on the decoding results. Contrary to these approaches, our research has compared the performance of linear and nonlinear decoders on behaviorally relevant signals and found their decoding performance is comparable. Considering both the decoding accuracy and model complexity, our results suggest that the motor cortex may utilize linear readout to decode information from relevant signals. Given the current technological limitations, we consider it reasonable to analyze collected data to speculate on the potential workings of the brain, an approach that many studies have also embraced [7-10]. For instance, a study [7] deduces strategies the brain might employ to overcome noise by analyzing the structure of recorded data and decoding outcomes for new stimuli.

      Thank you for your valuable feedback.

      (1) Sreenivasan, Sameet, and Ila Fiete. "Grid cells generate an analog error-correcting code for singularly precise neural computation." Nature neuroscience 14.10 (2011): 1330-1337.

      (2) Schneider, David M., Janani Sundararajan, and Richard Mooney. "A cortical filter that learns to suppress the acoustic consequences of movement." Nature 561.7723 (2018): 391-395.

      (3) Nakajima, Miho, L. Ian Schmitt, and Michael M. Halassa. "Prefrontal cortex regulates sensory filtering through a basal ganglia-to-thalamus pathway." Neuron 103.3 (2019): 445-458.

      (4) Jurewicz, Katarzyna, et al. "Irrational choices via a curvilinear representational geometry for value." bioRxiv (2022): 2022-03.

      (5) Hong, Ha, et al. "Explicit information for category-orthogonal object properties increases along the ventral stream." Nature neuroscience 19.4 (2016): 613-622.

      (6) Chang, Le, and Doris Y. Tsao. "The code for facial identity in the primate brain." Cell 169.6 (2017): 1013-1028.

      (7) Ganmor, Elad, Ronen Segev, and Elad Schneidman. "A thesaurus for a neural population code." Elife 4 (2015): e06134.

      (8) Churchland, Mark M., et al. "Neural population dynamics during reaching." Nature 487.7405 (2012): 51-56.

      (9) Gallego, Juan A., et al. "Cortical population activity within a preserved neural manifold underlies multiple motor behaviors." Nature communications 9.1 (2018): 4233.

      (10) Gallego, Juan A., et al. "Long-term stability of cortical population dynamics underlying consistent behavior." Nature neuroscience 23.2 (2020): 260-270.

      Q4: “Agreeing with my critique is not sufficient; please provide the data or simulations that provides the context for the reference in the fano factor. I believe my critique is still valid.”

      Thank you for your comments. As we previously replied, Churchland's research examines the variability of neural signals across different stages, including the preparation and execution phases, as well as before and after the target appears. Our study, however, focuses exclusively on the movement execution phase. Consequently, we are unable to produce comparative displays similar to those in his research. Intuitively, one might expect that the variability of behaviorally relevant signals would be lower; however, since no prior studies have accurately extracted such signals, the specific FF values of behaviorally relevant signals remain unknown. Therefore, presenting these values is meaningful, and can provide a reference for future research. While we cannot compare FF across different stages, we can numerically compare the values to the Poisson count process. An FF of 1 indicates a Poisson firing process, and our experimental data reveals that most neurons have an FF less than 1, indicating that the variance in firing counts is below the mean.  Thank you for your valuable feedback.
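
      The comparison with a Poisson process described above can be sketched in a few lines. This is a minimal illustration on synthetic spike counts, not the analysis code of the study; the function name `fano_factor`, the trial counts, and the firing-rate parameters are all assumptions chosen for the example.

```python
import numpy as np


def fano_factor(spike_counts):
    """Per-neuron Fano factor: across-trial variance of spike counts divided by their mean.

    spike_counts: array of shape (n_trials, n_neurons), counts in a fixed time window.
    """
    counts = np.asarray(spike_counts, dtype=float)
    mean = counts.mean(axis=0)
    var = counts.var(axis=0, ddof=1)  # unbiased variance across trials
    safe_mean = np.where(mean > 0, mean, 1.0)  # avoid division by zero
    # A neuron that never fires has an undefined Fano factor; report NaN for it.
    return np.where(mean > 0, var / safe_mean, np.nan)


# Synthetic counts (not data from the study), 20000 trials each.
rng = np.random.default_rng(0)
poisson_counts = rng.poisson(lam=5.0, size=(20000, 1))       # variance equals the mean
regular_counts = rng.binomial(n=10, p=0.5, size=(20000, 1))  # variance is half the mean

ff_poisson = fano_factor(poisson_counts)[0]  # close to 1
ff_regular = fano_factor(regular_counts)[0]  # close to 0.5, i.e. sub-Poisson
print(f"Poisson FF = {ff_poisson:.2f}, sub-Poisson FF = {ff_regular:.2f}")
```

      An FF below 1, as in the second synthetic neuron, means the count variance is smaller than the mean count.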

      To Reviewer #4

      Q1: “Overall, studying neural computations that are behaviorally relevant or not is an important problem, which several previous studies have explored (for example PSID in (Sani et al. 2021), TNDM in (Hurwitz et al. 2021), TAME-GP in (Balzani et al. 2023), pi-VAE in (Zhou and Wei 2020), and dPCA in (Kobak et al. 2016), etc). However, this manuscript does not properly put their work in the context of such prior works. For example, the abstract states "One solution is to accurately separate behaviorally-relevant and irrelevant signals, but this approach remains elusive", which is not the case given that these prior works have done that. The same is true for various claims in the main text, for example "Furthermore, we found that the dimensionality of primary subspace of raw signals (26, 64, and 45 for datasets A, B, and C) is significantly higher than that of behaviorally-relevant signals (7, 13, and 9), indicating that using raw signals to estimate the neural dimensionality of behaviors leads to an overestimation" (line 321). This finding was presented in (Sani et al. 2021) and (Hurwitz et al. 2021), which is not clarified here. This issue of putting the work in context has been brought up by other reviewers previously but seems to remain largely unaddressed. The introduction is inaccurate also in that it mixes up methods that were designed for separation of behaviorally relevant information with those that are unsupervised and do not aim to do so (e.g., LFADS). The introduction should be significantly revised to explicitly discuss prior models/works that specifically formulated this behavior separation and what these prior studies found, and how this study differs.”  

      Thank you for your comments. We maintain that our statement, “One solution is to accurately separate behaviorally-relevant and irrelevant signals, but this approach remains elusive”, is accurate. To the best of our knowledge, no prior work has accurately separated behaviorally relevant neural signals at both single-neuron and single-trial resolution. The works you mentioned have not explicitly proposed extracting behaviorally relevant signals, nor have they identified and addressed the key challenge of extracting relevant signals, namely determining the optimal degree of similarity between the generated relevant signals and the raw signals. Those works focus on latent neural dynamics rather than on the signal level.

      To clearly set apart d-VAE from other models, we have framed the extraction of behaviorally relevant signals as the following mathematical optimization problem:

      $$\min_{\boldsymbol{x}_r} \; E(\boldsymbol{x}, \boldsymbol{x}_r) + R(\boldsymbol{x}_r)$$

      where $\boldsymbol{x}_r$ denotes generated behaviorally-relevant signals, $\boldsymbol{x}$ denotes raw noisy signals, $E(\cdot,\cdot)$ denotes reconstruction loss, and $R(\cdot)$ denotes regularization loss. It is important to note that while both d-VAE and TNDM employ reconstruction loss, relying solely on this term is insufficient for determining the optimal degree of similarity between the generated and raw noisy signals. The key to accurately extracting behaviorally relevant signals lies in leveraging prior knowledge about these signals to determine the optimal similarity degree, encapsulated by $R(\boldsymbol{x}_r)$. None of the works you mentioned includes this key term $R(\boldsymbol{x}_r)$.
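
      A two-term objective of this kind, a reconstruction loss plus a regularization loss, can be sketched numerically. The code below is a toy under stated assumptions and is not the d-VAE implementation: mean squared error stands in for the reconstruction loss, a linear behavioral readout error stands in for the regularizer, the `(1 - alpha)` / `alpha` weighting is hypothetical, and the readout weights `w` are taken as known.

```python
import numpy as np


def reconstruction_loss(x, x_r):
    """E(x, x_r): mean squared error between raw and generated signals."""
    return float(np.mean((x - x_r) ** 2))


def decoding_regulariser(x_r, y, w):
    """R(x_r): illustrative stand-in, the squared error of a linear readout of y from x_r."""
    return float(np.mean((x_r @ w - y) ** 2))


def objective(x, x_r, y, w, alpha):
    """Hypothetical weighting of the two terms; alpha = 0 leaves only reconstruction."""
    return (1.0 - alpha) * reconstruction_loss(x, x_r) + alpha * decoding_regulariser(x_r, y, w)


# Toy data: 3 "neurons"; behaviour y is read out from the relevant part only.
rng = np.random.default_rng(0)
relevant = rng.normal(size=(5000, 3))
w = np.array([1.0, -0.5, 0.25])  # assumed known readout weights
y = relevant @ w
raw = relevant + rng.normal(scale=0.5, size=relevant.shape)  # raw = relevant + irrelevant

# Copying the raw signals gives zero reconstruction loss but a readout penalty;
# the true relevant signals pay a little in reconstruction and nothing in readout.
loss_copy_raw = objective(raw, raw, y, w, alpha=0.5)
loss_relevant = objective(raw, relevant, y, w, alpha=0.5)
loss_recon_only = objective(raw, raw, y, w, alpha=0.0)
print(loss_copy_raw, loss_relevant, loss_recon_only)
```

      With the reconstruction term alone (`alpha=0.0`), copying the raw signals is optimal, which is why a second term is needed to fix the degree of similarity.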

      Regarding the dimensionality estimation, the dimensionality of neural manifolds quantifies the degrees of freedom required to describe population activity without significant information loss.

      There are two differences between our work and PSID and TNDM. 

      First, the dimensions they refer to are fundamentally different from ours. The dimensionality we describe pertains to a linear subspace in which each neural dimension (also called a neural mode or principal component basis vector) is a vector of length N, with N representing the number of neurons. The vector length of a neural mode differs between PSID and our approach: PSID requires concatenating multiple time steps T, so that each of its modes essentially has length N × T. TNDM, on the other hand, involves nonlinear dimensionality reduction, which is different from linear dimensionality reduction.

      Second, we estimate neural dimensionality by explaining the variance of neural signals, whereas PSID and TNDM determine dimensionality through decoding performance saturation. It is important to note that the dimensionality at which decoding performance saturates may not accurately reflect the true dimensionality of neural manifolds, as some dimensions may contain redundant information that does not enhance decoding performance.
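
      A variance-based estimate of linear dimensionality can be sketched as follows. This is an illustration on synthetic data, not the procedure or the threshold used in the manuscript: the 90% cutoff, the helper name `dims_for_variance`, and the three-dimensional latent structure are assumptions made for the example.

```python
import numpy as np


def dims_for_variance(signals, threshold=0.9):
    """Smallest number of principal components whose cumulative explained
    variance reaches `threshold`. signals: array (n_samples, n_neurons)."""
    centered = signals - signals.mean(axis=0)
    singular_values = np.linalg.svd(centered, compute_uv=False)
    explained = singular_values**2 / np.sum(singular_values**2)
    n_dims = int(np.searchsorted(np.cumsum(explained), threshold)) + 1
    return min(n_dims, explained.size)


rng = np.random.default_rng(0)
n_samples, n_neurons, n_latent = 2000, 20, 3

# Orthonormal mixing so that the three latent dimensions carry equal variance.
mixing, _ = np.linalg.qr(rng.normal(size=(n_neurons, n_latent)))
relevant = rng.normal(size=(n_samples, n_latent)) @ mixing.T
raw = relevant + rng.normal(size=(n_samples, n_neurons))  # add isotropic "irrelevant" noise

dims_relevant = dims_for_variance(relevant)
dims_raw = dims_for_variance(raw)
print(dims_relevant, dims_raw)
```

      In this toy setting the noisy signals need more components than the noise-free ones to reach the same variance threshold, so the estimate depends on which signals are analyzed.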

      We acknowledge that while LFADS can generate signals that contain some behavioral information, it was not specifically designed to do so. Following your suggestion, we have removed this reference from the Introduction.

      Thank you for your valuable feedback.

      Q2: “Claims about linearity of "motor cortex" readout are not supported by results yet stated even in the abstract. Instead, what the results support is that for decoding behavior from the output of the dVAE model -- that is trained specifically to have a linear behavior readout from its embedding -- a nonlinear readout does not help. This result can be biased by the very construction of the dVAE's loss that encourages a linear readout/decoding from embeddings, and thus does not imply a finding about motor cortex.”

      Thank you for your comments. We respectfully disagree with the notion that the relevant signals can be linearly decoded merely because the model constrains the embedding to be linearly decodable. An embedding reorganizes or transforms the structure of the original signals, and the fact that the embedding can be linearly decoded does not mean that the corresponding signals can be.

      Let's clarify this with three intuitive examples:

      Example 1: Image denoising is a well-established field. Whether employing supervised or blind denoising methods [1, 2], both can effectively recover the original image. This denoising process closely resembles the extraction of behaviorally relevant signals from raw signals. Consider if noisy images are not amenable to linear decoding (classification); would removing the noise enable linear decoding? The answer is no. Typically, the noise in images captured under normal conditions is minimal, yet even the clear images remain challenging to decode linearly.

      Example 2: Consider the task of face recognition, where face images are set against various backgrounds. In this context, the pixels representing the face correspond to relevant signals, while the background pixels are considered irrelevant. Suppose a network is capable of extracting the face pixels and the resulting embedding can be linearly decoded. Can the face pixels themselves be linearly decoded? The answer is no. If linear decoding of face pixels were feasible, the challenging task of face recognition could be easily resolved by merely extracting the face from the background and training a linear classifier.

      Example 3: In the MNIST dataset, the background is uniformly black, and its impact is minimal. However, linear SVM classifiers used directly on the original pixels significantly underperform compared to non-linear SVMs.

      In summary, embedding involves reorganizing the structure of the original signals through a feature transformation function. However, the reconstruction process can recover the structure of the original signals from the embedding. The fact that the structure of the embedding can be linearly decoded does not imply that the structure of the original signals can be linearly decoded in the same way. It is inappropriate to focus on the compression process without equally considering the reconstruction process.
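
      The logical point, that a linearly decodable embedding does not force the reconstructed signals to be linearly decodable, can be checked on a toy problem. The sketch below is not d-VAE and uses no neural data: the XOR-like target, the hand-written "encoder" that appends a product feature, and the "decoder" that reads the signals back out are all hypothetical constructions for illustration.

```python
import numpy as np


def linear_r2(features, target):
    """In-sample R2 of an ordinary least-squares readout with an intercept."""
    design = np.column_stack([features, np.ones(len(features))])
    coef, *_ = np.linalg.lstsq(design, target, rcond=None)
    residual = target - design @ coef
    return 1.0 - residual.var() / target.var()


rng = np.random.default_rng(0)
signals = rng.uniform(-1.0, 1.0, size=(5000, 2))  # stand-in for the signals
behaviour = signals[:, 0] * signals[:, 1]         # XOR-like, nonlinear in the signals

# Hypothetical encoder: keeps both signals and adds their product.
embedding = np.column_stack([signals, behaviour])
# Hypothetical decoder: the first two embedding coordinates reproduce the signals exactly.
reconstructed = embedding[:, :2]

r2_embedding = linear_r2(embedding, behaviour)    # linear readout of the embedding
r2_signals = linear_r2(reconstructed, behaviour)  # linear readout of the signals
print(f"embedding R2 = {r2_embedding:.3f}, signals R2 = {r2_signals:.3f}")
```

      The embedding is linearly decodable while the perfectly reconstructed signals are not. The toy shows only that the first property does not imply the second; whether reconstructed neural signals are linearly decodable in a given model remains an empirical question.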

      Thank you for your valuable feedback.

      (1) Mao, Xiao-Jiao, Chunhua Shen, and Yu-Bin Yang. "Image restoration using convolutional auto-encoders with symmetric skip connections." arXiv preprint arXiv:1606.08921 (2016).

      (2) Lehtinen, Jaakko, et al. "Noise2Noise: Learning image restoration without clean data." International Conference on Machine Learning. International Machine Learning Society, 2018.

      Q3: “Related to the above, it is unclear what the manuscript means by readout from motor cortex. A clearer definition of "readout" (a mapping from what to what?) in general is needed. The mapping that the linearity/nonlinearity claims refer to is from the *inferred* behaviorally relevant neural signals, which themselves are inferred nonlinearly using the VAE. This should be explicitly clarified in all claims, i.e., that only the mapping from distilled signals to behavior is linear, not the whole mapping from neural data to behavior. Again, to say the readout from motor cortex is linear is not supported, including in the abstract.” 

      Thank you for your comments. We have revised the manuscript to make this clearer. Thank you for your valuable feedback.

      Q4: “Claims about individual neurons are also confounded. The d-VAE distilling processing is a population level embedding so the individual distilled neurons are not obtainable on their own without using the population data. This population level approach also raises the possibility that information can leak from one neuron to another during distillation, which is indeed what the authors hope would recover true information about individual neurons that wasn't there in the recording (the pixel denoising example). The authors acknowledge the possibility that information could leak to a neuron that didn't truly have that information and try to rule it out to some extent with some simulations and by comparing the distilled behaviorally relevant signals to the original neural signals. But ultimately, the distilled signals are different enough from the original signals to substantially improve decoding of low information neurons, and one cannot be sure if all of the information in distilled signals from any individual neuron truly belongs to that neuron. It is still quite likely that some of the improved behavior prediction of the distilled version of low-information neurons is due to leakage of behaviorally relevant information from other neurons, not the former's inherent behavioral information. This should be explicitly acknowledged in the manuscript.”

      Thank you for your comments. We value your insights regarding the mixing process. However, we are confident in the robustness of our conclusions. We respectfully disagree with the notion that the significant information found in neurons with small R2 values is primarily due to leakage, and we base our disagreement on four key reasons.

      (1) Neural reconstruction performance is a reliable and valid criterion.

      The purpose of latent variable models is to explain neuronal activity as much as possible. Given that the ground truth of the behaviorally-relevant signals, the latent variables, and the generative model is unknown, the only reliable reference at the signal level is the raw signals. A crucial criterion for evaluating the reliability of latent variable models (including latent variables and generated relevant signals) is their capability to effectively explain the raw signals [1]. Consequently, we maintain that if the generated signals resemble the raw signals as closely as possible, then, in accordance with an equivalence principle, we can claim that these signals faithfully retain the inherent properties of single neurons.

      Reviewer #4 appears to focus on the compression (mixing) process without giving equal consideration to the reconstruction (de-mixing) process. Numerous studies have demonstrated that deep autoencoders can reconstruct the original signal very effectively. For example, in the field of image denoising, autoencoders are capable of accurately restoring the original image [2, 3]. If one focuses only on the fact of mixing and ignores the reconstruction (de-mixing) process, then no level of reconstruction performance, which is the only criterion we can rely on at the signal level, would ever be accepted as evidence. If this were the case, many problems would become unsolvable. For instance, a fundamental criterion for latent variable models is their ability to explain the original data. If the ground truth of the latent variables remains unknown and the reconstruction criterion is disregarded, how can we validate the effectiveness of the model, the validity of the latent variables, or ensure that findings related to latent variables are not merely by-products of the model? Therefore, we disagree with the aforementioned notion. We believe that as long as the reconstruction performance is satisfactory, the extracted signals have successfully retained the characteristics of individual neurons.

      In our paper, we have shown in various ways that our generated signals sufficiently resemble the raw signals, including visualizing neuronal activity (Fig. 2m, Fig. 3i, and Fig. S5), achieving the highest performance among competitors (Fig. 2d, h, l), and conducting control analyses. Therefore, we believe our results are reliable. 

      (1) Cunningham, J.P. and Yu, B.M., 2014. Dimensionality reduction for large-scale neural recordings. Nature neuroscience, 17(11), pp.1500-1509.

      (2) Mao, Xiao-Jiao, Chunhua Shen, and Yu-Bin Yang. "Image restoration using convolutional auto-encoders with symmetric skip connections." arXiv preprint arXiv:1606.08921 (2016).

      (3) Lehtinen, Jaakko, et al. "Noise2Noise: Learning image restoration without clean data." International Conference on Machine Learning. International Machine Learning Society, 2018.

      (2) There is no reason for d-VAE to add signals that do not exist in the original signals.

      (1) Adding signals that do not exist in the small R2 neurons would decrease the reconstruction performance. This is because if the added signals contain significant information, they will not resemble the irrelevant signals, which contain no information, and thus the generated signals will not resemble the raw signals. The model is optimized to reduce the reconstruction loss, and this scenario runs against that optimization direction. It is worth mentioning that when the model has only the reconstruction loss, without the interference of the decoding loss, we believe that information leakage does not happen: the model can only be optimized toward signals that resemble the raw signals, and adding non-existent signals to the generated signals would increase the reconstruction loss, contrary to the objective of the optimization.

      (2) Information carried by these additional signals is redundant for larger R2 neurons, thus they do not introduce new information that can enhance the decoding performance of the neural population, which does not benefit the decoding loss.

      Based on these two points, we believe the model would not perform such counterproductive and harmful operations.

      (3) The criterion that irrelevant signals should contain minimal information can effectively rule out the leakage scenario.

      The criterion that irrelevant signals should contain minimal information is very important, but Reviewer #4 seems to have overlooked its significance. If the model's reconstruction is insufficient, or if additional information is added (which we do not believe will happen), the residuals would contain a large amount of decodable information, and this criterion would exclude such signals from selection. To clarify, let x, y, and z denote the raw, relevant, and irrelevant signals of a smaller R2 neuron, with x=y+z. If the extracted relevant signals become y+m, the irrelevant signals become z-m. Consequently, the irrelevant signals contain a significant amount of information.
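
      The residual argument can be made concrete with a small simulation. This is a sketch on synthetic signals, not the analysis behind the author response table: the scale of the true relevant part (0.2), the size of the hypothetical leaked component m, and the helper `linear_r2` are assumptions introduced for the example.

```python
import numpy as np


def linear_r2(features, target):
    """In-sample R2 of an ordinary least-squares readout with an intercept."""
    design = np.column_stack([features, np.ones(len(features))])
    coef, *_ = np.linalg.lstsq(design, target, rcond=None)
    residual = target - design @ coef
    return 1.0 - residual.var() / target.var()


rng = np.random.default_rng(0)
n = 5000
behaviour = rng.normal(size=n)
y = 0.2 * behaviour          # true relevant part of a small-R2 neuron
z = rng.normal(size=n)       # true irrelevant part, independent of behaviour
x = y + z                    # raw signal

m = 1.0 * behaviour          # hypothetical component leaked from other neurons

residual_faithful = x - y        # equals z
residual_leaky = x - (y + m)     # equals z - m

r2_faithful = linear_r2(residual_faithful[:, None], behaviour)
r2_leaky = linear_r2(residual_leaky[:, None], behaviour)
print(f"faithful residual R2 = {r2_faithful:.3f}, leaky residual R2 = {r2_leaky:.3f}")
```

      When the extracted signal carries an extra component m, the residual z-m becomes decodable, which is the signature the minimal-information criterion is meant to detect.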

      We presented the decoding R2 for irrelevant signals in real datasets under three distillation scenarios: a bias towards reconstruction (alpha=0, an extreme case where the model only has reconstruction loss without decoding loss), a balanced trade-off, and a bias towards decoding (alpha=0.9), as detailed in Author response table 1. If a significant amount of the information in small R2 neurons had leaked from large R2 neurons, the irrelevant signals should contain a large amount of information. However, our results indicate that the irrelevant signals contain only minimal information, and their performance closely resembles that of the model trained solely with reconstruction loss, showing no significant differences (P > 0.05, Wilcoxon rank-sum test). When the model leans towards decoding, some useful information will be left in the residuals, and irrelevant signals will contain a substantial amount of information, as observed in Author response table 1 for alpha=0.9. Therefore, we will not choose these signals for analysis.

      In conclusion, the criterion that irrelevant signals should contain minimal information is a very effective measure to exclude undesirable signals.

      Author response table 1.

      Decoding R2 of irrelevant signals

      (4) Synthetic experiments can effectively rule out the leakage scenario.

      In the absence of ground truth data, synthetic experiments serve as an effective method for validating models and are commonly employed [1-3]. 

      Our experimental results demonstrate that d-VAE can effectively extract neural signals that more closely resemble actual behaviorally relevant signals (Fig. S2g).  If there were information leakage, it would decrease the similarity to the ground truth signals, hence we have ruled out this possibility. Moreover, in synthetic experiments with small R2 neurons (Fig. S10), results also demonstrate that our model could make these neurons more closely resemble ground truth relevant signals and recover their information. 

      In summary, synthetic experiments strongly demonstrate that our model can recover obscured neuronal information, rather than adding signals that do not exist.

      (1) Pnevmatikakis, Eftychios A., et al. "Simultaneous denoising, deconvolution, and demixing of calcium imaging data." Neuron 89.2 (2016): 285-299.

      (2) Schneider, Steffen, Jin Hwa Lee, and Mackenzie Weygandt Mathis. "Learnable latent embeddings for joint behavioural and neural analysis." Nature 617.7960 (2023): 360-368.

      (3) Zhou, Ding, and Xue-Xin Wei. "Learning identifiable and interpretable latent models of high-dimensional neural activity using pi-VAE." Advances in Neural Information Processing Systems 33 (2020): 7234-7247.

      Based on these four points, we are confident in the reliability of our results. If Reviewer #4 considers these points insufficient, we would highly appreciate it if specific concerns regarding any of these aspects could be detailed.

      Thank you for your valuable feedback.

      Q5: “Given the nuances involved in appropriate comparisons across methods and since two of the datasets are public, the authors should provide their complete code (not just the dVAE method code), including the code for data loading, data preprocessing, model fitting and model evaluation for all methods and public datasets. This will alleviate concerns and allow readers to confirm conclusions (e.g., figure 2) for themselves down the line.”

      Thanks for your suggestion.

      Our code is now available on GitHub at https://github.com/eric0li/d-VAE. Thank you for your valuable feedback.

      Q6: “Related to 1) above, the authors should explore the results if the affine network h(.) (from embedding to behavior) was replaced with a nonlinear ANN. Perhaps linear decoders would no longer be as close to nonlinear decoders. Regardless, the claim of linearity should be revised as described in 1) and 2) above, and all caveats should be discussed.”

      Thank you for your suggestion. We appreciate this proposal, which can be tested empirically. Following your suggestion, we have replaced the decoding of the latent variable z to behavior y with a nonlinear neural network, specifically a neural network with a single hidden layer. The modified model is termed d-VAE2. We applied d-VAE2 to the real data and selected the optimal alpha using the validation set. As shown in Author response table 2, the performance of KF and ANN remains comparable. Therefore, the capacity to linearly decode behaviorally relevant signals does not stem from the linear decoding of embeddings.

      Author response table 2.

      Decoding R2 of behaviorally relevant signals obtained by d-VAE2

      Additionally, it is worth noting that this approach is uncommon and is considered somewhat inappropriate according to the Information Bottleneck theory [1]. According to the Information Bottleneck theory, information is progressively compressed in multilayer neural networks, discarding what is irrelevant to the output and retaining what is relevant. This means that as the number of layers increases, the mutual information between each layer's embedding and the model input gradually decreases, while the mutual information between each layer's embedding and the model output gradually increases. For the decoding part, if embeddings that are not closest to the output (behaviors) are used, then these embeddings might contain behaviorally irrelevant signals. Using these embeddings to generate behaviorally relevant signals could lead to the inclusion of irrelevant signals in the behaviorally relevant signals.

      To demonstrate the above statement, we conducted experiments on the synthetic data. As shown in Table 2, we present the performance (neural R2 between the generated signals and the ground truth signals) of both models at several alpha values around the optimal alpha of d-VAE (alpha=0.9) selected by the validation set. The experimental results show three things: at the same alpha value, the performance of d-VAE2 is consistently inferior to that of d-VAE; d-VAE2 requires a higher alpha value to achieve performance comparable to d-VAE; and the best performance of d-VAE2 is still inferior to that of d-VAE.

      Author response table 3.

      Neural R2 between generated signals and real behaviorally relevant signals

      Thank you for your valuable feedback.

      (1) Shwartz-Ziv, Ravid, and Naftali Tishby. "Opening the black box of deep neural networks via information." arXiv preprint arXiv:1703.00810 (2017).

      Q7: “The beginning of the section on the "smaller R2 neurons" should clearly define what R2 is being discussed. Based on the response to previous reviewers, this R2 "signifies the proportion of neuronal activity variance explained by the linear encoding model, calculated using raw signals". This should be mentioned and made clear in the main text whenever this R2 is referred to.”

      Thank you for your suggestion. We have made the modifications in the main text. Thank you for your valuable feedback.

      Q8: “Various terms require clear definitions. The authors sometimes use vague terminology (e.g., "useless") without a clear definition. Similarly, discussions regarding dimensionality could benefit from more precise definitions. How is neural dimensionality defined? For example, how is "neural dimensionality of specific behaviors" (line 590) defined? Related to this, I agree with Reviewer 2 that a clear definition of irrelevant should be mentioned that clarifies that relevance is roughly taken as "correlated or predictive with a fixed time lag". The analyses do not explore relevance with arbitrary time lags between neural and behavior data.”

      Thanks for your suggestion. We have removed the “useless” statements and have revised the statement of “the neural dimensionality of specific behaviors” in our revised manuscripts.

      Regarding the use of fixed temporal lags, we followed the same practice as papers related to the dataset we use, which assume fixed temporal lags [1-3]. Furthermore, many studies in the motor cortex similarly use fixed temporal lags [4-6]. To clarify the definition, we have revised the definition in our manuscript. For details, please refer to the response to Q2 of reviewer #2 and our revised manuscript. We believe our definition is clearly articulated.

      Thank you for your valuable feedback.

      (1) Wang, Fang, et al. "Quantized attention-gated kernel reinforcement learning for brain– machine interface decoding." IEEE transactions on neural networks and learning systems 28.4 (2015): 873-886.

      (2) Dyer, Eva L., et al. "A cryptography-based approach for movement decoding." Nature biomedical engineering 1.12 (2017): 967-976.

      (3) Ahmadi, Nur, Timothy G. Constandinou, and Christos-Savvas Bouganis. "Robust and accurate decoding of hand kinematics from entire spiking activity using deep learning." Journal of Neural Engineering 18.2 (2021): 026011.

      (4) Churchland, Mark M., et al. "Neural population dynamics during reaching." Nature 487.7405 (2012): 51-56.

      (5) Kaufman, Matthew T., et al. "Cortical activity in the null space: permitting preparation without movement." Nature neuroscience 17.3 (2014): 440-448.

      (6) Elsayed, Gamaleldin F., et al. "Reorganization between preparatory and movement population responses in motor cortex." Nature communications 7.1 (2016): 13239. 

      Q9: “CEBRA itself doesn't provide a neural reconstruction from its embeddings, but one could obtain one via a regression from extracted CEBRA embeddings to neural data. In addition to decoding results of CEBRA (figure S3), the neural reconstruction of CEBRA should be computed and CEBRA should be added to Figure 2 to see how the behaviorally relevant and irrelevant signals from CEBRA compare to other methods.”

      Thank you for your question. Modifying CEBRA is beyond the scope of our work. As CEBRA is not a generative model, it cannot obtain behaviorally relevant and irrelevant signals, and therefore it lacks the results presented in Fig. 2. To avoid the same confusion encountered by reviewers #3 and #4 among our readers, we have opted to exclude the comparison with CEBRA. It is crucial to note, as previously stated, that our assessment of decoding capabilities has been benchmarked against the performance of the ANN on raw signals, which almost represents the upper limit of performance. Consequently, omitting CEBRA does not affect our conclusions.

      Thank you for your valuable feedback.

      Q10: “Line 923: "The optimal hyperparameter is selected based on the lowest averaged loss of five-fold training data." => why is this explained specifically under CEBRA? Isn't the same criteria used for hyperparameters of other methods? If so, clarify.”

      Thank you for your question. The hyperparameter selection for CEBRA follows the practice of the original CEBRA paper. The hyperparameter selection for generative models is detailed in the Section “The strategy for selecting effective behaviorally-relevant signals”.  Thank you for your valuable feedback.

    1. Author response:

      The following is the authors’ response to the previous reviews.

      Reviewer #1 (Public Review):

      In this paper, the authors evaluate the utility of brain age derived metrics for predicting cognitive decline by performing a 'commonality' analysis in a downstream regression that enables the different contribution of different predictors to be assessed. The main conclusion is that brain age derived metrics do not explain much additional variation in cognition over and above what is already explained by age. The authors propose to use a regression model trained to predict cognition ('brain cognition') as an alternative suited to applications of cognitive decline. While this is less accurate overall than brain age, it explains more unique variance in the downstream regression.  

      Importantly, in this revision, we clarified that we did not intend to use Brain Cognition as an alternative approach. This is because, by design, the variation in fluid cognition explained by Brain Cognition should be higher or equal to that explained by Brain Age. Here we made this point more explicit and further stated that the relationship between Brain Cognition and fluid cognition indicates the upper limit of Brain Age’s capability in capturing fluid cognition. By examining what was captured by Brain Cognition, over and above Brain Age and chronological age via the unique effects of Brain Cognition, we were able to quantify the amount of co-variation between brain MRI and fluid cognition that was missed by Brain Age. 

      REVISED VERSION: while the authors have partially addressed my concerns, I do not feel they have addressed them all. I do not feel they have addressed the weight instability and concerns about the stacked regression models satisfactorily.

      Please see our responses to Reviewer #1 Public Review #3 below

      I also must say that I agree with Reviewer 3 about the limitations of the brain age and brain cognition methods conceptually. In particular that the regression model used to predict fluid cognition will by construction explain more variance in cognition than a brain age model that is trained to predict age. This suffers from the same problem the authors raise with brain age and would indeed disappear if the authors had a separate measure of cognition against which to validate and were then to regress this out as they do for age correction. I am aware that these conceptual problems are more widespread than this paper alone (in fact throughout the brain age literature), so I do not believe the authors should be penalised for that. However, I do think they can make these concerns more explicit and further tone down the comments they make about the utility of brain cognition. I have indicated the main considerations about these points in the recommendations section below. 

      Thank you so much for raising this point. We now have the following statement in the introduction and discussion to address this concern (see below). 

      Briefly, we made it explicit that, by design, the variation in fluid cognition explained by Brain Cognition should be higher or equal to that explained by Brain Age. That is, the relationship between Brain Cognition and fluid cognition indicates the upper limit of Brain Age’s capability in capturing fluid cognition. More importantly, by examining what was captured by Brain Cognition, over and above Brain Age and chronological age via the unique effects of Brain Cognition, we were able to quantify the amount of co-variation between brain MRI and fluid cognition that was missed by Brain Age. And this is the third goal of this present study. 

      From Introduction:

      “Third and finally, certain variation in fluid cognition is related to brain MRI, but to what extent does Brain Age not capture this variation? To estimate the variation in fluid cognition that is related to the brain MRI, we could build prediction models that directly predict fluid cognition (i.e., as opposed to chronological age) from brain MRI data. Previous studies found reasonable predictive performances of these cognition-prediction models, built from certain MRI modalities (Dubois et al., 2018; Pat, Wang, Anney, et al., 2022; Rasero et al., 2021; Sripada et al., 2020; Tetereva et al., 2022; for review, see Vieira et al., 2022). Analogous to Brain Age, we called the predicted values from these cognition-prediction models, Brain Cognition. The strength of an out-of-sample relationship between Brain Cognition and fluid cognition reflects variation in fluid cognition that is related to the brain MRI and, therefore, indicates the upper limit of Brain Age’s capability in capturing fluid cognition. This is, by design, the variation in fluid cognition explained by Brain Cognition should be higher or equal to that explained by Brain Age. Consequently, if we included Brain Cognition, Brain Age and chronological age in the same model to explain fluid cognition, we would be able to examine the unique effects of Brain Cognition that explain fluid cognition beyond Brain Age and chronological age. These unique effects of Brain Cognition, in turn, would indicate the amount of co-variation between brain MRI and fluid cognition that is missed by Brain Age.”

      From Discussion:

      “Third, by introducing Brain Cognition,  we showed the extent to which Brain Age indices were not able to capture the variation in fluid cognition that is related to brain MRI. More specifically, using Brain Cognition allowed us to gauge the variation in fluid cognition that is related to the brain MRI, and thereby, to estimate the upper limit of what Brain Age can do. Moreover, by examining what was captured by Brain Cognition, over and above Brain Age and chronological age via the unique effects of Brain Cognition, we were able to quantify the amount of co-variation between brain MRI and fluid cognition that was missed by Brain Age.

      From our results, Brain Cognition, especially from certain cognition-prediction models such as the stacked models, has relatively good predictive performance, consistent with previous studies (Dubois et al., 2018; Pat, Wang, Anney, et al., 2022; Rasero et al., 2021; Sripada et al., 2020; Tetereva et al., 2022; for review, see Vieira et al., 2022). We then examined Brain Cognition using commonality analyses (Nimon et al., 2008) in multiple regression models having a Brain Age index, chronological age and Brain Cognition as regressors to explain fluid cognition. Similar to Brain Age indices, Brain Cognition exhibited large common effects with chronological age. But more importantly, unlike Brain Age indices, Brain Cognition showed large unique effects, up to around 11%. As explained above, the unique effects of Brain Cognition indicated the amount of co-variation between brain MRI and fluid cognition that was missed by a Brain Age index and chronological age. This missing amount was relatively high, considering that Brain Age and chronological age together explained around 32% of the total variation in fluid cognition. Accordingly, if a Brain Age index was used as a biomarker along with chronological age, we would have missed an opportunity to improve the performance of the model by around one-third of the variation explained.” 
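      The unique and common effects discussed above can be illustrated with a minimal sketch. It assumes ordinary least-squares regression and in-sample R2, and it uses simulated data: the variable names and the simulated effect sizes are hypothetical and are not taken from the study. The unique effect of a regressor is computed as the drop in R2 when that regressor is removed from the full model, and the remainder of the full-model R2 is the total of all common effects.

```python
import numpy as np


def r_squared(X, y):
    """In-sample R^2 of an ordinary least-squares fit of y on X (with intercept)."""
    design = np.column_stack([np.ones(len(y)), X])
    beta, *_ = np.linalg.lstsq(design, y, rcond=None)
    resid = y - design @ beta
    return 1.0 - (resid @ resid) / np.sum((y - y.mean()) ** 2)


def unique_and_common(predictors, y):
    """Return (full R^2, unique effect per predictor, total of all common effects)."""
    names = list(predictors)
    full = r_squared(np.column_stack([predictors[n] for n in names]), y)
    unique = {}
    for name in names:
        others = np.column_stack([predictors[m] for m in names if m != name])
        # Unique effect: variance explained only when this predictor is present.
        unique[name] = full - r_squared(others, y)
    common = full - sum(unique.values())
    return full, unique, common


# Simulated (hypothetical) data: Brain Age tracks age only, whereas Brain
# Cognition also carries brain-related variation that is unrelated to age.
rng = np.random.default_rng(0)
n = 500
age = rng.normal(size=n)
brain_signal = rng.normal(size=n)
predictors = {
    "age": age,
    "brain_age": age + 0.5 * rng.normal(size=n),
    "brain_cognition": -0.5 * age + brain_signal + 0.5 * rng.normal(size=n),
}
fluid_cognition = -0.6 * age + 0.5 * brain_signal + rng.normal(size=n)

full, unique, common = unique_and_common(predictors, fluid_cognition)
for name, value in unique.items():
    print(f"unique effect of {name}: {value:.3f}")
print(f"total common effects: {common:.3f}, full R^2: {full:.3f}")
```

      In this simulated setting, Brain Age adds almost nothing beyond chronological age, while Brain Cognition retains a sizeable unique effect, which mirrors the qualitative pattern described in the quoted Discussion.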

      This is a reasonably good paper and the use of a commonality analysis is a nice contribution to understanding variance partitioning across different covariates. I have some comments that I believe the authors ought to address, which mostly relate to clarity and interpretation 

      Reviewer #1 Public Review #1

      First, from a conceptual point of view, the authors focus exclusively on cognition as a downstream outcome. I would suggest the authors nuance their discussion to provide broader considerations of the utility of their method and on the limits of interpretation of brain age models more generally. 

      Thank you for your comments on this issue. 

      We now discuss the broader considerations in detail:

      (1) the consistency between our findings on fluid cognition and other recent works on brain disorders, 

      (2) the difference between studies investigating the utility of Brain Age in explaining cognitive functioning, including ours and others (e.g., Butler et al., 2021; Cole, 2020, 2020; Jirsaraie, Kaufmann, et al., 2023) and those explaining neurological/psychological disorders (e.g., Bashyam et al., 2020; Rokicki et al., 2021)

      and 

      (3) suggested solutions we and others made to optimise the utility of Brain Age for both cognitive functioning and brain disorders.

      From Discussion:

      “This discrepancy between the predictive performance of age-prediction models and the utility of Brain Age indices as a biomarker is consistent with recent findings (for review, see Jirsaraie, Gorelik, et al., 2023), both in the context of cognitive functioning (Jirsaraie, Kaufmann, et al., 2023) and neurological/psychological disorders (Bashyam et al., 2020; Rokicki et al., 2021). For instance, combining different MRI modalities into the prediction models, similar to our stacked models, often leads to the highest performance of age prediction models, but does not likely explain the highest variance across different phenotypes, including cognitive functioning and beyond (Jirsaraie, Gorelik, et al., 2023).”

      “There is a notable difference between studies investigating the utility of Brain Age in explaining cognitive functioning, including ours and others (e.g., Butler et al., 2021; Cole, 2020, 2020; Jirsaraie, Kaufmann, et al., 2023) and those explaining neurological/psychological disorders (e.g., Bashyam et al., 2020; Rokicki et al., 2021). We consider the former as a normative type of study and the latter as a case-control type of study (Insel et al., 2010; Marquand et al., 2016). Those case-control Brain Age studies focusing on neurological/psychological disorders often build age-prediction models from MRI data of largely healthy participants (e.g., controls in a case-control design or large samples in a population-based design), apply the built age-prediction models to participants without vs. with neurological/psychological disorders and compare Brain Age indices between the two groups. On the one hand, this means that case-control studies treat Brain Age as a method to detect anomalies in the neurological/psychological group (Hahn et al., 2021). On the other hand, this also means that case-control studies have to ignore underfitted models when applying prediction models built from largely healthy participants to participants with neurological/psychological disorders (i.e., Brain Age may predict chronological age well for the controls, but not for those with a disorder). On the contrary, our study and other normative studies focusing on cognitive functioning often build age prediction models from MRI data of largely healthy participants and apply the built age prediction models to participants who are also largely healthy. Accordingly, the age prediction models for explaining cognitive functioning in normative studies, while not allowing us to detect group-level anomalies, do not suffer from being under-fitted. This unfortunately might limit the generalisability of our study into just the normative type of study. Future work is still needed to test the utility of brain age in the case-control case.”

      “Next, researchers should not select age-prediction models based solely on age-prediction performance. Instead, researchers could select age-prediction models that explained phenotypes of interest the best. Here we selected age-prediction models based on a set of features (i.e., modalities) of brain MRI. This strategy was found effective not only for fluid cognition as we demonstrated here, but also for neurological and psychological disorders as shown elsewhere (Jirsaraie, Gorelik, et al., 2023; Rokicki et al., 2021). Rokicki and colleagues (2021), for instance, found that, while integrating across MRI modalities led to age prediction models with the highest age-prediction performance, using only T1 structural MRI gave age-prediction models that were better at classifying Alzheimer’s disease. Similarly, using only cerebral blood flow gave age-prediction models that were better at classifying mild/subjective cognitive impairment, schizophrenia and bipolar disorder. 

      As opposed to selecting age-prediction models based on a set of features, researchers could also select age-prediction models based on modelling methods. For instance, Jirsaraie and colleagues (2023) compared gradient tree boosting (GTB) and deep-learning brain network (DBN) algorithms in building age-prediction models. They found GTB to have higher age prediction performance but DBN to have better utility in explaining cognitive functioning. In this case, an algorithm with better utility (e.g., DBN) should be used for explaining a phenotype of interest. Similarly, Bashyam and colleagues (2020) built different DBN-based age-prediction models, varying in age-prediction performance. The DBN models with a higher number of epochs corresponded to higher age-prediction performance. However, DBN-based age-prediction models with a moderate (as opposed to higher or lower) number of epochs were better at classifying Alzheimer’s disease, mild cognitive impairment and schizophrenia. In this case, a model from the same algorithm with better utility (e.g., those DBN with a moderate epoch number) should be used for explaining a phenotype of interest.

      Accordingly, this calls for a change in research practice, as recently pointed out by Jirsaraie and colleagues (2023, p7), “Despite mounting evidence, there is a persisting assumption across several studies that the most accurate brain age models will have the most potential for detecting differences in a given phenotype of interest”. Future neuroimaging research should aim to build age-prediction models that are not necessarily good at predicting age, but at capturing phenotypes of interest.”

      Reviewer #1 Public Review #2

      Second, from a methods perspective, there is not a sufficient explanation of the methodological procedures in the current manuscript to fully understand how the stacked regression models were constructed. I would request that the authors provide more information to enable the reader to better understand the stacked regression models used to ensure that these models are not overfit. 

      Thank you for allowing us an opportunity to clarify our stacked model. We made additional clarification to make this clearer (see below). We wanted to confirm that we did not use test sets to build a stacked model in both lower and higher levels of the Elastic Net models. Test sets were there just for testing the performance of the models.  

      From Methods:

      “We used nested cross-validation (CV) to build these prediction models (see Figure 7). We first split the data into five outer folds, leaving each outer fold with around 100 participants. This number of participants in each fold is to ensure the stability of the test performance across folds. In each outer-fold CV loop, one of the outer folds was treated as an outer-fold test set, and the rest was treated as an outer-fold training set. Ultimately, looping through the nested CV resulted in a) prediction models from each of the 18 sets of features as well as b) prediction models that drew information across different combinations of the 18 separate sets, known as “stacked models.” We specified eight stacked models: “All” (i.e., including all 18 sets of features),  “All excluding Task FC”, “All excluding Task Contrast”, “Non-Task” (i.e., including only Rest FC and sMRI), “Resting and Task FC”, “Task Contrast and FC”, “Task Contrast” and “Task FC”. Accordingly, there were 26 prediction models in total for both Brain Age and Brain Cognition.

      To create these 26 prediction models, we applied three steps for each outer-fold loop. The first step aimed at tuning prediction models for each of 18 sets of features. This step only involved the outer-fold training set and did not involve the outer-fold test set. Here, we divided the outer-fold training set into five inner folds and applied inner-fold CV to tune hyperparameters with grid search. Specifically, in each inner-fold CV, one of the inner folds was treated as an inner-fold validation set, and the rest was treated as an inner-fold training set. Within each inner-fold CV loop, we used the inner-fold training set to estimate parameters of the prediction model with a particular set of hyperparameters and applied the estimated model to the inner-fold validation set. After looping through the inner-fold CV, we, then, chose the prediction models that led to the highest performance, reflected by coefficient of determination (R2), on average across the inner-fold validation sets. This led to 18 tuned models, one for each of the 18 sets of features, for each outer fold.

      The second step aimed at tuning stacked models. Same as the first step, the second step only involved the outer-fold training set and did not involve the outer-fold test set. Here, using the same outer-fold training set as the first step, we applied tuned models, created from the first step, one from each of the 18 sets of features, resulting in 18 predicted values for each participant. We, then, re-divided this outer-fold training set into new five inner folds. In each inner fold, we treated different combinations of the 18 predicted values from separate sets of features as features to predict the targets in separate “stacked” models. Same as the first step, in each inner-fold CV loop, we treated one out of five inner folds as an inner-fold validation set, and the rest as an inner-fold training set. Also as in the first step, we used the inner-fold training set to estimate parameters of the prediction model with a particular set of hyperparameters from our grid. We tuned the hyperparameters of stacked models using grid search by selecting the models with the highest R2 on average across the inner-fold validation sets. This led to eight tuned stacked models.

      The third step aimed at testing the predictive performance of the 18 tuned prediction models from each of the set of features, built from the first step, and eight tuned stacked models, built from the second step. Unlike the first two steps, here we applied the already tuned models to the outer-fold test set. We started by applying the 18 tuned prediction models from each of the sets of features to each observation in the outer-fold test set, resulting in 18 predicted values. We then applied the tuned stacked models to these predicted values from separate sets of features, resulting in eight predicted values. 

      To demonstrate the predictive performance, we assessed the similarity between the observed values and the predicted values of each model across outer-fold test sets, using Pearson’s r, coefficient of determination (R2) and mean absolute error (MAE). Note that for R2, we used the sum of squares definition (i.e., R2 = 1 – (sum of squares residuals/total sum of squares)) per a previous recommendation (Poldrack et al., 2020). We considered the predicted values from the outer-fold test sets of models predicting age or fluid cognition, as Brain Age and Brain Cognition, respectively.”
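      The three performance indices named above can be written out as a short sketch. The function names and the toy numbers are illustrative assumptions, not the authors' code. The toy example is chosen to show why the sum-of-squares definition of R2 matters: predictions that are perfectly correlated with the observed values but systematically offset give r = 1 while R2 is negative.

```python
import numpy as np


def pearson_r(observed, predicted):
    """Pearson correlation between observed and predicted values."""
    return float(np.corrcoef(observed, predicted)[0, 1])


def r2_sum_of_squares(observed, predicted):
    """R^2 = 1 - (residual sum of squares / total sum of squares)."""
    ss_res = np.sum((observed - predicted) ** 2)
    ss_tot = np.sum((observed - np.mean(observed)) ** 2)
    return float(1.0 - ss_res / ss_tot)


def mean_absolute_error(observed, predicted):
    """Average absolute difference between observed and predicted values."""
    return float(np.mean(np.abs(observed - predicted)))


# Toy values: predictions are a linear function of the observations,
# so they correlate perfectly, yet they are biased upwards.
observed = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
predicted = 0.5 * observed + 4.0

r = pearson_r(observed, predicted)
r2 = r2_sum_of_squares(observed, predicted)
mae = mean_absolute_error(observed, predicted)
print(f"r = {r:.3f}, R2 = {r2:.3f}, MAE = {mae:.3f}")
```

      Squaring Pearson's r would report 1.0 here, whereas the sum-of-squares R2 penalises the offset and the compressed scale of the predictions.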

      Author response image 1.

      Diagram of the nested cross-validation used for creating predictions for models of each set of features as well as predictions for stacked models. 

      Note some previous research, including ours (Tetereva et al., 2022), splits the observations in the outer-fold training set into layer 1 and layer 2 and applies the first and second steps to layers 1 and 2, respectively. Here we decided against this approach and used the same outer-fold training set for both first and second steps in order to avoid potential bias toward the stacked models. This is because, when the data are split into two layers, predictive models built for each separate set of features only use the data from layer 1, while the stacked models use the data from both layers 1 and 2. In practice with large enough data, these two approaches might not differ much, as we demonstrated previously (Tetereva et al., 2022).
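      The three-step nested procedure described above, with both tuning steps run on the same outer-fold training set, can be sketched as follows. This is a simplified illustration under stated assumptions, not the authors' implementation: closed-form ridge regression stands in for Elastic Net so that the example needs only NumPy, the hyperparameter grid is arbitrary, the data are simulated, and each tuned model is refitted on the whole outer-fold training set after tuning (the refit is an assumption). All function names are hypothetical.

```python
import numpy as np


def ridge_fit(X, y, alpha):
    """Closed-form ridge regression on centred data; returns (coef, intercept)."""
    x_mean, y_mean = X.mean(axis=0), y.mean()
    Xc = X - x_mean
    gram = Xc.T @ Xc + alpha * np.eye(X.shape[1])
    coef = np.linalg.solve(gram, Xc.T @ (y - y_mean))
    return coef, y_mean - x_mean @ coef


def predict(model, X):
    coef, intercept = model
    return X @ coef + intercept


def r2(y, pred):
    return 1.0 - np.sum((y - pred) ** 2) / np.sum((y - y.mean()) ** 2)


def tune(X, y, alphas, n_inner, rng):
    """Grid search with inner-fold CV, then refit the best alpha on all rows."""
    folds = np.array_split(rng.permutation(len(y)), n_inner)

    def cv_score(alpha):
        scores = []
        for val in folds:
            train = np.setdiff1d(np.arange(len(y)), val)
            model = ridge_fit(X[train], y[train], alpha)
            scores.append(r2(y[val], predict(model, X[val])))
        return np.mean(scores)

    return ridge_fit(X, y, max(alphas, key=cv_score))


def nested_cv_stacking(feature_sets, y, alphas=(0.1, 1.0, 10.0),
                       n_outer=5, n_inner=5, seed=0):
    rng = np.random.default_rng(seed)
    stacked_pred = np.empty(len(y))
    for test in np.array_split(rng.permutation(len(y)), n_outer):
        train = np.setdiff1d(np.arange(len(y)), test)
        # Step 1: tune one model per set of features (training set only).
        base = [tune(X[train], y[train], alphas, n_inner, rng)
                for X in feature_sets]
        # Step 2: base-model predictions on the same training set become the
        # stacked model's features; tune() draws a new inner-fold split.
        z_train = np.column_stack(
            [predict(m, X[train]) for m, X in zip(base, feature_sets)])
        stack = tune(z_train, y[train], alphas, n_inner, rng)
        # Step 3: apply the already tuned models to the outer-fold test set.
        z_test = np.column_stack(
            [predict(m, X[test]) for m, X in zip(base, feature_sets)])
        stacked_pred[test] = predict(stack, z_test)
    return stacked_pred


# Simulated data: three sets of features, each contributing to the target.
rng = np.random.default_rng(1)
n = 200
feature_sets = [rng.normal(size=(n, 5)) for _ in range(3)]
y = sum(X[:, 0] for X in feature_sets) + 0.5 * rng.normal(size=n)

stacked_pred = nested_cv_stacking(feature_sets, y)
print(f"out-of-sample R2 of the stacked model: {r2(y, stacked_pred):.3f}")
```

      The outer-fold test indices are never passed to `tune`, so every prediction in `stacked_pred` comes from models that did not see that participant during tuning or fitting.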

      Reviewer #1 Public Review #3

      Please also provide an indication of the different regression strengths that were estimated across the different models and cross-validation splits. Also, how stable were the weights across splits? 

      The focus of this article is on the predictions. Still, it is informative for readers to understand how stable the feature importance (i.e., Elastic Net coefficients) is. To demonstrate the stability of feature importance, we now examined the rank stability of feature importance using Spearman’s ρ (see Figure 4). Specifically, we correlated the feature importance between two prediction models of the same features, used in two different outer-fold test sets. Given that there were five outer-fold test sets, we computed 10 Spearman’s ρ values for each prediction model of the same features. We found that Spearman’s ρ varied dramatically in both age-prediction (range = .31-.94) and fluid cognition-prediction (range = .16-.84) models. This means that some prediction models were much more stable in their feature importance than others. This is probably due to various factors such as a) the collinearity of features in the model, b) the number of features (e.g., 71,631 features in functional connectivity, which were further reduced to 75 principal components, as compared to 19 features in subcortical volume based on the ASEG atlas), c) the penalisation of coefficients either with ‘Ridge’ or ‘Lasso’ methods, which resulted in reduction as a group of features or selection of a feature among correlated features, respectively, and d) the predictive performance of the models. Understanding the stability of feature importance is beyond the scope of the current article. As mentioned by Reviewer 1, “The predictions can be stable when the coefficients are not,” and we chose to focus on the prediction in the current article.

      Author response image 2.

      Stability of feature importance (i.e., Elastic Net Coefficients) of prediction models. Each dot represents rank stability (reflected by Spearman’s ρ) in the feature importance between two prediction models of the same features, used in two different outer-fold test sets. Given that there were five outer-fold test sets, there were 10 Spearman’s ρs for each prediction model.  The numbers to the right of the plots indicate the mean of Spearman’s ρ for each prediction model.  
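      The pairwise rank-stability computation described above can be sketched as follows. The coefficient vectors are simulated and the function names are hypothetical. Spearman's ρ is computed as the Pearson correlation of average ranks, which handles tied coefficients (for example, several coefficients shrunk to exactly zero).

```python
from itertools import combinations

import numpy as np


def average_ranks(values):
    """Ranks starting at 1, with tied values sharing their average rank."""
    values = np.asarray(values, dtype=float)
    order = np.argsort(values, kind="mergesort")
    ranks = np.empty(len(values))
    ranks[order] = np.arange(1, len(values) + 1)
    # Replace the ranks of tied values by the mean rank of their group.
    _, inverse, counts = np.unique(
        values, return_inverse=True, return_counts=True)
    rank_sums = np.bincount(inverse, weights=ranks)
    return rank_sums[inverse] / counts[inverse]


def spearman_rho(a, b):
    """Spearman's rho as the Pearson correlation of the average ranks."""
    return float(np.corrcoef(average_ranks(a), average_ranks(b))[0, 1])


def rank_stability(coefs_by_fold):
    """Spearman's rho for every pair of outer folds."""
    pairs = combinations(range(len(coefs_by_fold)), 2)
    return [spearman_rho(coefs_by_fold[i], coefs_by_fold[j]) for i, j in pairs]


# Simulated coefficients: a shared pattern plus fold-specific noise.
rng = np.random.default_rng(0)
shared = rng.normal(size=50)
coefs_by_fold = [shared + 0.5 * rng.normal(size=50) for _ in range(5)]

rhos = rank_stability(coefs_by_fold)
print(f"{len(rhos)} pairs, mean rho = {np.mean(rhos):.2f}")
```

      With five outer folds there are 5 choose 2 = 10 pairs, which matches the 10 Spearman's ρ values reported for each prediction model.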

      Reviewer #1 Public Review #4

      Please provide more details about the task designs, MRI processing procedures that were employed on this sample in addition to the regression methods and bias correction methods used. For example, there are several different parameterisations of the elastic net, please provide equations to describe the method used here so that readers can easily determine how the regularisation parameters should be interpreted.  

      Thank you for the opportunity to provide more methodological details.

      First, for the task design, we included the following statements:

      From Methods:

      “HCP-A collected fMRI data from three tasks: Face Name (Sperling et al., 2001), Conditioned Approach Response Inhibition Task (CARIT) (Somerville et al., 2018) and VISual MOTOR (VISMOTOR) (Ances et al., 2009). 

      First, the Face Name task (Sperling et al., 2001) taps into episodic memory. The task had three blocks. In the encoding block [Encoding], participants were asked to memorise the names of faces shown. These faces were then shown again in the recall block [Recall] when the participants were asked if they could remember the names of the previously shown faces. There was also the distractor block [Distractor] occurring between the encoding and recall blocks. Here participants were distracted by a Go/NoGo task. We computed six contrasts for this Face Name task: [Encode], [Recall], [Distractor], [Encode vs. Distractor], [Recall vs. Distractor] and [Encode vs. Recall].

      Second, the CARIT task (Somerville et al., 2018) was adapted from the classic Go/NoGo task and taps into inhibitory control. Participants were asked to press a button to all [Go] but not to two [NoGo] shapes. We computed three contrasts for the CARIT task: [NoGo], [Go] and [NoGo vs. Go]. 

      Third, the VISMOTOR task (Ances et al., 2009) was designed to test simple activation of the motor and visual cortices. Participants saw a checkerboard with a red square either on the left or right. They needed to press a corresponding key to indicate the location of the red square. We computed just one contrast for the VISMOTOR task: [Vismotor], which indicates the presence of the checkerboard vs. baseline.” 

      Second, for MRI processing procedures, we included the following statements.

      From Methods:

      “HCP-A provides details of parameters for brain MRI elsewhere (Bookheimer et al., 2019; Harms et al., 2018). Here we used MRI data that were pre-processed by the HCP-A with recommended methods, including the MSMALL alignment (Glasser et al., 2016; Robinson et al., 2018) and ICA-FIX (Glasser et al., 2016) for functional MRI. We used multiple brain MRI modalities, covering task functional MRI (task fMRI), resting-state functional MRI (rsfMRI) and structural MRI (sMRI), and organised them into 19 sets of features.”

      “Sets of Features 1-10: Task fMRI contrast (Task Contrast)

      Task contrasts reflect fMRI activation relevant to events in each task. Bookheimer and colleagues (2019) provided detailed information about the fMRI in HCP-A. Here we focused on the pre-processed task fMRI Connectivity Informatics Technology Initiative (CIFTI) files with a suffix, “_PA_Atlas_MSMAll_hp0_clean.dtseries.nii.” These CIFTI files encompassed both the cortical mesh surface and subcortical volume (Glasser et al., 2013). Collected using the posterior-to-anterior (PA) phase, these files were aligned using MSMALL (Glasser et al., 2016; Robinson et al., 2018), linearly detrended (see https://groups.google.com/a/humanconnectome.org/g/hcp-users/c/ZLJc092h980/m/GiihzQAUAwAJ) and cleaned from potential artifacts using ICA-FIX (Glasser et al., 2016).

      To extract Task Contrasts, we regressed the fMRI time series on the convolved task events using a double-gamma canonical hemodynamic response function via FMRIB Software Library (FSL)’s FMRI Expert Analysis Tool (FEAT) (Woolrich et al., 2001). We kept FSL’s default high pass cutoff at 200s (i.e., .005 Hz). We then parcellated the contrast ‘cope’ files, using the Glasser atlas (Gordon et al., 2016) for cortical surface regions and FreeSurfer’s automatic segmentation (aseg) (Fischl et al., 2002) for subcortical regions. This resulted in 379 regions, whose number was, in turn, the number of features for each Task Contrast set of features.”

      “Sets of Features 11-13: Task fMRI functional connectivity (Task FC)

      Task FC reflects functional connectivity (FC) among the brain regions during each task, which is considered an important source of individual differences (Elliott et al., 2019; Fair et al., 2007; Gratton et al., 2018). We used the same CIFTI file “_PA_Atlas_MSMAll_hp0_clean.dtseries.nii.” as the task contrasts. Unlike Task Contrasts, here we treated the double-gamma, convolved task events as regressors of no interest and focused on the residuals of the regression from each task (Fair et al., 2007). We computed these regressors in FSL and regressed them out in nilearn (Abraham et al., 2014). Following previous work on task FC (Elliott et al., 2019), we applied a highpass at .008 Hz. For parcellation, we used the same atlases as Task Contrast (Fischl et al., 2002; Glasser et al., 2016). We computed Pearson’s correlations of each pair of 379 regions, resulting in a table of 71,631 non-overlapping FC indices for each task. We then applied r-to-z transformation and principal component analysis (PCA) of 75 components (Rasero et al., 2021; Sripada et al., 2019, 2020). Note that, to avoid data leakage, we conducted the PCA on each training set and applied its definition to the corresponding test set. Accordingly, there were three sets of 75 features for Task FC, one for each task.
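
      As a rough illustration of the vectorisation step described above, the sketch below computes Pearson correlations for every unique pair of regions and applies the Fisher r-to-z transformation. It is a minimal, standard-library-only sketch and not the authors' pipeline: the function names and the toy time series are hypothetical, and the subsequent 75-component PCA (fitted on the training set only) is not shown.

```python
import math
from itertools import combinations


def pearson_r(x, y):
    """Pearson correlation between two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def fc_vector(region_series):
    """Fisher r-to-z transformed correlation for every unique region pair."""
    pairs = combinations(range(len(region_series)), 2)
    return [math.atanh(pearson_r(region_series[i], region_series[j]))
            for i, j in pairs]


def n_unique_pairs(n_regions):
    """Number of non-overlapping region pairs (upper triangle, no diagonal)."""
    return n_regions * (n_regions - 1) // 2


# Toy residual time series for three hypothetical regions (not real data).
toy_series = [
    [0.1, 0.4, 0.3, 0.9, 0.5],
    [0.2, 0.5, 0.1, 0.8, 0.6],
    [0.9, 0.2, 0.7, 0.1, 0.4],
]
toy_fc = fc_vector(toy_series)
```

      With 379 regions, `n_unique_pairs(379)` gives the 71,631 FC indices quoted in the text.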

      Set of Features 14: Resting-state functional MRI functional connectivity (Rest FC)

      Similar to Task FC, Rest FC reflects functional connectivity (FC) among the brain regions, except that Rest FC occurred during the resting (as opposed to task-performing) period. HCP-A collected Rest FC from four 6.42-min (488 frames) runs across two days, yielding approximately 26 min of data (Harms et al., 2018). On each day, the study scanned two runs of Rest FC, starting with anterior-to-posterior (AP) and then with posterior-to-anterior (PA) phase encoding polarity. We used the “rfMRI_REST_Atlas_MSMAll_hp0_clean.dscalar.nii” file that was preprocessed and concatenated across the four runs. We applied the same computations (i.e., highpass filter, parcellation, Pearson’s correlations, r-to-z transformation and PCA) as for Task FC.

      Sets of Features 15-18: Structural MRI (sMRI)

      sMRI reflects individual differences in brain anatomy. The HCP-A used an established preprocessing pipeline for sMRI (Glasser et al., 2013). We focused on four sets of features: cortical thickness, cortical surface area, subcortical volume and total brain volume. For cortical thickness and cortical surface area, we used Destrieux’s atlas (Destrieux et al., 2010; Fischl, 2012) from FreeSurfer’s “aparc.stats” file, resulting in 148 regions for each set of features. For subcortical volume, we used the aseg atlas (Fischl et al., 2002) from FreeSurfer’s “aseg.stats” file, resulting in 19 regions. For total brain volume, we had five FreeSurfer-based features: “FS_IntraCranial_Vol” or estimated intra-cranial volume, “FS_TotCort_GM_Vol” or total cortical grey matter volume, “FS_Tot_WM_Vol” or total cortical white matter volume, “FS_SubCort_GM_Vol” or total subcortical grey matter volume and “FS_BrainSegVol_eTIV_Ratio” or ratio of brain segmentation volume to estimated total intracranial volume.”

      Third, for regression methods and bias correction methods used, we included the following statements:

      From Methods:

      “For the machine learning algorithm, we used Elastic Net (Zou & Hastie, 2005). Elastic Net is a general form of penalised regressions (including Lasso and Ridge regression), allowing us to simultaneously draw information across different brain indices to predict one target variable. Penalised regressions are commonly used for building age-prediction models (Jirsaraie, Gorelik, et al., 2023). Previously we showed that the performance of Elastic Net in predicting cognitive abilities is on par with, if not better than, many non-linear and more complicated algorithms (Pat, Wang, Bartonicek, et al., 2022; Tetereva et al., 2022). Moreover, Elastic Net coefficients are readily explainable, allowing us to explain how our age-prediction and cognition-prediction models made the prediction from each brain feature (Molnar, 2019; Pat, Wang, Bartonicek, et al., 2022) (see below).

      Elastic Net simultaneously minimises the weighted sum of the features’ coefficients. The degree of penalty to the sum of the features’ coefficients is determined by a shrinkage hyperparameter ‘α’: the greater the α, the more the coefficients shrink, and the more regularised the model becomes. Elastic Net also includes another hyperparameter, ‘ℓ₁ ratio’, which determines the degree to which the sum of either the squared (known as ‘Ridge’; ℓ₁ ratio=0) or absolute (known as ‘Lasso’; ℓ₁ ratio=1) coefficients is penalised (Zou & Hastie, 2005). The objective function of Elastic Net as implemented by sklearn (Pedregosa et al., 2011) is defined as:

      $$\frac{1}{2 n_{\text{samples}}} \lVert y - X\beta \rVert_2^2 + \alpha \cdot \ell_1\,\text{ratio} \cdot \lVert \beta \rVert_1 + 0.5 \cdot \alpha \cdot (1 - \ell_1\,\text{ratio}) \cdot \lVert \beta \rVert_2^2$$

      where X denotes the features, y the target, β the coefficients, and n_samples the number of observations. In our grid search, we tuned two Elastic Net hyperparameters: α using 70 numbers in log space, ranging from .1 to 100, and ℓ₁ ratio using 25 numbers in linear space, ranging from 0 to 1.
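
      To make the penalty and the search grid concrete, the sketch below evaluates an Elastic Net objective of the form documented for sklearn's `ElasticNet` (squared error scaled by 1/(2n), plus an absolute-value penalty weighted by the ℓ₁ ratio and a squared penalty weighted by its complement) and builds the 70 × 25 hyperparameter grid described above. It is a standard-library illustration under stated assumptions rather than the authors' code: the helper names `elastic_net_objective`, `logspace` and `linspace` are hypothetical, and the toy inputs used to check it are not data from the study.

```python
import math


def elastic_net_objective(X, y, beta, alpha, l1_ratio):
    """Elastic Net loss for coefficients `beta` (no intercept, for brevity)."""
    n = len(y)
    residual_ss = sum(
        (yi - sum(xij * bj for xij, bj in zip(xi, beta))) ** 2
        for xi, yi in zip(X, y)
    )
    l1_norm = sum(abs(b) for b in beta)      # Lasso part
    l2_norm_sq = sum(b * b for b in beta)    # Ridge part
    return (residual_ss / (2 * n)
            + alpha * l1_ratio * l1_norm
            + 0.5 * alpha * (1 - l1_ratio) * l2_norm_sq)


def logspace(start, stop, num):
    """`num` values evenly spaced on a log10 scale from start to stop."""
    lo, hi = math.log10(start), math.log10(stop)
    step = (hi - lo) / (num - 1)
    return [10 ** (lo + i * step) for i in range(num)]


def linspace(start, stop, num):
    """`num` values evenly spaced on a linear scale from start to stop."""
    step = (stop - start) / (num - 1)
    return [start + i * step for i in range(num)]


alpha_grid = logspace(0.1, 100, 70)    # shrinkage hyperparameter
l1_ratio_grid = linspace(0, 1, 25)     # 0 = Ridge, 1 = Lasso
n_candidates = len(alpha_grid) * len(l1_ratio_grid)
```

      Under these assumptions the grid contains 1,750 candidate hyperparameter pairs per inner cross-validation loop.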

      To understand how Elastic Net made a prediction based on different brain features, we examined the coefficients of the tuned model. Elastic Net coefficients can be considered as feature importance, such that more positive Elastic Net coefficients lead to more positive predicted values and, similarly, more negative Elastic Net coefficients lead to more negative predicted values (Molnar, 2019; Pat, Wang, Bartonicek, et al., 2022). While the magnitude of Elastic Net coefficients is regularised (thus making it difficult for us to interpret the magnitude itself directly), we could still indicate that a brain feature with a higher magnitude weighs relatively more heavily in making a prediction. Another benefit of Elastic Net as a penalised regression is that the coefficients are less susceptible to collinearity among features as they have already been regularised (Dormann et al., 2013; Pat, Wang, Bartonicek, et al., 2022).

      Given that we used five-fold nested cross validation, different outer folds may have different degrees of ‘α’ and ‘ℓ₁ ratio’, causing the final coefficients to differ across folds. For instance, for certain sets of features, penalisation may not play a big part (i.e., higher or lower ‘α’ leads to similar predictive performance), resulting in different ‘α’ for different folds. To remedy this in the visualisation of Elastic Net feature importance, we refitted the Elastic Net model to the full dataset without splitting it into five folds and visualised the coefficients on brain images using the Brainspace (Vos De Wael et al., 2020) and Nilearn (Abraham et al., 2014) packages. Note that, unlike other sets of features, Task FC and Rest FC were modelled after data reduction via PCA. Thus, for Task FC and Rest FC, we first multiplied the absolute PCA scores (extracted from the ‘components_’ attribute of ‘sklearn.decomposition.PCA’) with Elastic Net coefficients and then summed the multiplied values across the 75 components, leaving 71,631 ROI-pair indices.
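
      The back-projection from PCA components to ROI pairs can be sketched as follows. This is a minimal illustration of the arithmetic described above (absolute component loadings multiplied by the Elastic Net coefficient of each component, then summed over components); the function name `roi_pair_importance` and the toy numbers are hypothetical and are not taken from the study.

```python
def roi_pair_importance(components, coefficients):
    """Sum over components of |loading| * Elastic Net coefficient.

    components:   one list of loadings per PCA component
                  (shape: n_components x n_roi_pairs)
    coefficients: one Elastic Net coefficient per PCA component
    """
    n_roi_pairs = len(components[0])
    return [
        sum(abs(component[j]) * coef
            for component, coef in zip(components, coefficients))
        for j in range(n_roi_pairs)
    ]


# Toy example: 2 components over 3 ROI pairs (the study used 75 x 71,631).
toy_components = [[0.6, -0.8, 0.0],
                  [0.0, 0.6, -0.8]]
toy_coefficients = [2.0, -1.0]
toy_importance = roi_pair_importance(toy_components, toy_coefficients)
```

      Because the loadings enter as absolute values, the sign of each ROI-pair index in this sketch is driven by the signs of the Elastic Net coefficients alone.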

      References

      Abraham, A., Pedregosa, F., Eickenberg, M., Gervais, P., Mueller, A., Kossaifi, J., Gramfort, A., Thirion, B., & Varoquaux, G. (2014). Machine learning for neuroimaging with scikit-learn. Frontiers in Neuroinformatics, 8, 14. https://doi.org/10.3389/fninf.2014.00014

      Ances, B. M., Liang, C. L., Leontiev, O., Perthen, J. E., Fleisher, A. S., Lansing, A. E., & Buxton, R. B. (2009). Effects of aging on cerebral blood flow, oxygen metabolism, and blood oxygenation level dependent responses to visual stimulation. Human Brain Mapping, 30(4), 1120–1132. https://doi.org/10.1002/hbm.20574

      Bashyam, V. M., Erus, G., Doshi, J., Habes, M., Nasrallah, I. M., Truelove-Hill, M., Srinivasan, D., Mamourian, L., Pomponio, R., Fan, Y., Launer, L. J., Masters, C. L., Maruff, P., Zhuo, C., Völzke, H., Johnson, S. C., Fripp, J., Koutsouleris, N., Satterthwaite, T. D., … on behalf of the ISTAGING Consortium, the P. A. disease C., ADNI, and CARDIA studies. (2020). MRI signatures of brain age and disease over the lifespan based on a deep brain network and 14 468 individuals worldwide. Brain, 143(7), 2312–2324. https://doi.org/10.1093/brain/awaa160

      Bookheimer, S. Y., Salat, D. H., Terpstra, M., Ances, B. M., Barch, D. M., Buckner, R. L., Burgess, G. C., Curtiss, S. W., Diaz-Santos, M., Elam, J. S., Fischl, B., Greve, D. N., Hagy, H. A., Harms, M. P., Hatch, O. M., Hedden, T., Hodge, C., Japardi, K. C., Kuhn, T. P., … Yacoub, E. (2019). The Lifespan Human Connectome Project in Aging: An overview. NeuroImage, 185, 335–348. https://doi.org/10.1016/j.neuroimage.2018.10.009

      Butler, E. R., Chen, A., Ramadan, R., Le, T. T., Ruparel, K., Moore, T. M., Satterthwaite, T. D., Zhang, F., Shou, H., Gur, R. C., Nichols, T. E., & Shinohara, R. T. (2021). Pitfalls in brain age analyses. Human Brain Mapping, 42(13), 4092–4101. https://doi.org/10.1002/hbm.25533

      Cole, J. H. (2020). Multimodality neuroimaging brain-age in UK biobank: Relationship to biomedical, lifestyle, and cognitive factors. Neurobiology of Aging, 92, 34–42. https://doi.org/10.1016/j.neurobiolaging.2020.03.014

      Destrieux, C., Fischl, B., Dale, A., & Halgren, E. (2010). Automatic parcellation of human cortical gyri and sulci using standard anatomical nomenclature. NeuroImage, 53(1), 1–15. https://doi.org/10.1016/j.neuroimage.2010.06.010

      Dormann, C. F., Elith, J., Bacher, S., Buchmann, C., Carl, G., Carré, G., Marquéz, J. R. G., Gruber, B., Lafourcade, B., Leitão, P. J., Münkemüller, T., McClean, C., Osborne, P. E., Reineking, B., Schröder, B., Skidmore, A. K., Zurell, D., & Lautenbach, S. (2013). Collinearity: A review of methods to deal with it and a simulation study evaluating their performance. Ecography, 36(1), 27–46. https://doi.org/10.1111/j.1600-0587.2012.07348.x

      Dubois, J., Galdi, P., Paul, L. K., & Adolphs, R. (2018). A distributed brain network predicts general intelligence from resting-state human neuroimaging data. Philosophical Transactions of the Royal Society B: Biological Sciences, 373(1756), 20170284. https://doi.org/10.1098/rstb.2017.0284

      Elliott, M. L., Knodt, A. R., Cooke, M., Kim, M. J., Melzer, T. R., Keenan, R., Ireland, D., Ramrakha, S., Poulton, R., Caspi, A., Moffitt, T. E., & Hariri, A. R. (2019). General functional connectivity: Shared features of resting-state and task fMRI drive reliable and heritable individual differences in functional brain networks. NeuroImage, 189, 516–532. https://doi.org/10.1016/j.neuroimage.2019.01.068

      Fair, D. A., Schlaggar, B. L., Cohen, A. L., Miezin, F. M., Dosenbach, N. U. F., Wenger, K. K., Fox, M. D., Snyder, A. Z., Raichle, M. E., & Petersen, S. E. (2007). A method for using blocked and event-related fMRI data to study “resting state” functional connectivity. NeuroImage, 35(1), 396–405. https://doi.org/10.1016/j.neuroimage.2006.11.051

      Fischl, B. (2012). FreeSurfer. NeuroImage, 62(2), 774–781. https://doi.org/10.1016/j.neuroimage.2012.01.021

      Fischl, B., Salat, D. H., Busa, E., Albert, M., Dieterich, M., Haselgrove, C., van der Kouwe, A., Killiany, R., Kennedy, D., Klaveness, S., Montillo, A., Makris, N., Rosen, B., & Dale, A. M. (2002). Whole Brain Segmentation. Neuron, 33(3), 341–355. https://doi.org/10.1016/S0896-6273(02)00569-X

      Glasser, M. F., Smith, S. M., Marcus, D. S., Andersson, J. L. R., Auerbach, E. J., Behrens, T. E. J., Coalson, T. S., Harms, M. P., Jenkinson, M., Moeller, S., Robinson, E. C., Sotiropoulos, S. N., Xu, J., Yacoub, E., Ugurbil, K., & Van Essen, D. C. (2016). The Human Connectome Project’s neuroimaging approach. Nature Neuroscience, 19(9), 1175–1187. https://doi.org/10.1038/nn.4361

      Glasser, M. F., Sotiropoulos, S. N., Wilson, J. A., Coalson, T. S., Fischl, B., Andersson, J. L., Xu, J., Jbabdi, S., Webster, M., Polimeni, J. R., Van Essen, D. C., & Jenkinson, M. (2013). The minimal preprocessing pipelines for the Human Connectome Project. NeuroImage, 80, 105–124. https://doi.org/10.1016/j.neuroimage.2013.04.127

      Gordon, E. M., Laumann, T. O., Adeyemo, B., Huckins, J. F., Kelley, W. M., & Petersen, S. E. (2016). Generation and Evaluation of a Cortical Area Parcellation from Resting-State Correlations. Cerebral Cortex, 26(1), 288–303. https://doi.org/10.1093/cercor/bhu239

      Gratton, C., Laumann, T. O., Nielsen, A. N., Greene, D. J., Gordon, E. M., Gilmore, A. W., Nelson, S. M., Coalson, R. S., Snyder, A. Z., Schlaggar, B. L., Dosenbach, N. U. F., & Petersen, S. E. (2018). Functional Brain Networks Are Dominated by Stable Group and Individual Factors, Not Cognitive or Daily Variation. Neuron, 98(2), 439-452.e5. https://doi.org/10.1016/j.neuron.2018.03.035

      Hahn, T., Fisch, L., Ernsting, J., Winter, N. R., Leenings, R., Sarink, K., Emden, D., Kircher, T., Berger, K., & Dannlowski, U. (2021). From ‘loose fitting’ to high-performance, uncertainty-aware brain-age modelling. Brain, 144(3), e31–e31. https://doi.org/10.1093/brain/awaa454

      Harms, M. P., Somerville, L. H., Ances, B. M., Andersson, J., Barch, D. M., Bastiani, M., Bookheimer, S. Y., Brown, T. B., Buckner, R. L., Burgess, G. C., Coalson, T. S., Chappell, M. A., Dapretto, M., Douaud, G., Fischl, B., Glasser, M. F., Greve, D. N., Hodge, C., Jamison, K. W., … Yacoub, E. (2018). Extending the Human Connectome Project across ages: Imaging protocols for the Lifespan Development and Aging projects. NeuroImage, 183, 972–984. https://doi.org/10.1016/j.neuroimage.2018.09.060

      Insel, T., Cuthbert, B., Garvey, M., Heinssen, R., Pine, D. S., Quinn, K., Sanislow, C., & Wang, P. (2010). Research Domain Criteria (RDoC): Toward a New Classification Framework for Research on Mental Disorders. American Journal of Psychiatry, 167(7), 748–751. https://doi.org/10.1176/appi.ajp.2010.09091379

      Jirsaraie, R. J., Gorelik, A. J., Gatavins, M. M., Engemann, D. A., Bogdan, R., Barch, D. M., & Sotiras, A. (2023). A systematic review of multimodal brain age studies: Uncovering a divergence between model accuracy and utility. Patterns, 4(4), 100712. https://doi.org/10.1016/j.patter.2023.100712

      Jirsaraie, R. J., Kaufmann, T., Bashyam, V., Erus, G., Luby, J. L., Westlye, L. T., Davatzikos, C., Barch, D. M., & Sotiras, A. (2023). Benchmarking the generalizability of brain age models: Challenges posed by scanner variance and prediction bias. Human Brain Mapping, 44(3), 1118–1128. https://doi.org/10.1002/hbm.26144

      Marquand, A. F., Rezek, I., Buitelaar, J., & Beckmann, C. F. (2016). Understanding Heterogeneity in Clinical Cohorts Using Normative Models: Beyond Case-Control Studies. Biological Psychiatry, 80(7), 552–561. https://doi.org/10.1016/j.biopsych.2015.12.023

      Molnar, C. (2019). Interpretable Machine Learning. A Guide for Making Black Box Models Explainable. https://christophm.github.io/interpretable-ml-book/

      Nimon, K., Lewis, M., Kane, R., & Haynes, R. M. (2008). An R package to compute commonality coefficients in the multiple regression case: An introduction to the package and a practical example. Behavior Research Methods, 40(2), 457–466. https://doi.org/10.3758/BRM.40.2.457

      Pat, N., Wang, Y., Anney, R., Riglin, L., Thapar, A., & Stringaris, A. (2022). Longitudinally stable, brain-based predictive models mediate the relationships between childhood cognition and socio-demographic, psychological and genetic factors. Human Brain Mapping, hbm.26027. https://doi.org/10.1002/hbm.26027

      Pat, N., Wang, Y., Bartonicek, A., Candia, J., & Stringaris, A. (2022). Explainable machine learning approach to predict and explain the relationship between task-based fMRI and individual differences in cognition. Cerebral Cortex, bhac235. https://doi.org/10.1093/cercor/bhac235

      Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., Blondel, M., Prettenhofer, P., Weiss, R., Dubourg, V., Vanderplas, J., Passos, A., Cournapeau, D., Brucher, M., Perrot, M., & Duchesnay, É. (2011). Scikit-learn: Machine Learning in Python. Journal of Machine Learning Research, 12(85), 2825–2830.

      Poldrack, R. A., Huckins, G., & Varoquaux, G. (2020). Establishment of Best Practices for Evidence for Prediction: A Review. JAMA Psychiatry, 77(5), 534–540. https://doi.org/10.1001/jamapsychiatry.2019.3671

      Rasero, J., Sentis, A. I., Yeh, F.-C., & Verstynen, T. (2021). Integrating across neuroimaging modalities boosts prediction accuracy of cognitive ability. PLOS Computational Biology, 17(3), e1008347. https://doi.org/10.1371/journal.pcbi.1008347

      Robinson, E. C., Garcia, K., Glasser, M. F., Chen, Z., Coalson, T. S., Makropoulos, A., Bozek, J., Wright, R., Schuh, A., Webster, M., Hutter, J., Price, A., Cordero Grande, L., Hughes, E., Tusor, N., Bayly, P. V., Van Essen, D. C., Smith, S. M., Edwards, A. D., … Rueckert, D. (2018). Multimodal surface matching with higher-order smoothness constraints. NeuroImage, 167, 453–465. https://doi.org/10.1016/j.neuroimage.2017.10.037

      Rokicki, J., Wolfers, T., Nordhøy, W., Tesli, N., Quintana, D. S., Alnæs, D., Richard, G., de Lange, A.-M. G., Lund, M. J., Norbom, L., Agartz, I., Melle, I., Nærland, T., Selbæk, G., Persson, K., Nordvik, J. E., Schwarz, E., Andreassen, O. A., Kaufmann, T., & Westlye, L. T. (2021). Multimodal imaging improves brain age prediction and reveals distinct abnormalities in patients with psychiatric and neurological disorders. Human Brain Mapping, 42(6), 1714–1726. https://doi.org/10.1002/hbm.25323

      Somerville, L. H., Bookheimer, S. Y., Buckner, R. L., Burgess, G. C., Curtiss, S. W., Dapretto, M., Elam, J. S., Gaffrey, M. S., Harms, M. P., Hodge, C., Kandala, S., Kastman, E. K., Nichols, T. E., Schlaggar, B. L., Smith, S. M., Thomas, K. M., Yacoub, E., Van Essen, D. C., & Barch, D. M. (2018). The Lifespan Human Connectome Project in Development: A large-scale study of brain connectivity development in 5–21 year olds. NeuroImage, 183, 456–468. https://doi.org/10.1016/j.neuroimage.2018.08.050

      Sperling, R. A., Bates, J. F., Cocchiarella, A. J., Schacter, D. L., Rosen, B. R., & Albert, M. S. (2001). Encoding novel face-name associations: A functional MRI study. Human Brain Mapping, 14(3), 129–139. https://doi.org/10.1002/hbm.1047

      Sripada, C., Angstadt, M., Rutherford, S., Kessler, D., Kim, Y., Yee, M., & Levina, E. (2019). Basic Units of Inter-Individual Variation in Resting State Connectomes. Scientific Reports, 9(1), Article 1. https://doi.org/10.1038/s41598-018-38406-5

      Sripada, C., Angstadt, M., Rutherford, S., Taxali, A., & Shedden, K. (2020). Toward a “treadmill test” for cognition: Improved prediction of general cognitive ability from the task activated brain. Human Brain Mapping, 41(12), 3186–3197. https://doi.org/10.1002/hbm.25007

      Tetereva, A., Li, J., Deng, J. D., Stringaris, A., & Pat, N. (2022). Capturing brain-cognition relationship: Integrating task-based fMRI across tasks markedly boosts prediction and test-retest reliability. NeuroImage, 263, 119588. https://doi.org/10.1016/j.neuroimage.2022.119588

      Vieira, B. H., Pamplona, G. S. P., Fachinello, K., Silva, A. K., Foss, M. P., & Salmon, C. E. G. (2022). On the prediction of human intelligence from neuroimaging: A systematic review of methods and reporting. Intelligence, 93, 101654. https://doi.org/10.1016/j.intell.2022.101654

      Vos De Wael, R., Benkarim, O., Paquola, C., Lariviere, S., Royer, J., Tavakol, S., Xu, T., Hong, S.J., Langs, G., Valk, S., Misic, B., Milham, M., Margulies, D., Smallwood, J., & Bernhardt, B. C. (2020). BrainSpace: A toolbox for the analysis of macroscale gradients in neuroimaging and connectomics datasets. Communications Biology, 3(1), 103. https://doi.org/10.1038/s42003-020-0794-7

      Woolrich, M. W., Ripley, B. D., Brady, M., & Smith, S. M. (2001). Temporal Autocorrelation in Univariate Linear Modeling of FMRI Data. NeuroImage, 14(6), 1370–1386. https://doi.org/10.1006/nimg.2001.0931

      Zou, H., & Hastie, T. (2005). Regularization and variable selection via the elastic net. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 67(2), 301–320. https://doi.org/10.1111/j.1467-9868.2005.00503.x

    1. Author Response

      The following is the authors’ response to the previous reviews.

      Reviewer #2 (Public Review):

      Summary:

      In the revised manuscript, the authors aim to investigate brain-wide activation patterns following administration of the anesthetics ketamine and isoflurane, and conduct comparative analysis of these patterns to understand shared and distinct mechanisms of these two anesthetics. To this end, they perform Fos immunohistochemistry in perfused brain sections to label active nuclei, use a custom pipeline to register images to the ABA framework and quantify Fos+ nuclei, and perform multiple complementary analyses to compare activation patterns across groups.

      In the latest revision, the authors have made some changes in response to our previous comments on how to fix the analyses. However, the revised analyses were not changed correctly and remain flawed in several fundamental ways.

      Critical problems:

      (1) Before one can perform higher level analyses such as hierarchical cluster or network hub (or PC) analysis, it is fundamental to validate that you have significant differences of the raw Fos expression values in the first place. First of all, this means showing figures with the raw data (Fos expression levels) in some form in Figures 2 and 3 before showing the higher level analyses in Figures 4 and 5; this is currently switched around. Second and most importantly, when you have a large number of brain areas with large differences in mean values and variance, you need to account for this in a meaningful way. Changing to log values is a step in the right direction for mean values but does not account well for differences in variance. Indeed, considering the large variances in brain areas with high mean values and variance, it is a little difficult to believe that all brain regions, especially brain areas with low mean values, passed corrections for multiple comparisons test. We suggested Z-scores relative to control values for each brain region; this would have accounted for wide differences in mean values and variance, but this was not done. Overall, validation of anesthesia-induced differences in Fos expression levels is not yet shown.

      (a) Reordering the figures.

      Thank you for your suggestion. We have added Figure 2 (for 201 brain regions) and Figure 2—figure supplement 1 (for 53 brain regions) to demonstrate the statistical differences in raw Fos expression between KET and ISO compared to their respective control groups. These figures specifically present the raw c-Fos expression levels for both KET and ISO in the same brain areas, providing a fundamental basis for the subsequent analyses. Additionally, we have moved the original Figures 4 and 5 to Figures 3 and 4.

      (b) Z-score transformation and validation of anesthesia-induced differences in Fos expression.

      Thank you for your suggestion. Before correcting for multiple comparisons, we log-transformed the c-Fos densities and then computed Z-scores relative to control values for each brain region. Indeed, through Z-score transformation, we have identified a larger number of significantly activated brain regions in Figure 2. The number of brain regions showing significant activation increased by 100 for KET and by 39 for ISO. We have accordingly updated the results section to include these findings in Lines 80-181. In addition, we have added the following content to the Statistical Analysis section in Line 489: "…In Figure 2 and Figure 2–figure supplement 1, c-Fos densities in both experimental and control groups were log-transformed. Z-scores were calculated for each brain region by normalizing these log-transformed values against the mean and standard deviation of its respective control group. This involved subtracting the control mean from the experimental value and dividing the result by the control standard deviation. For statistical analysis, Z-scores were compared to a null distribution with a zero mean, and adjustments were made for multiple comparisons using the Benjamini–Hochberg method with a 5% false discovery rate (Q)…".
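
      The normalisation and correction steps described above can be sketched as follows. This is a minimal, standard-library-only illustration rather than the authors' analysis code: the function names are hypothetical, the sketch assumes strictly positive densities (how zero counts were handled before the log transform is not stated here), and it takes the per-region p-values as given because the specific test used to compare Z-scores with the zero-mean null is not specified in the text.

```python
import math
import statistics


def control_z_scores(experimental_density, control_density):
    """Z-score log densities of one region against its own control group.

    Returns None when the control standard deviation is zero, which
    corresponds to a missing value for that region.
    """
    log_control = [math.log(v) for v in control_density]
    control_mean = statistics.mean(log_control)
    control_sd = statistics.stdev(log_control)
    if control_sd == 0:
        return None
    return [(math.log(v) - control_mean) / control_sd
            for v in experimental_density]


def benjamini_hochberg(p_values, q=0.05):
    """Step-up Benjamini-Hochberg procedure; True marks a rejected null."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    largest_rank = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / m * q:
            largest_rank = rank      # keep the largest rank that passes
    rejected = [False] * m
    for rank, idx in enumerate(order, start=1):
        if rank <= largest_rank:
            rejected[idx] = True
    return rejected
```

      Because the procedure is step-up, a region whose own p-value misses its rank threshold is still declared significant when a higher-ranked p-value passes, as in `benjamini_hochberg([0.01, 0.03, 0.035, 0.9])`.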

      Author response image 1.

      KET and ISO induced c-Fos expression relative to their respective control group across 201 distinct brain regions. Z-scores represent the normalized c-Fos expression in the KET and ISO groups, calculated against the mean and standard deviation from their respective control groups. Statistical analysis involved the comparison of Z-scores to a null distribution with a zero mean and adjustment for multiple comparisons using the Benjamini–Hochberg method at a 5% false discovery rate (*p < 0.05, **p < 0.01, ***p < 0.001). n = 6, 6, 8, 6 for the home cage, ISO, saline, and KET, respectively. Missing values resulted from zero standard deviations in control groups. Brain regions are categorized into major anatomical subdivisions, as shown on the left side of the graph.

      Author response image 2.

      KET and ISO induced c-Fos expression relative to their respective control group across 53 distinct brain regions. Z-scores for c-Fos expression in the KET and ISO groups were normalized to the mean and standard deviation of their respective control groups. Statistical analysis involved the comparison of Z-scores to a null distribution with a zero mean and adjustment for multiple comparisons using the Benjamini–Hochberg method at a 5% false discovery rate (*p < 0.05, **p < 0.01, ***p < 0.001). Brain regions are organized into major anatomical subdivisions, as indicated on the left side of the graph.

      (2) Let's assume for a moment that the raw Fos expression analyses indicate significant differences. They used hierarchical cluster analyses as a rationale for examining 53 brain areas in all subsequent analyses of Fos expression following isoflurane versus home cage or ketamine versus saline. Instead, the authors changed to 201 brain areas with no validated rationale other than effectively saying 'we wanted to look at more brain areas'. And then later, when they examined raw Fos expression values in Figures 4 and 5, they assess 43 brain areas for ketamine and 20 brain areas for isoflurane, without any rationale for choosing these numbers of brain areas. This is a particularly big problem when they are trying to compare effects of isoflurane versus ketamine on Fos expression in these brain areas - they did not compare the same brain areas.

      (a) Changing to 201 brain areas with validated rationale.

      Thank you for your question. We have revised the original text from “To enhance our analysis of c-Fos expression patterns induced by KET and ISO, we expanded our study to 201 subregions.” to Line 100: "…To enable a more detailed examination and facilitate clearer differentiation and comparison of the effects caused by KET and ISO, we subdivided the 53 brain regions into 201 distinct areas. This approach, guided by the standard mouse atlas available at http://atlas.brain-map.org/atlas, allowed for an in-depth analysis of the responses in various brain regions…". For the hierarchical cluster analyses from 53 to 201 brain regions, Line 215 now reads: "…To achieve a more granular analysis and better discern the responses between KET and ISO, we expanded our study from the initial 53 brain regions to 201 distinct subregions…"

      (b) Compare the same brain areas for KET and ISO and the rationale for why choosing these numbers of brain areas in Figures 3 and 4.

      We apologize for the confusion and lack of clarity regarding the selection of brain regions for analysis. In Figure 2 and Figure 2—figure supplement 1, we display the c-Fos expression in the same brain regions affected by KET and ISO. In Figures 3 and 4, we applied a uniform standard to specifically report the brain areas most prominently activated by KET and ISO, respectively. As specified in Line 104: "…Compared to the saline group, KET activated 141 out of a total of 201 brain regions (Figure 2). To further identify the brain regions that are most significantly affected by KET, we calculated Cohen's d for each region to quantify the magnitude of activation and subsequently focused on those regions that had a corrected p-value below 0.05 and effect size in the top 40% (Figure 3, Figure 3—figure supplement 1)…" and Line 142: "…Using the same criteria applied to KET, which involved selecting regions with Cohen's d values in the top 40% of significantly activated areas from Figure 2, we identified 32 key brain regions impacted by ISO (Figure 4, Figure 4—figure supplement 1).…".
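
      The selection rule described above can be sketched as follows. This is an illustrative sketch under stated assumptions, not the authors' code: it uses the pooled-standard-deviation form of Cohen's d, reads "top 40%" as the top 40% of the significantly activated regions ranked by d, and rounds the cut-off up; none of these choices is specified in the text. The function names and the toy region labels used to check it are hypothetical.

```python
import math
import statistics


def cohens_d(experimental, control):
    """Cohen's d using the pooled standard deviation of the two groups."""
    n_e, n_c = len(experimental), len(control)
    var_e = statistics.variance(experimental)
    var_c = statistics.variance(control)
    pooled_sd = math.sqrt(((n_e - 1) * var_e + (n_c - 1) * var_c)
                          / (n_e + n_c - 2))
    return (statistics.mean(experimental) - statistics.mean(control)) / pooled_sd


def top_regions(effect_sizes, corrected_p, p_threshold=0.05, top_fraction=0.4):
    """Significant regions whose effect size is in the top fraction.

    effect_sizes, corrected_p: dicts keyed by region name.
    """
    significant = [r for r in effect_sizes if corrected_p[r] < p_threshold]
    significant.sort(key=lambda r: effect_sizes[r], reverse=True)
    n_keep = math.ceil(top_fraction * len(significant))
    return significant[:n_keep]
```

      Applying the same function with the same thresholds to the KET and ISO results is what makes the two region lists comparable, even though they differ in length.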

      Moreover, we illustrate the co-activated brain regions by KET and ISO in Figure 4C. As detailed in Lines 167-180: "…The co-activation of multiple brain regions by KET and ISO indicates that they have overlapping effects on brain functions. Examples of these effects include impacts on sensory processing, as evidenced by the activation of the PIR, ENT [1], and OT [2], pointing to changes in sensory perception typical of anesthetics. Memory and cognitive functions are influenced, as indicated by the activation of the subiculum (SUB) [3], dentate gyrus (DG) [4], and RE [5]. The reward and motivational systems are engaged, involving the ACB and ventral tegmental area (VTA), signaling the modulation of reward pathways [6]. Autonomic and homeostatic control are also affected, as shown by areas like the lateral hypothalamic area (LHA) [7] and medial preoptic area (MPO) [8], emphasizing effects on functions such as feeding and thermoregulation. Stress and arousal responses are impacted through the activation of the paraventricular hypothalamic nucleus (PVH) [10,11] and LC [12]. This broad activation pattern highlights the overlap in drug effects and the complexity of brain networks in anesthesia…". Below are the revised Figures 3 and 4.

      (1) Chapuis, J. et al. Lateral entorhinal modulation of piriform cortical activity and fine odor discrimination. J. Neurosci. 33, 13449-13459 (2013). https://doi.org/10.1523/jneurosci.1387-13.2013

      (2) Giessel, A. J. & Datta, S. R. Olfactory maps, circuits and computations. Curr. Opin. Neurobiol. 24, 120-132 (2014). https://doi.org/10.1016/j.conb.2013.09.010

      (3) Roy, D. S. et al. Distinct Neural Circuits for the Formation and Retrieval of Episodic Memories. Cell 170, 1000-1012.e1019 (2017). https://doi.org/10.1016/j.cell.2017.07.013

      (4) Sun, X. et al. Functionally Distinct Neuronal Ensembles within the Memory Engram. Cell 181, 410-423.e417 (2020). https://doi.org/10.1016/j.cell.2020.02.055

      (5) Huang, X. et al. A Visual Circuit Related to the Nucleus Reuniens for the Spatial-Memory-Promoting Effects of Light Treatment. Neuron (2021).

      (6) Al-Hasani, R. et al. Ventral tegmental area GABAergic inhibition of cholinergic interneurons in the ventral nucleus accumbens shell promotes reward reinforcement. Nat. Neurosci. 24, 1414-1428 (2021). https://doi.org/10.1038/s41593-021-00898-2

      (7) Mickelsen, L. E. et al. Single-cell transcriptomic analysis of the lateral hypothalamic area reveals molecularly distinct populations of inhibitory and excitatory neurons. Nat. Neurosci. 22, 642-656 (2019). https://doi.org/10.1038/s41593-019-0349-8

      (8) McGinty, D. & Szymusiak, R. Keeping cool: a hypothesis about the mechanisms and functions of slow-wave sleep. Trends Neurosci. 13, 480-487 (1990). https://doi.org/10.1016/0166-2236(90)90081-k

      (9) Mullican, S. E. et al. GFRAL is the receptor for GDF15 and the ligand promotes weight loss in mice and nonhuman primates. Nat. Med. 23, 1150-1157 (2017). https://doi.org/10.1038/nm.4392

      (10) Rasiah, N. P., Loewen, S. P. & Bains, J. S. Windows into stress: a glimpse at emerging roles for CRH(PVN) neurons. Physiol. Rev. 103, 1667-1691 (2023). https://doi.org/10.1152/physrev.00056.2021

      (11) Islam, M. T. et al. Vasopressin neurons in the paraventricular hypothalamus promote wakefulness via lateral hypothalamic orexin neurons. Curr. Biol. 32, 3871-3885.e3874 (2022). https://doi.org/10.1016/j.cub.2022.07.020

      (12) Ross, J. A. & Van Bockstaele, E. J. The Locus Coeruleus-Norepinephrine System in Stress and Arousal: Unraveling Historical, Current, and Future Perspectives. Front Psychiatry 11, 601519 (2020). https://doi.org/10.3389/fpsyt.2020.601519

      Author response image 3.

      Brain regions exhibiting significant activation by KET. (A) Fifty-five brain regions exhibited significant KET activation. These were chosen from the 201 regions analyzed in Figure 2, focusing on the top 40% ranked by effect size among those with corrected p values less than 0.05. Data are presented as mean ± SEM, with p-values adjusted for multiple comparisons (*p < 0.05, **p < 0.01, ***p < 0.001). (B) Representative immunohistochemical staining of brain regions identified in Figure 3A, with control group staining available in Figure 3—figure supplement 1. Scale bar: 200 µm.

      Author response image 4.

      Brain regions exhibiting significant activation by ISO. (A) Brain regions significantly activated by ISO were initially identified using a corrected p-value below 0.05. From these, the top 40% in effect size (Cohen’s d) were further selected, resulting in 32 key areas. p-values are adjusted for multiple comparisons (**p < 0.01, ***p < 0.001). (B) Representative immunohistochemical staining of brain regions identified in Figure 4A. Control group staining is available in Figure 4—figure supplement 1. Scale bar: 200 µm. (C) A Venn diagram displays 43 brain regions co-activated by KET and ISO, identified by the adjusted p-values (p < 0.05) for both KET and ISO. CTX: cerebral cortex; CNU: cerebral nuclei; TH: thalamus; HY: hypothalamus; MB: midbrain; HB: hindbrain.

      Less critical comments:

      (3) The explanation of hierarchical levels in lines 90-95 did not make sense.

      We have revised the section in lines 90-95, which initially stated: "…Based on the standard mouse atlas available at http://atlas.brain-map.org/, the mouse brain was segmented into nine hierarchical levels, totaling 984 regions. The primary level consists of grey matter, the secondary of the cerebrum, brainstem, and cerebellum, and the tertiary includes regions like the cerebral cortex and cerebellar nuclei, among others, with some regions extending to the 8th and 9th levels. The fifth level comprises 53 subregions, with detailed expression levels and their respective abbreviations presented in Supplementary Figure 2…". Our revised description, now in line 91, reads: "…Building upon the framework established in previous literature, our study categorizes the mouse brain into 53 distinct subregions [1]…"

      (1) Do JP, Xu M, Lee SH, Chang WC, Zhang S, Chung S, Yung TJ, Fan JL, Miyamichi K, Luo L et al: Cell type-specific long-range connections of basal forebrain circuit. Elife 2016, 5.

      (4) I am still perplexed by why the authors consider the prelimbic and infralimbic cortex 'neuroendocrine' brain areas in the abstract. In contrast, the prelimbic and infralimbic were described better in the introduction as "associated information processing" areas.

      Thank you for bringing this to our attention. We agree that classifying the prelimbic and infralimbic cortex as 'neuroendocrine' in the abstract was incorrect, which was an oversight on our part. In the revised version, as detailed in line 167, we observed an increased number of brain regions showing overlapping activation by both KET and ISO, which is depicted in Figure 4C. This extensive co-activation across various regions makes it challenging to narrowly define the functional classification of each area. Consequently, we have revised the abstract, updating this in line 21: "…KET and ISO both activate brain areas involved in sensory processing, memory and cognition, reward and motivation, as well as autonomic and homeostatic control, highlighting their shared effects on various neural pathways.…".

      (5) It looks like overall Fos levels in the control group Home (ISO) are an order of magnitude (~10-fold) lower than those in the control group Saline (KET) across all regions shown. This large difference seems unlikely to be due to a biologically driven effect and seems more likely to be due to a technical issue, such as differences in staining or imaging between experiments. The authors discuss this issue but did not answer whether the Homecage-ISO experiment, or at least the Fos labeling and imaging, was performed at the same time as the Saline-Ketamine experiment.

      Thank you for highlighting this important point. The c-Fos labeling and imaging for the Home (ISO) and Saline (KET) groups were carried out in separate sessions due to the extensive workload involved in these processes. This study processed a total of 26 brain samples. Sectioning the entire brain of each mouse required approximately 3 hours, yielding 5 slides, with each slide containing 12 to 16 brain sections. We were able to stain and image up to 20 slides simultaneously, typically comprising 2 experimental groups and 2 corresponding control groups. Imaging these 20 slides at 10x magnification took roughly 7 hours, while additional time was required for confocal imaging of specific areas of interest at 20x magnification. Given the complexity of these procedures, all experiments were conducted under uniform conditions to ensure consistency. This included the use of consistent primary and secondary antibody concentrations, incubation times, and imaging parameters such as fixed light intensity and exposure time. Furthermore, in the saline and KET groups, intraperitoneal injections might have evoked pain and stress responses in mice despite four days of pre-experiment acclimation, which could have contributed to the increased c-Fos expression observed. This aspect, along with the fact that procedures were conducted in separate sessions, might have introduced some variations. Thus, we have included a note in our discussion section in Line 353: "…Despite four days of acclimation, including handling and injections, intraperitoneal injections in the saline and KET groups might still elicit pain and stress responses in mice. This point is corroborated by the subtle yet measurable variations in brain states between the home cage and saline groups, characterized by changes in normalized EEG delta/theta power (home cage: 0.05±0.09; saline: -0.03±0.11) and EMG power (home cage: -0.37±0.34; saline: 0.04±0.13), as shown in Figure 1–figure supplement 1. These changes suggest a relative increase in brain activity in the saline group compared to the home cage group, potentially contributing to the higher c-Fos expression. Additionally, despite the use of consistent parameters for c-Fos labeling and imaging across all experiments, the substantial differences observed between the saline and home cage groups might be partly attributed to the fact that the operations were conducted in separate sessions.…"

      Reviewer #3 (Public Review):

      The present study presents a comprehensive exploration of the distinct impacts of Isoflurane and Ketamine on c-Fos expression throughout the brain. To understand the varying responses across individual brain regions to each anesthetic, the researchers employ principal component analysis (PCA) and c-Fos-based functional network analysis. The methodology employed in this research is both methodical and expansive. Notably, the utilization of a custom software package to align and analyze brain images for c-Fos positive cells stands out as an impressive addition to their approach. This innovative technique enables effective quantification of neural activity and enhances our understanding of how anesthetic drugs influence brain networks as a whole.

      The primary novelty of this paper lies in the comparative analysis of two anesthetics, Ketamine and Isoflurane, and their respective impacts on brain-wide c-Fos expression. The study reveals the distinct pathways through which these anesthetics induce loss of consciousness. Ketamine primarily influences the cerebral cortex, while Isoflurane targets subcortical brain regions. This finding highlights the differing mechanisms of action employed by these two anesthetics-a top-down approach for Ketamine and a bottom-up mechanism for Isoflurane. Furthermore, this study uncovers commonly activated brain regions under both anesthetics, advancing our knowledge about the mechanisms underlying general anesthesia.

      We are thankful for your positive and insightful comments on our study. Your recognition of the study's methodology and its significance in advancing our understanding of anesthetic mechanisms is greatly valued. By comprehensively mapping c-Fos expression across a wide range of brain regions, our study reveals the distinct and overlapping impacts of these anesthetics on various brain functions, providing a valuable foundation for future research into the mechanisms of general anesthesia, potentially guiding the development of more targeted anesthetic agents and therapeutic strategies. Thus, we are confident that our work will captivate the interest of our readers.

    2. Author Response

      The following is the authors’ response to the previous reviews.

      We appreciate the reviewers for their insightful feedback, which has substantially improved our manuscript. Following the suggestions of the reviewers, we have undertaken the following major revisions:

      a. Concerning data transformation, we have adjusted the methodology in Figures 2 and 3. Instead of normalizing c-Fos density to the whole brain c-Fos density as initially described, we now normalize to the c-Fos density of the corresponding brain region in the control group.

      b. We have substituted the PCA approach with hierarchical clustering in Figures 2 and 3.

      c. In the discussion section, we added a subsection on study limitations, focusing on the variations in drug administration routes and anesthesia depth.

      Enclosed are our detailed responses to each of the reviewer's comments.

      Reviewer #1:

      1a. The addition of the EEG/EMG is useful, however, this information is not discussed. For instance, there are differences in EEG/EMG between the two groups (only Ket significantly increased delta/theta power, and only ISO decreased EMG power). These results should be discussed as well as the limitation of not having physiological measures of anesthesia to control for the anesthesia depth.

      1b. The possibility that the differences in fos observed may be due to the doses used should be discussed.

      1c. The possibility that the differences in fos observed may be due to the kinetics of the anesthetics used should be discussed.

      Thank you for your suggestions. In the revised manuscript, we now discuss the EEG/EMG results, the limitation of not having physiological measures to control for anesthesia depth, and the possibility that the observed differences in c-Fos expression are due to the doses or the kinetics of the anesthetics (Lines 308-331, also shown below).

      Lines 308-331: "...Our findings indicate that c-Fos expression in the KET group is significantly elevated compared to the ISO group, and the saline group exhibits notably higher c-Fos expression than the home cage group, as seen in Supplementary Figures 2 and 3. Intraperitoneal saline injections in the saline group, despite pre-experiment acclimation with handling and injections for four days, may still evoke pain and stress responses in mice. Subtle yet measurable variations in brain states between the home cage and saline groups were observed, characterized by changes in normalized EEG delta/theta power (home cage: 0.05±0.09; saline: -0.03±0.11) and EMG power (home cage: -0.37±0.34; saline: 0.04±0.13), as shown in Supplementary Figure 1. These changes suggest a relative increase in overall brain activity in the saline group compared to the home cage group, potentially contributing to the higher c-Fos expression. Although the difference in EEG power between the ISO group and the home cage control was not significant, the increase in EEG power observed in the ISO group was similar to that of KET (0.47 ± 0.07 vs 0.59 ± 0.10), suggesting that both agents may induce loss of consciousness in mice. Regarding EMG power, ISO showed a significant decrease in EMG power compared to its control group. In contrast, the KET group showed a lesser reduction in EMG power (ISO: -1.815 ± 0.10; KET: -0.96 ± 0.21), which may partly explain the higher overall c-Fos expression levels in the KET group. This is consistent with previous studies where ketamine doses up to 150 mg/kg increase delta power while eliciting a wakefulness-like pattern of c-Fos expression across the brain [1]. Furthermore, the observed differences in c-Fos expression may arise in part from the dosages, routes of administration, and their distinct pharmacokinetic profiles. This variation is compounded by the lack of detailed physiological monitoring, such as blood pressure, heart rate, and respiration, affecting our ability to precisely assess anesthesia depth. Future studies incorporating comprehensive physiological monitoring and controlled dosing regimens are essential to further elucidate these relationships and refine our understanding of the effects of anesthetics on brain activity."

      1. Lu J, Nelson LE, Franks N, Maze M, Chamberlin NL, Saper CB: Role of endogenous sleep-wake and analgesic systems in anesthesia. J Comp Neurol 2008, 508(4):648-662.

      2b. I am confused because Fig 2C seems to show significant decrease in %fos in the hypothalamus, midbrain and cerebellum after KET, while the author responded that " in our analysis, we did not detect regions with significant downregulation when comparing anesthetized mice with controls." Moreover the new figure in the rebuttal in response to reviewer 2 suggests that Ket increases Fos in almost every single region (green vs blue) which is not the conclusion of the paper.

      Your concern regarding the apparent discrepancy is well-founded. The inconsistency arose due to an inappropriate data transformation, which affected the interpretation. We have now rectified this by adjusting the data transformation in Figures 2 and 3. Specifically, we have recalculated the log relative c-Fos density values relative to the control group for each brain region. This revision has resolved the issue, confirming that our analysis did not detect any regions with significant downregulation in the anesthetized mice compared to controls. We have also updated the results, discussion, and methods sections of Figures 2 and 3 to accurately reflect these changes and ensure consistency with our findings.
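      The difference between the two normalizations discussed here can be shown with a small numerical sketch. It is illustrative only and does not reproduce the study's pipeline: the log base (10), the use of the control-group mean as the denominator, the function names, and the density values are all assumptions.

```python
import math
from statistics import mean


def fraction_of_total(densities):
    """Each region's share of the summed density (whole-brain normalization)."""
    total = sum(densities.values())
    return {region: value / total for region, value in densities.items()}


def log_relative_density(treated, control):
    """log10(treated / control mean), computed separately for each region.

    treated, control: dict region -> list of densities, one value per animal.
    Densities must be positive; zeros would need separate handling.
    """
    return {
        region: [math.log10(value / mean(control[region])) for value in values]
        for region, values in treated.items()
    }


# Hypothetical densities: both regions increase, but CTX increases far more.
control = {"CTX": [10.0, 10.0], "HY": [10.0, 10.0]}
treated = {"CTX": [100.0, 100.0], "HY": [20.0, 20.0]}

share_control = fraction_of_total({r: mean(v) for r, v in control.items()})
share_treated = fraction_of_total({r: mean(v) for r, v in treated.items()})
relative = log_relative_density(treated, control)
```

      In this toy case the HY share of the total falls from 0.5 to about 0.17 even though its raw density doubles, which is how a whole-brain normalization can suggest inhibition that is absent in the raw data. The region-wise log ratio stays positive for both regions.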

      Author response image 1.

      Figure 2. Whole-brain distributions of c-Fos+ cells induced by ISO and KET. (A) Hierarchical clustering was performed on the log relative c-Fos density data for ISO and KET using the complete linkage method based on the Euclidean distance matrix, with clusters identified by a dendrogram cut-off ratio of 0.5. Numerical labels correspond to distinct clusters within the dendrogram. (B) Silhouette values plotted against the ratio of tree height for ISO and KET, indicating relatively higher Silhouette values at 0.5 (dashed line), which is associated with optimal clustering. (C) The number of clusters identified in each treatment condition at different ratios of the dendrogram tree height, with a cut-off level of 0.5 corresponding to 4 clusters for both ISO and KET (indicated by the dashed line). (D) The bar graph depicts Z scores for clusters in ISO and KET conditions, represented with mean values and standard errors. One-way ANOVA with Tukey's post hoc multiple comparisons. ns: no significance; ***P < 0.001. (E) Z-scored log relative density of c-Fos expression in the clustered brain regions. The order and abbreviations of the brain regions and the numerical labels correspond to those in Figure 2A. The red box denotes the cluster with the highest mean Z score in comparison to other clusters. CTX: cortex; TH: thalamus; HY: hypothalamus; MB: midbrain; HB: hindbrain.

      Author response image 2.

      Figure 3. Similarities and differences in ISO and KET activated c-Fos brain areas. (A) Hierarchical clustering was performed on the log-transformed relative c-Fos density data for ISO and KET using the complete linkage method based on the Euclidean distance matrix, with clusters identified by a dendrogram cut-off ratio of 0.5. (B) Silhouette values are plotted against the ratio of tree height from the hierarchical clustered dendrogram in Figure 3A. (C) The relationship between the number of clusters and the tree height ratio of the dendrogram for ISO and KET, with a cut-off ratio of 0.5 resulting in 3 clusters for ISO and 5 for KET (indicated by the dashed line). (D) The bar graph depicts Z scores for clusters in ISO and KET conditions, represented with mean values and standard errors. One-way ANOVA with Tukey's post hoc multiple comparisons. ns: no significance; ***P < 0.001. (E) Z-scored log relative density of c-Fos expression within the identified brain region clusters. The arrangement, abbreviations of the brain regions, and the numerical labels are in accordance with Figure 3A. The red boxes highlight brain regions that rank within the top 10 percent of Z score values. The white boxes denote brain regions with a Z score less than -2.

      1. There are still critical misinterpretations of the PCA analysis. For instance, it is mentioned that " KET is associated with the activation of cortical regions (as evidenced by positive PC1 coefficients in MOB, AON, MO, ACA, and ORB) and the inhibition of subcortical areas (indicated by negative coefficients) " as well as " KET displays cortical activation and subcortical inhibition, whereas ISO shows a contrasting preference, activating the cerebral nucleus (CNU) and the hypothalamus while inhibiting cortical areas. To reduce inter-individual variability." These interpretations are in complete contradiction with the answer 2b above that there was no region that had decreased Fos by either anesthetic.

      Thank you for bringing this to our attention. In response to your concerns, we have made significant revisions to our data analysis. We have updated our input data to incorporate log-transformed relative c-Fos density values, normalized against the control group for each brain region, as illustrated in Figures 2 and 3. Instead of PCA, we have applied this updated data to hierarchical clustering analysis. The results of these analyses are consistent with our original observation that neither anesthetic led to a decrease in Fos expression in any region.

      1. I still do not understand the rationale for the use of that metric. The use of a % of total Fos makes the data for each region dependent on the data of the other regions, which wrongly leads to the conclusion that some regions are inhibited while they are not when looking at the raw data. Moreover, the interdependence of the variable (relative density) may affect the covariance structure which the PCA relies upon. Why not use PCA on the logarithm of the raw data, or on a relative density compared to the control group on a region-per-region basis, instead of the whole brain?

      Thank you for your insightful suggestion. Following your advice, we have revised our approach and now utilize the logarithm of the relative density compared to the control group on a region-by-region basis. We attempted PCA analyses using the logarithm of the raw data, the logarithm of the Z-score, and the logarithm of the relative density compared to control, but none yielded distinct clusters.

      Author response image 3.

      As a result, we employed hierarchical cluster analysis. We then examined the Z-scores of the log-transformed relative c-Fos densities (Figures 2E and 3E) to assess expression levels across clusters. Our analysis revealed that neither ISO nor KET treatments led to a significant suppression of c-Fos expression in the 53 brain regions examined. In the ISO group alone, there were 10 regions that demonstrated relative suppression (Z-score < -2, indicated by white boxes) as shown in Figure 3.
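      For readers unfamiliar with the clustering step, the following is a minimal pure-Python sketch of complete-linkage agglomerative clustering on Euclidean distances, with the dendrogram cut at a fraction of its maximum height. It is not the authors' implementation. In particular, reading the "cut-off ratio of 0.5" as 0.5 times the largest merge height is an assumption, and the input points are made up.

```python
import math


def complete_linkage(points):
    """Naive agglomerative clustering; returns (height, members_a, members_b) merges."""
    clusters = [[i] for i in range(len(points))]
    merges = []

    def distance(a, b):
        # Complete linkage: the largest pairwise Euclidean distance between members.
        return max(math.dist(points[i], points[j]) for i in a for j in b)

    while len(clusters) > 1:
        pairs = [
            (distance(clusters[i], clusters[j]), i, j)
            for i in range(len(clusters))
            for j in range(i + 1, len(clusters))
        ]
        height, i, j = min(pairs)  # closest pair of clusters is merged next
        merges.append((height, clusters[i], clusters[j]))
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
    return merges


def cut_tree(points, ratio=0.5):
    """Clusters that remain when only merges at or below ratio * max height are kept."""
    merges = complete_linkage(points)
    threshold = ratio * max(height for height, _, _ in merges)
    parent = list(range(len(points)))

    def find(x):
        while parent[x] != x:
            x = parent[x]
        return x

    for height, a, b in merges:
        if height <= threshold:
            parent[find(a[0])] = find(b[0])

    groups = {}
    for i in range(len(points)):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())


# Hypothetical one-dimensional values standing in for five brain regions.
points = [(0.0,), (0.1,), (5.0,), (5.2,), (9.0,)]
clusters_at_half = cut_tree(points, ratio=0.5)
clusters_at_third = cut_tree(points, ratio=0.3)
```

      Lowering the ratio undoes more of the high merges and therefore yields more clusters, which is the relationship plotted in panel C of Figures 2 and 3. The sketch requires at least two points and is cubic in the number of points, which is acceptable for tens of regions.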

      Fig. 2B: it's unclear to me why the regions are connected by a line. Such representation is normally used for time series/within-subject series. What is the rationale for the order of the regions and the use of the line? The line connecting randomly organized regions is meaningless and confusing.

      Thank you for your suggestion. We have discontinued the use of PCA calculations and have removed this figure.

      Fig 6A. The correlation matrices are difficult to interpret because of the low resolution and arbitrary order of brain regions. I recommend using hierarchical clustering and/or a combination of hierarchical clustering and anatomical organization (e.g. PMID: 31937658). While it is difficult to add the name of the regions on the graph I recommend providing supplementary figures with large high-resolution figures with the name of each brain region so the reader can actually identify the correlation between specific brain regions and the whole brain.

      Rationale for Metric Choice: Note that I do not dispute the choice of the log which is appropriate, it is the choice of using the relative density that I am questioning.

      Thank you for your constructive feedback. In line with your suggestion, we have implemented hierarchical clustering combined with anatomical organization as per the referenced literature. Additionally, we have updated the vector diagrams in Figure 6A to present them with greater clarity.

      Furthermore, we have revised our network modular division method based on cited literature recommendations. We used hierarchical clustering with correlation coefficients to segment the network into modules, illustrated in Figure 6—figure supplement 1. Due to the singular module structure of the KET network and the sparsity of intermodular connections in the home cage and saline networks, the assessment of network hub nodes did not employ within-module degree Z-score and participation coefficients, as these measures predominantly underscore the importance of connections within and between modules. Instead, we used degree, betweenness centrality, and eigenvector centrality to detect the hub nodes, as detailed in Figure 6—figure supplement 2. With this new approach, the hub node for the KET condition changed from SS to TeA. Corresponding updates have been made to the results section for Figure 6, as well as to the related discussions and the abstract of our paper.
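      The network construction described here (edges between regions whose across-animal correlation exceeds a threshold, then ranking nodes by degree) can be sketched as follows. This is a simplified illustration rather than the analysis code: the significance test is omitted, only degree is computed (betweenness and eigenvector centrality would normally come from a graph library), and the region names and values are hypothetical.

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    covariance = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return covariance / (spread_x * spread_y)


def build_network(log_density, r_min=0.82):
    """Edges (region_a, region_b, r) for pairs with Pearson's r above r_min.

    log_density: dict region -> list of log c-Fos densities, one per animal.
    The threshold 0.82 is taken from the Figure 6 caption.
    """
    regions = sorted(log_density)
    edges = []
    for index, a in enumerate(regions):
        for b in regions[index + 1:]:
            r = pearson_r(log_density[a], log_density[b])
            if r > r_min:
                edges.append((a, b, r))
    return edges


def degree(edges):
    """Number of edges attached to each region that has at least one edge."""
    counts = {}
    for a, b, _ in edges:
        counts[a] = counts.get(a, 0) + 1
        counts[b] = counts.get(b, 0) + 1
    return counts


# Hypothetical data for three regions across six animals.
log_density = {
    "A": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
    "B": [1.1, 2.0, 3.2, 3.9, 5.1, 6.0],
    "C": [6.0, 1.0, 4.0, 2.0, 5.0, 3.0],
}
edges = build_network(log_density)
node_degree = degree(edges)
```

      Regions with no edge above the threshold do not appear in the degree table, so they would need to be added with a degree of zero before ranking. A region with identical values in every animal has zero variance and would raise a division error in this sketch.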

      Author response image 4.

      Figure 6. Generation of anesthetics-induced networks and identification of hub regions. (A) Heatmaps display the correlations of log c-Fos densities within brain regions (CTX, CNU, TH, HY, MB, and HB) for various states (home cage, ISO, saline, KET). Correlations are color-coded according to Pearson's coefficients. The brain regions within each anatomical category are organized by hierarchical clustering of their correlation coefficients. (B) Network diagrams illustrate significant positive correlations (P < 0.05) between regions, with Pearson’s r exceeding 0.82. Edge thickness indicates correlation magnitude, and node size reflects the number of connections (degree). Node color denotes betweenness centrality, with a spectrum ranging from dark blue (lowest) to dark red (highest). The networks are organized into modules consistent with the clustering depicted in Figure 6—figure supplement 1.

      Author response image 5.

      Figure 6—figure supplement 1. Hierarchical clustering of brain regions under various conditions: home cage, ISO, saline, and KET. (A) Heatmaps show the relative distances among brain regions assessed in naive mice. Modules were identified by sectioning each dendrogram at a 0.7 threshold. (B) Silhouette scores plotted against the dendrogram tree height ratio for each condition, with optimal cluster definition indicated by a dashed line at a 0.7 ratio. (C) The number of clusters formed at different cutoff levels. At a ratio of 0.7, ISO and saline treatments result in three clusters, whereas home cage and KET conditions yield two clusters. (D) The mean Pearson's correlation coefficient (r) was computed from interregional correlations displayed in Figure 6A. Data were analyzed using one-way ANOVA with Tukey’s post hoc test, ***P < 0.001.

      Author response image 6.

      Figure 6—figure supplement 2. Hub region characterization across different conditions: home cage (A), ISO (B), saline (C), and KET (D) treatments. Brain regions are sorted by degree, betweenness centrality, and eigenvector centrality, with each metric presented in separate bar graphs. Bars to the left of the dashed line indicate the top 20% of regions by rank, highlighting the most central nodes within the network. Red bars signify regions that consistently appear within the top rankings for both degree and betweenness centrality across the metrics.

      1. I am still having difficulties understanding Fig. 3.

      Panel A: The lack of identification for the dots in panel A makes it impossible to understand which regions are relevant.

      Panel B: what is the metric that the up/down arrow summarizes? Fos density? Relative density? PC1/2?

      Panel C: it's unclear to me why the regions are connected by a line. Such representation is normally used for time series/within-subject series. What is the rationale for the order of the regions?

      Thank you for your patience and for reiterating your concerns regarding Figure 3.

      a. In Panel A, we have substituted the original content with a display of hierarchical clustering results, which now clearly marks each brain region. This change aids readers in identifying regions with similar expression patterns and facilitates a more intuitive understanding of the data.

      b. Acknowledging that our analysis did not reveal any significantly inhibited brain regions, we have decided to remove the previous version of Panel B from the figure.

      c. We have discontinued the use of PCA calculations and have removed this figure to avoid any confusion it may have caused. Our revised analysis focuses on hierarchical clustering, the results of which are presented in the updated figures.

      Reviewer #2:

      1. Aside from issues with their data transformation (see below), (a) I think they have some interesting Fos counts data in Figures 4B and 5B that indicate shared and distinct activation patterns after KET vs. ISO based anesthesia. These data are far closer to the raw data than PC analyses and need to be described and analyzed in the first figures long before figures with the more abstracted PC analyses. In other words, you need to show the concrete raw data before describing the highly transformed and abstracted PC analyses. (b) This gets to the main point that when selecting brain areas for follow up analyses, these should be chosen based on the concrete Fos counts data, not the highly transformed and abstracted PC analyses.

      Thank you for your suggestions.

      a. We have added the original c-Fos cell density distribution maps for Figures 2, 3, 4, and 5 in Supplementary Figures 2 and 3 (also shown below). To maintain consistency across the document, we have updated both the y-axis label and the corresponding data in Figures 4B and 5B from 'c-Fos cell count' to 'c-Fos density'.

      b. The analyses in Figures 2 and 3 include all brain regions. Figures 4 and 5 present the brain regions with significant differences as shown in Figure 3—figure supplement 1.

      Author response image 7.

      Figure 2—figure supplement 1. The c-Fos density in 53 brain areas for different conditions (home cage, n = 6; ISO, n = 6; saline, n = 8; KET, n = 6 mice). Each point represents the c-Fos density in a specific brain region, denoted on the y-axis with both abbreviations and full names. Data are shown as mean ± SEM. Brain regions are categorized into 12 brain structures, as indicated on the right side of the graph.

      Author response image 8.

      Figure 3—figure supplement 1. c-Fos density visualization across 201 distinct brain regions under various conditions. The graph depicts the c-Fos density levels for each condition, with data presented as mean and standard error. Brain regions with statistically significant differences are featured in Figures 4 and 5. Brain regions are organized into major anatomical subdivisions, as indicated on the left side of the graph.

      1. Now, the choice of data transformation for Fos counts is the most significant problem. First, the authors show in the response letter that not using this transformation (region density/brain density) leads to no clustering. However, they also showed the region-densities without transformation (which we appreciate) and it looks like overall Fos levels in the control group Home (ISO) are a magnitude (~10-fold) higher than those in the control group Saline (KET) across all regions shown. This large difference seems unlikely to be due to a biologically driven effect and seems more likely to be due to a technical issue, such as differences in staining or imaging between experiments. Was the Homecage-ISO experiment or at least the Fos labeling and imaging performed at the same time as for the Saline-Ketamine experiment? Please state the answer to this question in the Results section one way or the other.

      a. “Home (ISO) are a magnitude (~10-fold) higher than those in the control group saline (KET) across all regions shown.” We believe you mean that the saline group (blue) shows roughly 10-fold higher expression than the home cage group (gray) (Supplementary Figure 2/3). Indeed, we observed that the total number of c-Fos cells in the home cage group is significantly lower than in the saline group. This difference may be due to reduced sleep during the light-on period (ZT 6 to ZT 7.5) in the saline mice or the pain and stress response caused by intraperitoneal injection of saline. We have explained this discrepancy in the discussion section, Lines 308-317 (also see below).

      “…Our findings indicate that c-Fos expression in the KET group is significantly elevated compared to the ISO group, and the saline group exhibits notably higher c-Fos expression than the home cage group, as seen in Supplementary Figures 2 and 3. Intraperitoneal saline injections in the saline group, despite pre-experiment acclimation with handling and injections for four days, may still evoke pain and stress responses in mice. Subtle yet measurable variations in brain states between the home cage and saline groups were observed, characterized by changes in normalized EEG delta/theta power (home cage: 0.05±0.09; saline: -0.03±0.11) and EMG power (home cage: -0.37±0.34; saline: 0.04±0.13), as shown in Figure 1—figure supplement 1. These changes suggest a relative increase in overall brain activity in the saline group compared to the home cage group, potentially contributing to the higher c-Fos expression…”

      b. Drug administration and tissue collection for both Homecage-ISO and Saline-Ketamine groups were consistently scheduled at 13:00 and 14:30, respectively. Four mice were administered drugs and had tissues collected each day, with two from the experimental group and two from the control group, to ensure consistent sampling. The 4% PFA fixation time, sucrose dehydration time, primary and secondary antibody concentrations and incubation times, staining, and imaging parameters and equipment (exposure time for VS120 imaging was fixed at 100ms) were all conducted according to a unified protocol.

      We have included the following statement in the Results section (lines 81-83): “Sample collection for all mice was uniformly conducted at 14:30 (ZT7.5), and the c-Fos labeling and imaging were performed using consistent parameters throughout all experiments.”

      1. Second, they need to deal with this large difference in overall staining or imaging for these two (Home/ISO and Saline/KET) experiments more directly; their current normalization choice does not really account for the large overall differences in mean values and variability in Fos counts (e.g. due to labeling and imaging differences).

      3a. I think one option (not perfect but I think better than the current normalization choice) could be z-scoring each treatment to its respective control. They can analyze these z-scored data first, and then in later figures show PC analyses of these data and assess whether the two treatments separate on PC1/2. And if they don't separate, then they don't separate, and you have to go with these results.

      3b. Alternatively, they need to figure out the overall intensity distributions from the different runs (if that is the main reason for the markedly different counts) and adjust their thresholds for Fos-positive cell detection based on this. I would expect that the saline and HC groups should have similar levels of activation, so they could use these as the 'control' group to determine a Fos-positive intensity threshold that gets applied to the corresponding 'treatment' group.

      3c. If neither 3a nor 3b is an option then they need to show the outcomes of their analysis when using the untransformed data in the main figures (the untransformed data plots in their responses to reviewer are currently not in the main or supplementary figs) and discuss these as well.

      a. Thank you very much for your valuable suggestion. We conducted PCA analysis on the ISO and KET data after Z-scoring them with their respective control groups and did not find any significant separation.

      Author response image 9.

      As mentioned in our response to reviewer #1, we have reprocessed the raw data. Firstly, we divided the ISO and KET data by their respective control brain regions and then performed a logarithmic transformation to obtain the log relative c-Fos density. The purpose of this is to eliminate the impact of baseline differences and reduce variability. We then performed hierarchical clustering, and finally, we Z-scored the log relative c-Fos density data. The aim is to facilitate comparison of ISO and KET on the same data dimension (Figure 2 and 3).
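
      The transformation described above (divide by the control, take the logarithm, then Z-score) can be sketched as follows. This is a minimal illustration rather than the authors' pipeline: the function names, the toy densities, the use of the natural logarithm, the per-region control mean as the baseline, and Z-scoring across regions are all assumptions, and the densities are assumed to be strictly positive.

```python
import math
import statistics

def log_relative_density(treated, control_mean):
    """Log of treated density divided by the matching control density, per region.

    Assumes strictly positive densities and a natural logarithm (both assumptions).
    """
    return [math.log(t / c) for t, c in zip(treated, control_mean)]

def zscore(values):
    """Z-score a list of values using the sample standard deviation."""
    mean = statistics.fmean(values)
    sd = statistics.stdev(values)
    return [(v - mean) / sd for v in values]

# Toy densities for three hypothetical regions (not real data).
control_mean = [2.0, 2.0, 4.0]   # e.g. mean of the saline or home cage group
treated = [4.0, 2.0, 2.0]        # e.g. one KET or ISO animal

log_rel = log_relative_density(treated, control_mean)
z = zscore(log_rel)
print(log_rel)
print(z)
```

      Dividing by the control before taking the logarithm makes a doubling and a halving symmetric around zero, which is one reason a log ratio is a common choice when baselines differ between experiments.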

      b. We appreciate your concerns regarding the detection thresholds for Fos-positive cells. The enclosed images, extracted from supplementary figures for Figures 4 and 5, demonstrate notable differences in c-Fos expression between saline and home cage groups in specific brain regions. These regions exhibit a discernible difference in staining intensity, with the saline group showing enhanced c-Fos expression in the PVH and PVT regions compared to the home cage group. An examination of supplementary figures for Figures 4 and 5 shows that c-Fos expression in the home cage group is consistently lower than in the saline group. This comparative analysis confirms that the discrepancies in c-Fos levels are not due to varying detection thresholds.

      Author response image 10.

      c. We have added the corresponding original data graphs to Supplementary Figures 2 and 3, and discussed the potential reasons for the significant differences between these groups in the discussion section (also shown below).

      Lines 308-317: "...Our findings indicate that c-Fos expression in the KET group is significantly elevated compared to the ISO group, and the saline group exhibits notably higher c-Fos expression than the home cage group, as seen in Supplementary Figures 2 and 3. Intraperitoneal saline injections in the saline group, despite pre-experiment acclimation with handling and injections for four days, may still evoke pain and stress responses in mice. Subtle yet measurable variations in brain states between the home cage and saline groups were observed, characterized by changes in normalized EEG delta/theta power (home cage: 0.05±0.09; saline: -0.03±0.11) and EMG power (home cage: -0.37±0.34; saline: 0.04±0.13), as shown in Figure 3—figure supplement 1. These changes suggest a relative increase in overall brain activity in the saline group compared to the home cage group, potentially contributing to the higher c-Fos expression.…”

    1. Author response:

      The following is the authors’ response to the original reviews.

      Responses to Reviewer’s Comments:  

      To Reviewer #2:

      (1) The use of two m<sup>5</sup>C reader proteins is likely a reason for the high number of edits introduced by the DRAM-Seq method. Both ALYREF and YBX1 are ubiquitous proteins with multiple roles in RNA metabolism including splicing and mRNA export. It is reasonable to assume that both ALYREF and YBX1 bind to many mRNAs that do not contain m<sup>5</sup>C. 

      To substantiate the author's claim that ALYREF or YBX1 binds m<sup>5</sup>C-modified RNAs to an extent that would allow distinguishing its binding to non-modified RNAs from binding to m<sup>5</sup>C-modified RNAs, it would be recommended to provide data on the affinity of these, supposedly proven, m<sup>5</sup>C readers to non-modified versus m<sup>5</sup>C-modified RNAs. To do so, this reviewer suggests performing experiments as described in Slama et al., 2020 (doi: 10.1016/j.ymeth.2018.10.020). However, using dot blots like in so many published studies to show modification of a specific antibody or protein binding, is insufficient as an argument because no antibody, nor protein, encounters nanograms to micrograms of a specific RNA identity in a cell. This issue remains a major caveat in all studies using so-called RNA modification reader proteins as bait for detecting RNA modifications in epitranscriptomics research. It becomes a pertinent problem if used as a platform for base editing similar to the work presented in this manuscript.

      The authors have tried to address the point made by this reviewer. However, rather than performing an experiment with recombinant ALYREF-fusions and m<sup>5</sup>C-modified to unmodified RNA oligos for testing the enrichment factor of ALYREF in vitro, the authors resorted to citing two manuscripts. One manuscript is cited by everybody when it comes to ALYREF as m<sup>5</sup>C reader; however, none of the experiments have been repeated by another laboratory. The other manuscript is reporting on YBX1 binding to m<sup>5</sup>C-containing RNA and mentions PAR-CLIP experiments with ALYREF, the details of which are nowhere to be found in doi: 10.1038/s41556-019-0361-y.

      Furthermore, the authors have added RNA pull-down assays that should substitute for the requested experiments. Interestingly, Figure S1E shows that ALYREF binds equally well to unmodified and m<sup>5</sup>C-modified RNA oligos, which contradicts doi:10.1038/cr.2017.55, and supports the conclusion that wild-type ALYREF is not a specific m<sup>5</sup>C binder. The necessity of always including an overexpression of ALYREF-mut in parallel DRAM experiments makes the developed method better controlled but not easy to handle (expression differences of the plasmid-driven proteins etc.).

      Thank you for pointing this out. First, we would like to correct our previous response: the binding ability of ALYREF to m<sup>5</sup>C-modified RNA was initially reported in doi: 10.1038/cr.2017.55 (and not in doi: 10.1038/s41556-019-0361-y), where it was observed through PAR-CLIP analysis that the K171 mutation weakens its binding affinity to m<sup>5</sup>C-modified RNA.

      Our previous experimental approach was not optimal: the protein concentration in the INPUT group was too high, leading to overexposure in the experimental group. Additionally, we did not conduct a quantitative analysis of the results at that time. In response to your suggestion, we performed RNA pull-down experiments with YBX1 and ALYREF, rather than with the pan-DRAM protein, to better validate and reproduce the previously reported findings. Our quantitative analysis revealed that both ALYREF and YBX1 exhibit a stronger affinity for m<sup>5</sup>C-modified RNAs. Furthermore, mutating the key amino acids involved in m<sup>5</sup>C recognition significantly reduced the binding affinity of both readers. These results align with previous studies (doi: 10.1038/cr.2017.55 and doi: 10.1038/s41556-019-0361-y), confirming that ALYREF and YBX1 are specific readers of m<sup>5</sup>C-modified RNAs. However, our detection system has certain limitations. Despite mutation of the critical amino acids, both readers retained a weak binding affinity for m<sup>5</sup>C, suggesting that while the mutation helps reduce false positives, it is still challenging to precisely map the distribution of m<sup>5</sup>C modifications. To address this, we plan to further investigate the protein structure and function to obtain more accurate m<sup>5</sup>C sequencing of the transcriptome in future studies. Accordingly, we have updated our results and conclusions in lines 294-299 and discuss these limitations in lines 109-114.

      In addition, while the m<sup>5</sup>C assay can be performed using the DRAM system alone, comparing it with the DRAM<sup>mut</sup> control enhances the accuracy of m<sup>5</sup>C region detection. To minimize variation in transfection efficiency across experimental groups, we recommend using the same batch of transfections. This approach not only ensures more consistent results but also improves the standardization of the DRAM assay, as discussed in the section added in lines 308-312.

      (2) Using sodium arsenite treatment of cells as a means to change the m<sup>5</sup>C status of transcripts through the downregulation of the two major m<sup>5</sup>C writer proteins NSUN2 and NSUN6 is problematic and the conclusions from these experiments are not warranted. Sodium arsenite is a chemical that poisons every protein containing thiol groups. Not only do NSUN proteins contain cysteines but also the base editor fusion proteins. Arsenite will inactivate these proteins, hence the editing frequency will drop, as observed in the experiments shown in Figure 5, which the authors explain with fewer m<sup>5</sup>C sites to be detected by the fusion proteins.

      The authors have not addressed the point made by this reviewer. Instead the authors state that they have not addressed that possibility. They claim that they have revised the results section, but this reviewer can only see the point raised in the conclusions. An experiment would have been to purify base editors via the HA tag and then perform some kind of binding/editing assay in vitro before and after arsenite treatment of cells.

      We appreciate the reviewer’s insightful comment. We fully agree with the concern raised. In the original manuscript, our intention was to use sodium arsenite treatment to downregulate NSUN-mediated m<sup>5</sup>C levels and subsequently decrease DRAM editing efficiency, with the aim of monitoring m<sup>5</sup>C dynamics through the DRAM system. However, as the reviewer pointed out, sodium arsenite may inactivate both NSUN proteins and the base editor fusion proteins, and any such inactivation would likely result in reduced DRAM editing.

      This confounds the interpretation of our experimental data.

      As demonstrated in Author response image 1A, western blot analysis confirmed that sodium arsenite indeed decreased the expression of fusion proteins. In addition, we attempted in vitro fusion protein purification using multiple fusion tags (HIS, GST, HA, MBP) for DRAM fusion protein expression, but unfortunately, we were unable to obtain purified proteins. However, using the Promega TNT T7 Rapid Coupled In Vitro Transcription/Translation Kit, we successfully purified the DRAM protein (Author response image 1B). Despite this success, subsequent in vitro deamination experiments did not yield the expected mutation results (Author response image 1C), indicating that further optimization is required. This issue is further discussed in lines 314-315.

      Taken together, the above evidence indicates that the sodium arsenite experiment was confounded, and we have therefore removed the corresponding results from the main text of the revised manuscript.

      Author response image 1.

      (3) The authors should move high-confidence editing site data contained in Supplementary Tables 2 and 3 into one of the main Figures to substantiate what is discussed in Figure 4A. However, the data needs to be visualized in another way than Excel format. Furthermore, Supplementary Table 2 does not contain a description of the columns, while Supplementary Table 3 contains a single row with letters and numbers.

      The authors have not addressed the point made by this reviewer. Figure 3F shows the screening process for DRAM-seq assays and principles for screening high-confidence genes rather than the data contained in Supplementary Tables 2 and 3 of the former version of this manuscript.

      Thank you for your valuable suggestion. We have visualized the data from Supplementary Tables 2 and 3 in Figure 4A as a circlize diagram (described in lines 213-216), illustrating the distribution of mutation sites detected by the DRAM system across each chromosome. Additionally, to improve the presentation and clarity of the data, we have revised Supplementary Tables 2 and 3 by adding column descriptions, merging the DRAM-ABE and DRAM-CBE sites, and including overlapping m<sup>5</sup>C genes from previous datasets.

      Responses to Reviewer’s Comments:  

      To Reviewer #3:

      The authors have again tried to address the former concern by this reviewer who questioned the specificity of both m<sup>5</sup>C reader proteins towards modified RNA rather than unmodified RNA. The authors chose to do RNA pull down experiments which serve as a proxy for proving the specificity of ALYREF and YBX1 for m<sup>5</sup>C modified RNAs. Even though this reviewer asked for determining the enrichment factor of the reader-base editor fusion proteins (as wildtype or mutant for the identified m<sup>5</sup>C specificity motif) when presented with m<sup>5</sup>C-modified RNAs, the authors chose to use both reader proteins alone (without the fusion to an editor) as wildtype and as respective m<sup>5</sup>C-binding mutant in RNA in vitro pull-down experiments along with unmodified and m<sup>5</sup>C-modified RNA oligomers as binding substrates. The quantification of these pull-down experiments (n=2) have now been added, and are revealing that (according to SFigure 1 E and G) YBX1 enriches an RNA containing a single m<sup>5</sup>C by a factor of 1.3 over its unmodified counterpart, while ALYREF enriches by a factor of 4x. This is an acceptable approach for educated readers to question the specificity of the reader proteins, even though the quantification should be performed differently (see below).

      Given that there is no specific sequence motif embedding those cytosines identified in the vicinity of the DRAM-edits (Figure 3J and K), even though it has been accepted by now that most of the m<sup>5</sup>C sites in mRNA are mediated by NSUN2 and NSUN6 proteins, which target tRNA like substrate structures with a particular sequence enrichment, one can conclude that DRAM-Seq is uncovering a huge number of false positives. This must be so not only because of the RNA bisulfite seq data that have been extensively studied by others, but also by the following calculations: Given that the m<sup>5</sup>C/C ratio in human mRNA is 0.02-0.09% (measured by mass spec) and assuming that 1/4 of the nucleotides in an average mRNA are cytosines, an mRNA of 1.000 nucleotides would contain 250 Cs. 0.02- 0.09% m<sup>5</sup>C/C would then translate into 0.05-0.225 methylated cytosines per 250 Cs in a 1000 nt mRNA. YBX1 would bind every C in such an mRNA since there is no m<sup>5</sup>C to be expected, which it could bind with 1.3 higher affinity. Even if the mRNAs would be 10.000 nt long, YBX1 would bind to half a methylated cytosine or 2.25 methylated cytosines with 1.3x higher affinity than to all the remaining cytosines (2499.5 to 2497.75 of 2.500 cytosines in 10.000 nt, respectively). These numbers indicate a 4999x to 1110x excess of cytosine over m<sup>5</sup>C in any substrate RNA, which the "reader" can bind as shown in the RNA pull-downs on unmodified RNAs. This reviewer spares the reader of this review the calculations for ALYREF specificity, which is slightly higher than YBX1. Hence, it is up to the capable reader of these calculations to follow the claim that this minor affinity difference allows the unambiguous detection of the few m<sup>5</sup>C sites in mRNA be it in the endogenous scenario of a cell or as fusion-protein with a base editor attached? 
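
      The reviewer's back-of-the-envelope figures above can be reproduced with a few lines of arithmetic. This is only a sketch of that calculation: the function name is hypothetical, the 25% cytosine content is the reviewer's stated assumption, and the m<sup>5</sup>C/C bounds of 0.02% and 0.09% are the mass-spec values quoted in the comment.

```python
def m5c_expectation(length_nt, m5c_per_c, c_fraction=0.25):
    """Expected cytosine counts in one mRNA of the given length.

    Returns (total C, methylated C, unmodified C, unmodified/methylated excess).
    c_fraction=0.25 follows the reviewer's assumption of equal base usage.
    """
    cytosines = length_nt * c_fraction
    methylated = cytosines * m5c_per_c
    unmodified = cytosines - methylated
    return cytosines, methylated, unmodified, unmodified / methylated

for length_nt in (1_000, 10_000):
    for m5c_per_c in (0.0002, 0.0009):  # 0.02% and 0.09% m5C/C
        total, meth, unmod, excess = m5c_expectation(length_nt, m5c_per_c)
        print(f"{length_nt} nt, m5C/C={m5c_per_c:.2%}: "
              f"{total:.0f} C, {meth:.3f} m5C, {unmod:.2f} unmodified C, "
              f"excess ~{excess:.0f}x")
```

      The excess of unmodified over methylated cytosines depends only on the m<sup>5</sup>C/C ratio, not on transcript length, which is why the same 4999x and roughly 1110x figures apply to both the 1,000 nt and the 10,000 nt case.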

      We sincerely appreciate the reviewer’s rigorous analysis. We would like to clarify that in our RNA pulldown assays, we indeed utilized the full DRAM system (reader protein fused to the base editor) to reflect the specificity of m<sup>5</sup>C recognition. As previously suggested by the reviewer, to independently validate the m<sup>5</sup>C-binding specificity of ALYREF and YBX1, we performed separate pulldown experiments with wild-type and mutant reader proteins (without the base editor fusion) using both unmodified and m<sup>5</sup>C-modified RNA substrates. This approach aligns with established methodologies in the field (doi:10.1038/cr.2017.55 and doi: 10.1038/s41556-019-0361-y). We have revised the Methods section (line 230) to explicitly describe this experimental design.

      Although the m<sup>5</sup>C/C ratios in LC/MS-assayed mRNA are relatively low (ranging from 0.02% to 0.09%), as noted by the reviewer, both our data and previous studies have demonstrated that ALYREF and YBX1 preferentially bind to m<sup>5</sup>C-modified RNAs over unmodified RNAs, exhibiting 4-fold and 1.3-fold enrichment, respectively (Supplementary Figure 1E–1G). Importantly, this specificity is further enhanced in the DRAM system through two key mechanisms: first, the fusion of reader proteins to the deaminase restricts editing to regions near m<sup>5</sup>C sites, thereby minimizing off-target effects; second, background editing observed in reader-mutant or deaminase controls (e.g., DRAM<sup>mut</sup>-CBE in Figure 2D) is systematically corrected for during data analysis.

      We acknowledge the theoretical challenge posed by the vast excess of unmodified cytosines. However, our approach includes stringent controls to alleviate this issue. Specifically, sites identified in NSUN2/NSUN6 knockout cells or reader-mutant controls are excluded (Figure 3F), which significantly reduces the number of false-positive detections. Additionally, we have observed deamination changes near high-confidence m<sup>5</sup>C methylation sites detected by RNA bisulfite sequencing, both in first-generation and high-throughput sequencing data. This observation further substantiates the validity of DRAM-Seq in accurately identifying m<sup>5</sup>C sites.

      We fully acknowledge that residual false positives may persist due to the inherent limitations of reader protein specificity, as discussed in lines 299-301 of our manuscript. To address this, we plan to optimize reader domains with enhanced m<sup>5</sup>C binding (e.g., through structure-guided engineering), as also noted in the Discussion of the manuscript.

      The reviewer supports the attempt to visualize the data. However, the usefulness of this Figure addition as a readable presentation of the data included in the supplement is up to debate.

      Thank you for your kind suggestion. We understand the reviewer's concern regarding data visualization. However, due to the large volume of DRAM-seq data, it is challenging to present each mutation site and its characteristics clearly in a single figure. Therefore, we chose to categorize the data by chromosome, which not only allows for a more organized presentation of the DRAM-seq data but also facilitates comparison with other database entries. Additionally, we have updated Supplementary Tables 2 and 3 to provide comprehensive information on the mutation sites. We hope that both the reviewer and editors will understand this approach. We will, of course, continue to carefully consider the reviewer's suggestions and explore better ways to present these results in the future.

      (3) A set of private Recommendations for the Authors that outline how you think the science and its presentation could be strengthened

      NEW COMMENTS to TEXT:

      Abstract:

      "5-Methylcytosine (m<sup>5</sup>C) is one of the major post-transcriptional modifications in mRNA and is highly involved in the pathogenesis of various diseases."

      In light of the increasing use of AI-based writing, and the proof that neither DeepSeek nor ChatGPT writes truthful statements if they collect metadata from scientific abstracts, this sentence is utterly misleading.

      m<sup>5</sup>C is not one of the major post-transcriptional modifications in mRNA as it is only present with a m<sup>5</sup>C/C ratio of 0.02- 0.09% as measured by mass-spec. Also, if m<sup>5</sup>C is involved in the pathogenesis of various diseases, it is not through mRNA but tRNA. No single published work has shown that a single m<sup>5</sup>C on an mRNA has anything to do with disease. Every conclusion that is perpetuated by copying the false statements given in the many reviews on the subject is based on knock-out phenotypes of the involved writer proteins. This reviewer wishes that the authors would abstain from the common practice that is currently flooding any scientific field through relentless repetitions in the increasing volume of literature which perpetuate alternative facts.

      We sincerely appreciate the reviewer’s insightful comments. While we acknowledge that m<sup>5</sup>C is not the most abundant post-transcriptional modification in mRNA, we believe that research into m<sup>5</sup>C modification holds considerable value. Numerous studies have highlighted its role in regulating gene expression and its potential contribution to disease progression. For example, recent publications have demonstrated that m<sup>5</sup>C modifications in mRNA can influence cancer progression, lipid metabolism, and other pathological processes (e.g., PMID: 37845385; 39013911; 39924557; 38042059; 37870216).

      We fully agree with the reviewer on the importance of maintaining scientific rigor in academic writing. While m<sup>5</sup>C is not the most abundant RNA modification, we cannot simply draw a conclusion that the level of modification should be the sole criterion for assessing its biological significance. However, to avoid potential confusion, we have removed the word “major”.

      COMMENTS ON FIGURE PRESENTATION:

      Figure 2D:

      The main text states: "DRAM-CBE induced C to U editing in the vicinity of the m<sup>5</sup>C site in AP5Z1 mRNA, with 13.6% C-to-U editing, while this effect was significantly reduced with APOBEC1 or DRAM<sup>mut</sup>-CBE (Fig.2D)." The Figure does not fit this statement. The seq trace shows a U signal of about 1/3 of that of C (about 30%), while the quantification shows 20+ percent

      Thank you for your kind suggestion. Upon visual evaluation, the sequencing trace in the figure appears to suggest a mutation rate closer to 30% rather than 22%. However, relying solely on the visual interpretation of sequencing peaks is not a rigorous approach. The trace on the left represents the visualization of Sanger sequencing results using SnapGene, while the quantification on the right is derived from EditR 1.0.10 software analysis of three independent biological replicates. The C-to-U mutation rates calculated were 22.91667%, 23.23232%, and 21.05263%, respectively. To further validate this, we have included the original EditR analysis of the Sanger sequencing results for the DRAM-CBE group used in the left panel of Figure 2D (see Author response image 2). This analysis confirms an m<sup>5</sup>C fraction of 22/(22+74) × 100 = 22.91667%, and the sequencing trace aligns well with the mutation rate we reported in Figure 2D. In conclusion, the data and conclusions presented in Figure 2D are consistent and supported by the quantitative analysis.
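
      The percentage quoted above follows from simple arithmetic, sketched here for reference. The function name is hypothetical, and reading the values 22 and 74 as the edited and unedited signal at the target position is an assumption based on the expression given in the response; the three replicate values are those listed in the text.

```python
def edit_percent(edited_signal, unedited_signal):
    """Edited signal as a percentage of the combined signal at one position."""
    return 100.0 * edited_signal / (edited_signal + unedited_signal)

# Values quoted in the response for the replicate shown in Figure 2D.
single_replicate = edit_percent(22, 74)

# C-to-U rates reported for the three biological replicates.
replicates = [22.91667, 23.23232, 21.05263]
mean_rate = sum(replicates) / len(replicates)

print(f"single replicate: {single_replicate:.5f}%")
print(f"mean of three replicates: {mean_rate:.5f}%")
```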

      Author response image 2.

      Figure 4B: shows now different numbers in Venn-diagrams than in the same depiction, formerly Figure 4A

      We sincerely thank the reviewer for pointing out this issue, and we apologize for not clearly indicating the changes in the previous version of the manuscript. In response to the initial round of reviewer comments, we implemented a more stringent data filtering process (as described in Figure 3F and the Methods section): "For high-confidence filtering, we further adjusted the parameters of Find_edit_site.pl to include an edit ratio of 10%–60%, a requirement that the edit ratio in control samples be at least 2-fold higher than in NSUN2 or NSUN6-knockout samples, and at least 4 editing events at a given site." As a result, we made minor adjustments to the Venn diagram data in Figure 4A, reducing the total number of DRAM-edited mRNAs from 11,977 to 10,835. These changes were consistently applied throughout the manuscript, and the modifications have been highlighted for clarity. Importantly, these adjustments do not affect any of the conclusions presented in the manuscript.
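
      The three high-confidence criteria quoted above can be expressed as a simple predicate. This is an illustrative sketch only and is not the authors' Find_edit_site.pl script: the `Site` record, its field names, and the choice to compare against a single knockout edit ratio are assumptions, and "at least 2-fold higher" is read here as control ratio >= 2 x knockout ratio.

```python
from dataclasses import dataclass

@dataclass
class Site:
    """Hypothetical per-site summary; field names are illustrative."""
    edited_reads: int           # editing events in the control (wild-type) sample
    total_reads: int            # read depth in the control sample
    knockout_edit_ratio: float  # edit ratio at the same site in the knockout sample

def is_high_confidence(site, min_ratio=0.10, max_ratio=0.60,
                       min_fold=2.0, min_events=4):
    """Apply the three quoted criteria: edit ratio window, fold over knockout, event count."""
    if site.total_reads == 0:
        return False
    ratio = site.edited_reads / site.total_reads
    if not (min_ratio <= ratio <= max_ratio):
        return False
    if site.edited_reads < min_events:
        return False
    # A knockout ratio of 0 trivially satisfies the fold requirement.
    return ratio >= min_fold * site.knockout_edit_ratio

sites = [
    Site(edited_reads=10, total_reads=50, knockout_edit_ratio=0.05),  # passes
    Site(edited_reads=3, total_reads=20, knockout_edit_ratio=0.0),    # too few events
    Site(edited_reads=40, total_reads=50, knockout_edit_ratio=0.0),   # ratio above 60%
    Site(edited_reads=10, total_reads=50, knockout_edit_ratio=0.15),  # under 2-fold over knockout
]
kept = [s for s in sites if is_high_confidence(s)]
print(len(kept))
```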

      Figure 4B and D: while the overlap of the DRAM-Seq data with RNA bisulfite data might be 80% or 92%, it is obvious that the remaining DRAM-Seq data suggest a detection of additional sites of around 97% or 81.83%. It would be advised to mention this large number of additional sites as potential false positives, unless these data were normalized to the sites that can be allocated to NSUN2 and NSUN6 activity (NSUN mutant data sets could be subtracted).

      Thank you for pointing this out. The Venn diagrams presented in Figure 4B and D already reflect the exclusion of potential false-positive sites identified in methyltransferase-deficient datasets, as described in our experimental filtering process, and they represent the remaining sites after this stringent filtering. However, we acknowledge that YBX1 and ALYREF, while preferentially binding to m<sup>5</sup>C-modified RNA, also exhibit some affinity for unmodified RNA. Although we employed rigorous controls, including DRAM<sup>mut</sup> and deaminase groups, to minimize false positives, the possibility of residual false positives cannot be entirely ruled out. Addressing this limitation would require even more stringent filtering methods, as discussed in lines 299–301 of the manuscript. We are committed to further optimizing the DRAM system to enhance the accuracy of transcriptome-wide m<sup>5</sup>C analysis in future studies.

      SFigure 1: It is clear that the wild-type versions of both reader proteins bind robustly to RNA that does not contain m<sup>5</sup>C. As for the calculations of x-fold affinity loss of RNA binding using both ALYREF-mut or YBX1-mut, this reviewer asks the authors to determine how much less the mutated versions of the proteins bind to m<sup>5</sup>C-modified RNAs. Hence, a comparison of YBX1 versus YBX1-mut (ALYREF versus ALYREF-mut) on the same substrate RNA with the same m<sup>5</sup>C-modified position would allow determining the contribution of the so-called modification binding pocket in the respective proteins to their RNA binding. The way the authors chose to show the data presently is misleading because what is compared is the binding of either the wild type or the mutant protein to different RNAs.

      We appreciate the reviewer’s valuable feedback and apologize for any confusion caused by the presentation of our data. We would like to clarify the rationale behind our approach. The decision to present the wild-type and mutant reader proteins in separate panels, rather than together, was made in response to comments from Reviewer 2. Below, we provide a detailed explanation of our experimental design and its justification.

      First, we confirmed that YBX1 and ALYREF exhibit stronger binding affinity to m<sup>5</sup>C-modified RNA compared to unmodified RNA, establishing their role as m<sup>5</sup>C reader proteins. Next, to validate the functional significance of the DRAM<sup>mut</sup> group, we demonstrated that mutating key amino acids in the m<sup>5</sup>C-binding pocket significantly reduces the binding affinity of YBX1<sup>mut</sup> and ALYREF<sup>mut</sup> to m<sup>5</sup>C-modified RNA. This confirms that the DRAM<sup>mut</sup> group effectively minimizes false-positive results by disrupting specific m<sup>5</sup>C interactions.

      Crucially, in our pull-down experiments, both the wild-type and mutant proteins (YBX1/YBX1<sup>mut</sup> and ALYREF/ALYREF<sup>mut</sup>) were incubated with the same RNA sequences. To avoid any ambiguity, we have included the specific RNA sequence information in the Methods section (lines 463–468). This allows a direct assessment of the reduced binding affinity of the mutant versions relative to the wild-type proteins, even though they are presented in separate panels.

      We hope this explanation clarifies our approach and demonstrates the robustness of our findings. We sincerely appreciate the reviewer’s understanding and hope this addresses their concerns.

      SFigure 2C: first two panels are duplicates of the same image.

      Thank you for pointing this out. We sincerely apologize for incorrectly duplicating the images. We have now updated Supplementary Figure 2C with the correct panels and have provided the original flow cytometry data for the first two images. It is important to note that, as demonstrated by the original data analysis, the EGFP-positive quantification values (59.78% and 59.74%) remain accurate. Therefore, this correction does not affect the conclusions of our study. Thank you again for bringing this to our attention.

      Author response image 3.

      SFigure 4B: how would the PCR product for NSUN6 be indicative of a mutation? The used primers seem to amplify the wildtype sequence.

      Thank you for your kind suggestion. In our NSUN6<sup>-/-</sup> cell line, the NSUN6 gene is missing only a single base pair (1 bp) compared to the wildtype, which results in a frameshift mutation and a reduction in NSUN6 protein expression. We fully agree with the reviewer that the current PCR gel electrophoresis does not clearly distinguish this 1 bp deletion. To better illustrate our experimental design, we have included a schematic representation of the knockout sequence in SFigure 4B. Additionally, we have provided the original sequencing data, and the corresponding details have been added to lines 151-153 of the manuscript for further clarification.

      Author response image 4.

      SFigure 4C: the Figure legend is insufficient to understand the subfigure.

      Thank you for your valuable suggestion. To improve clarity, we have revised the figure legend for SFigure 4C, as well as the corresponding text in lines 178-179. We have additionally updated the title of SFigure 4 for better clarity. The updated SFigure 4C now demonstrates that the DRAM-edited mRNAs exhibit a high degree of overlap across the three biological replicates.

      SFigure 4D: the Figure legend is insufficient to understand the subfigure.

      Thank you for your kind suggestion. We have revised the figure legend to provide a clearer explanation of the subfigure. Specifically, this figure illustrates the motif analysis derived from sequences spanning 10 nucleotides upstream and downstream of DRAM-edited sites mediated by loci associated with NSUN2 or NSUN6. To enhance clarity, we have also rephrased the relevant results section (lines 169-175) and the corresponding discussion (lines 304-307).

      SFigure 7: There is something off with all 6 panels. This reviewer can find data points in each panel that do not show up on the other two panels even though this is a pairwise comparison of three data sets (file was sent to the Editor) Available at https://elife-rp.msubmit.net/elife-rp_files/2025/01/22/00130809/02/130809_2_attach_27_15153.pdf

      Response: We thank the reviewer for pointing this out. We would like to clarify the methodology behind this analysis. In this study, we conducted pairwise comparisons of the number of DRAM-edited sites per gene across three biological replicates of DRAM-ABE or DRAM-CBE, visualized as scatterplots. Each data point in the plots corresponds to a gene, and while the same gene is represented in all three panels, its position may vary vertically or horizontally across the panels. This variation arises because the number of mutation sites typically differs between replicates, making it unlikely for a data point to occupy the exact same position in all panels. A similar analytical approach has been used in previous studies on m6A (PMID: 31548708). To address the reviewer’s concern, we have annotated the corresponding positions of the questioned data points with arrows in Author response image 5.

      Author response image 5.

    1. Author response:

      The following is the authors’ response to the previous reviews.

      Public Reviews:

      Reviewer #1 (Public Review):

      By identifying a loss-of-function mutant of IQCH in an infertile patient, Ruan et al. show that IQCH is essential for spermiogenesis by generating a knockout mouse model of IQCH. Similar to the infertile patient with mutant IQCH, Iqch knockout mice are characterized by a cracked flagellar axoneme and abnormal mitochondrial structure. Mechanistically, IQCH regulates the expression of RNA-binding proteins (especially HNRPAB), which are indispensable for spermatogenesis.

      Although this manuscript contains a potentially interesting piece of work that delineates a mechanism of IQCH that associates with spermatogenesis, this reviewer feels that a number of issues require clarification and re-evaluation for a better understanding of the role of IQCH in spermatogenesis.

      Line 251 - 253, "To elucidate the molecular mechanism by which IQCH regulates male fertility, we performed liquid chromatography tandem mass spectrometry (LC‒MS/MS) analysis using mouse sperm lysates and detected 288 interactors of IQCH (Figure 5-source data 1)."

      The reviewer had already raised significant concerns regarding the text above, noting that "LC‒MS/MS analysis using mouse sperm lysates" would not identify interactors of IQCH. However, this issue was not addressed in the revised manuscript. In the Methods section detailing LC-MS/MS, the authors stated that it was conducted on "eluates obtained from IP". However, there was no explanation provided on how IP for LC-MS/MS was performed. Additionally, it was unclear whether LC-MS or LC-MS/MS was utilized. The primary concern is that if LC‒MS/MS was conducted for the IP of IQCH, IQCH itself should have been detected in the results; however, as indicated by Figure 5-source data 1, IQCH was not listed.

      We thank the reviewer for these comments. Additional details regarding the IP protocol for LC-MS/MS analysis have been included in the methods section of the revised manuscript. Furthermore, we apologize for the previous inconsistencies in the terminology used for LC-MS/MS and have now ensured its consistent usage throughout the document. Regarding the primary concern about the absence of IQCH in Figure 5-source data 1, that file listed only the proteins identified as interacting with IQCH, not IQCH itself. IQCH itself was in fact identified by the LC-MS/MS analysis (Author response table 1). Additionally, we conducted co-IP experiments to validate the interactions identified by LC-MS/MS analysis.

      Author response table 1.

      Results of the LC-MS/MS analysis.

      Recommendations for the authors:

      Reviewer #1 (Recommendations For The Authors):

      The authors should know what experiments have been done for the studies.

      We apologize for our oversights. The method for RNA-binding protein immunoprecipitation (RIP) has been detailed in the revised manuscript.

      Typos still remain in the text, e.g., line 253, "Fiugre".

      We are sorry for the spelling errors. We have engaged professional editing services to refine our manuscript.

    1. Author response:

      The following is the authors’ response to the previous reviews.

      Public Review:

      This manuscript by Yue et al. aims to understand the molecular mechanisms underlying the better reproductive outcomes of Tibetans at high altitude by characterizing the transcriptome and histology of full-term placenta of Tibetans and comparing them to those of Han Chinese at high elevations.

      The approach is innovative, and the data collected are valuable for testing hypotheses regarding the contribution of the placenta to better reproductive success of populations that adapted to hypoxia. The authors identified hundreds of differentially expressed genes (DEGs) between Tibetans and Han, including the EPAS1 gene that harbors the strongest signals of genetic adaptation. The authors also found that such differential expression is more prevalent and pronounced in the placentas of male fetuses than those of female fetuses, which is particularly interesting, as it echoes with the more severe reduction in birth weight of male neonates at high elevation observed by the same group of researchers (He et al., 2022).

      This revised manuscript addressed several concerns raised by reviewers in last round. However, we still find the evidence for natural selection on the identified DEGs--as a group--to be very weak, despite more convincing evidence on a few individual genes, such as EPAS1 and EGLN1.

      The authors first examined the overlap between DEGs and genes showing signals of positive selection in Tibetans and evaluated the significance of a larger overlap than expected with a permutation analysis. A minor issue related to this analysis is that the p-value is inflated, as the authors are counting permutation replicates with MORE genes in overlap than observed, yet the more appropriate way is counting replicates with EQUAL or MORE overlapping genes. Using the latter method of p-value calculation, the "sex-combined" and "female-only" DEGs will become non-significantly enriched in genes with evidence of selection, and the signal appears to solely come from male-specific DEGs. A thornier issue with this type of enrichment analysis is whether the condition on placental expression is sufficient, as other genomic or transcriptomic features (e.g., expression level, local sequence divergence level) may also confound the analysis.

      Following the suggested method, we counted the replicates with equal or more overlapping genes than observed (≥4 for the “combined” set; ≥9 for the “male-only” set; ≥0 for the “female-only” set). We found that the overlaps between DEGs and TSNGs were significantly enriched only in the “male-only” set (p-value < 1e-4, 0 out of 10,000 permutations), but not in the “female-only” set (p-value = 1, 10,000 out of 10,000 permutations) or the “combined” set (p-value = 0.0603, 603 out of 10,000 permutations) (see Table R1 below).
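
      The counting rule discussed above can be sketched as follows. This is a minimal illustration, not the authors' actual pipeline: the function names, the gene-sampling scheme, and the optional add-one variant are assumptions introduced here. The key point is that the empirical p-value counts permutation replicates whose overlap is equal to or greater than the observed overlap.

```python
import random


def empirical_p_value(observed, null_counts, add_one=False):
    """Fraction of permutation replicates with overlap >= observed.

    add_one=True applies the common (hits + 1) / (n + 1) convention,
    which avoids reporting a p-value of exactly zero. That variant is
    an assumption here and is not described in the response.
    """
    hits = sum(1 for count in null_counts if count >= observed)
    n = len(null_counts)
    if add_one:
        return (hits + 1) / (n + 1)
    return hits / n


def permutation_overlap_counts(background, n_degs, selected, n_perm, seed=0):
    """Hypothetical null distribution: draw n_degs genes at random from
    the background (e.g. placenta-expressed genes) and count how many
    fall in the set of genes with selection signals."""
    rng = random.Random(seed)
    background = list(background)
    selected = set(selected)
    return [
        len(selected.intersection(rng.sample(background, n_degs)))
        for _ in range(n_perm)
    ]


# With an observed overlap of 0, every replicate satisfies ">= 0",
# so the p-value is 1 regardless of the null distribution.
p_zero_overlap = empirical_p_value(0, [0, 1, 2, 0, 3])
```

      Under this rule, 603 qualifying replicates out of 10,000 give p = 0.0603, and an observed overlap of 0 necessarily gives p = 1, which matches the values reported for the “combined” and “female-only” sets.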

      We updated this information in the revised manuscript, including Results, Methods, and Figure S9.

      Author response table 1.

      Permutation analysis of the overlapped genes between DEGs and TSNGs.

      The authors next aimed to detect polygenic signals of adaptation of gene expression by applying the PolyGraph method to eQTLs of genes expressed in the placenta (Racimo et al 2018). This approach is ambitious but problematic, as the method is designed for testing evidence of selection on single polygenic traits. The expression levels of different genes should be considered as "different traits" with differential impacts on downstream phenotypic traits (such as birth weight). As a result, the eQTLs of different genes cannot be naively aggregated in the calculation of the polygenic score, unless the authors have a specific, oversimplified hypothesis that the expression increase of all genes with identified eQTL will improve pregnancy outcome and that they are equally important to downstream phenotypes. In general, PolyGraph method is inapplicable to eQTL data, especially those of different genes (but see Colbran et al 2023 Genetics for an example where the polygenic score is used for testing selection on the expression of individual genes).

      We would recommend removal of these analyses and focus on the discussion of individual genes with more compelling evidence of selection (e.g., EPAS1, EGLN1).

      According to the suggestion, we removed these analyses in the revised manuscript.

    1. Author Response

      The following is the authors’ response to the previous reviews.

      Public Reviews:

      Reviewer #1 (Public Review):

      In this paper, the authors developed an image analysis pipeline to automatically identify individual neurons within a population of fluorescently tagged neurons. This application is optimized to deal with multi-cell analysis and builds on a previous software version, developed by the same team, to resolve individual neurons from whole-brain imaging stacks. Using advanced statistical approaches and several heuristics tailored for C. elegans anatomy, the method successfully identifies individual neurons with a fairly high accuracy. Thus, while specific to C. elegans, this method can become instrumental for a variety of research directions such as in-vivo single-cell gene expression analysis and calcium-based neural activity studies.

      Thank you.

      Reviewer #2 (Public Review):

      The authors succeed in generalizing the pre-alignment procedure for their cell identification method to allow it to work effectively on data with only small subsets of cells labeled. They convincingly show that their extension accurately identifies head angle, based on finding auto fluorescent tissue and looking for a symmetric l/r axis. They demonstrate that the method works to allow the identification of a particular subset of neurons. Their approach should be a useful one for researchers wishing to identify subsets of head neurons in C. elegans, and the ideas might be useful elsewhere.

      The authors also assess the relative usefulness of several atlases for making identity predictions. They attempt to give some additional general insights on what makes a good atlas, but here insights seem less clear as available data does not allow for experiments that cleanly decouple: 1. the number of examples in the atlas, 2. the completeness of the atlas, and 3. the match in strain and imaging modality discussed. In the presented experiments the custom atlas, besides the strain and imaging modality mismatches discussed, is also the only complete atlas with more than one example. The neuroPAL atlas is an imperfect stand-in, since a significant fraction of cells could not be identified in these data sets, making it a 60/40 mix of OpenWorm and a hypothetical perfect neuroPAL comparison. This waters down general insights since it is unclear if the performance is driven by strain/imaging modality or these difficulties creating a complete neuroPAL atlas. The experiments do usefully explore the volume of data needed. Though generalization remains to be shown, the insight is useful for future atlas building that for the specific (small) set of cells labeled in the experiments 5-10 examples is sufficient to build an accurate atlas.

      The reviewer brings up an interesting point. As the reviewer noted, given the imperfection of the datasets (ours and others’), it is possible that artifacts from incomplete atlases can interfere with the assessment of the performances of different atlases. To address this, as the reviewer suggested, we have searched the literature and found two sets of data that give specific coordinates of identified neurons (both using NeuroPAL). We compared the performance of the atlases derived from these datasets to the strain-specific atlases, and the original conclusion stands. Details are now included in the revised manuscript (Figure 3- figure supplement 2).

      Recommendations for the authors:

      Reviewer #1 (Recommendations For The Authors):

      I appreciate the new mosaic analysis (Fig. 3 -figure suppl 2). Please fix the y-axis tick label that I believe should be 0.8 (instead of 0.9).

      We thank the reviewer for spotting the typo. We have fixed the error.

      Reviewer #2 (Recommendations For The Authors):

      Though I'm not familiar with the exact quality of GT labels in available neuroPAL data I know increasing volumes of published data is available. Comparison with a complete neuroPAL atlas, and a similar assessment on atlas size as made with the custom atlas would to my mind qualitatively increase the general insights on atlas construction.

      We thank the reviewer for the insightful suggestion. We have constructed several additional NeuroPAL atlases by incorporating neuron positional data from two other published datasets: [Yemini E. et al. NeuroPAL: A Multicolor Atlas for Whole-Brain Neuronal Identification in C. elegans. Cell. 2021 Jan 7;184(1):272-288.e11] and [Skuhersky, M. et al. Toward a more accurate 3D atlas of C. elegans neurons. BMC Bioinformatics 23, 195 (2022)].

      Interestingly, we found that the two new atlases (NP-Yemini and NP-Skuhersky) have significantly different values of PA, LR, DV, and angle relationships for certain cells compared to the OpenWorm and glr-1 atlases. For example, in both the NP atlases, SMDD is labeled as being anterior to AIB, which is the opposite of the SMDD-AIB relationship in the glr-1 atlas.

      Because this relationship (and other similar cases) was missing in our original NeuroPAL atlas (NP-Chaudhary), the addition of these two NeuroPAL datasets to our NeuroPAL atlas dramatically changed the atlas. As a result, incorporating the published data sets into the NeuroPAL atlas (NP-all) actually decreased the average prediction accuracy to 44%, while the average accuracy of the original NeuroPAL atlas (NP-Chaudhary) was 57%. The atlas based on the Yemini et al. data alone (NP-Yemini) had 43% accuracy, and the atlas based on the Skuhersky et al. data alone (NP-Skuhersky) had 38% accuracy.

      For the rest of our analysis, we focused on comparing the NeuroPAL atlas that resulted in the highest accuracy against other atlases in figure 3 (NP-Chaudhary). Therefore, we have added Figure 3- figure supplement 2 and the following sentence in the discussion. “Several other NeuroPAL atlases from different data sources were considered, and the atlas that resulted in the highest neuron ID correspondence was selected (Figure 3- figure supplement 2).”

      Author response image 1.

      Figure 3- figure supplement 2. Comparison of neuron ID correspondences resulting from additional atlases- atlases driven from NeuroPAL neuron positional data from multiple sources (Chaudhary et al., Yemini et al., and Skuhersky et al.) in red compared to other atlases in Figure 3. Two sample t-tests were performed for statistical analysis. The asterisk symbol denotes a significance level of p<0.05, and n.s. denotes no significance. OW: atlas driven by data from OpenWorm project, NP-source: NeuroPAL atlas driven by data from the source. NP-Chaudhary atlas corresponds to NeuroPAL atlas in Figure 3.

      80% agreement among manual identifications seems low to me for a relatively small, (mostly) known set of cells, which seems to cast into doubt ground truth identities based on a best 2 out of 3 vote. The authors mention 3% of cell identities had total disagreement and were excluded, what were the fraction unanimous and 2/3? Are there any further insights about what limited human performance in the context of this particular identification task?

      We closely looked into the manual annotation data. The fraction of cells in unanimous, two thirds, and no agreement are approximately 74%, 20%, and 6%, respectively. We made the corresponding change in the manuscript from 3% to 6%. Indeed, we identified certain patterns in labels that were more likely to be disagreed upon. First, cells in close proximity to each other, such as AVE and RMD, were often switched from annotator to annotator. Second, cells in the posterior part of the cluster, such as RIM, AVD, AVB, were more variable in positions, so their identities were not clear at times. Third, annotators were more likely to disagree on cells whose expressions are rare and low, and these include AIB, AVJ, and M1. These observations agree with our results in figure 4c.
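
      The best-2-out-of-3 vote and the agreement breakdown discussed above can be sketched as follows. This is an illustrative sketch only: the function names are hypothetical, the example votes are made up for demonstration, and the code is not taken from the CRF_ID software.

```python
from collections import Counter


def consensus_label(votes):
    """Return the label chosen by a strict majority of annotators,
    or None when no label has a majority (the cell is excluded)."""
    label, count = Counter(votes).most_common(1)[0]
    return label if count > len(votes) / 2 else None


def agreement_breakdown(all_votes):
    """Fraction of cells with unanimous, majority-only, and no agreement."""
    tally = Counter()
    for votes in all_votes:
        top_count = Counter(votes).most_common(1)[0][1]
        if top_count == len(votes):
            tally["unanimous"] += 1
        elif top_count > len(votes) / 2:
            tally["majority"] += 1
        else:
            tally["none"] += 1
    n_cells = len(all_votes)
    return {key: tally[key] / n_cells for key in ("unanimous", "majority", "none")}


# Made-up votes from three annotators for four cells (illustration only).
example_votes = [
    ("AVE", "AVE", "AVE"),  # unanimous
    ("AVE", "RMD", "AVE"),  # two out of three agree
    ("RIM", "AVD", "AVB"),  # no agreement, so the cell is excluded
    ("AIB", "AIB", "AIB"),  # unanimous
]
breakdown = agreement_breakdown(example_votes)
```

      Applied to the real annotation table, a breakdown of this kind would yield the three fractions reported above (approximately 74%, 20%, and 6%).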

    2. Author Response

      The following is the authors’ response to the original reviews.

      eLife assessment

      This research advance article describes a valuable image analysis method to identify individual cells (neurons) within a population of fluorescently labeled cells in the nematode C. elegans. The findings are solid and the method succeeds in identifying cells with high precision. The method will be valuable to the C. elegans research community.

      Public Reviews:

      Reviewer #1 (Public Review):

      In this paper, the authors developed an image analysis pipeline to automatically identify individual neurons within a population of fluorescently tagged neurons. This application is optimized to deal with multi-cell analysis and builds on a previous software version, developed by the same team, to resolve individual neurons from whole-brain imaging stacks. Using advanced statistical approaches and several heuristics tailored for C. elegans anatomy, the method successfully identifies individual neurons with a fairly high accuracy. Thus, while specific to C. elegans, this method can become instrumental for a variety of research directions such as in-vivo single-cell gene expression analysis and calcium-based neural activity studies.

      The analysis procedure depends on the availability of an accurate atlas that serves as a reference map for neural positions. Thus, when imaging a new reporter line without fair prior knowledge of the tagged cells, such an atlas may be very difficult to construct. Moreover, usage of available reference atlases, constructed based on other databases, is not very helpful (as shown by the authors in Fig 3), so for each new reporter line a de-novo atlas needs to be constructed.

      We thank the reviewer for pointing out a place where we can use some clarification. While in principle every new reporter line would need fair prior knowledge, atlases are either already available or not difficult to construct. If one can make the assumption that the anatomy of a particular line is similar to existing atlases (Yemini 2021, Nejatbakhsh 2023, Toyoshima 2020), the cell ID can be immediately performed. Even in the case that one suspects the anatomy may have changed from existing atlases (e.g. in the case of examining mutants), existing atlases can serve as a starting point to provide a draft ID, which facilitates manual annotation. Once manual annotations on ~5 animals are available, as we have shown in this work (which is a manageable number in practice), this new dataset can be used to build an updated atlas that can be used for future inferences. We have added this discussion in the manuscript: “If one determines that the anatomy of a particular animal strain is substantially different from existing atlases, new atlases can be easily constructed using existing atlases as starting points.” (page 18).

      I have a few comments that may help to better understand the potential of the tool to become handy.

      1. I wonder about the degree to which strain mosaicism affects the analysis (Figs 1-4) as it was performed on a non-integrated reporter strain. As stated, for constructing the reference atlas, the authors used worms in which they could identify the complete set of tagged neurons. But how sensitive is the analysis when assaying worms with different levels of mosaicism? Do the results shown in the paper stem from animals with a full neural set expression? Could the authors add results for which the assayed worms show partial expression where only 80%, 70%, 50% of the cell population are observed, and how this will affect identification accuracy? This may be important as many non-integrated reporter lines show high mosaic patterns and may therefore not be suitable for using this analytic method. In that sense, could the authors describe the mosaic degree of their line used for validating the method.

      We appreciate the reviewer for this comment. We want to clarify that most of the worms used in the construction of the atlas are indeed affected by mosaicism and thus do not express the full set of candidate neurons. We have added such a plot as requested (Figure 3 – figure supplement 2, copied below). Our data show that there is no correlation between the fraction of cells expressed in a worm and neuron ID correspondence. We agree with the reviewer this additional insight may be helpful; we have modified the text to include this discussion: “Note that we observed no correlation between the degree of mosaicism and neuron ID correspondence (Figure 3- figure supplement 2).” (page 10).

      Author response image 1.

      No correlation between the degree of mosaicism (fraction of cells expressed in the worm) and neuron ID correspondence.

      1. For the gene expression analysis (Fig 5), where was the intensity of the GFP extracted from? As it has no nuclear tag, the protein should be cytoplasmic (as seen in Fig 5a), but in Fig 5c it is shown as if the region of interest to extract fluorescence was nuclear. If fluorescence was indeed extracted from the cytoplasm, then it will be helpful to include in the software and in the results description how this was done, as a huge hurdle in dissecting such multi-cell images is avoiding crossreads between adjacent/intersecting neurons.

      For this work, we used nuclear-localized RFP co-expressed in the animal, and the GFP intensities were extracted from the same regions from which the RFP intensities were extracted. If cytosolic reporters are used, one would imagine a membrane label would be necessary to discern the border of the cells. We clarified our reagents and approach in the text: “The segmentation was done on the nuclear-localized mCherry signals, and GFP intensities were extracted from the same region.” (page 21).

      1. In the same matter: In the methods, it is specified that the strain expressing GCAMP was also used in the gene expression analysis shown in Figure 5. But the calcium indicator may show transient intensities depending on spontaneous neural activity during the imaging. This will introduce a significant variability that may affect the expression correlation analysis as depicted in Figure 5.

      We apologize for the error in text. The strain used in the gene expression analysis did not express GCaMP. We did not analyze GCaMP expression in figure 5. We have corrected the error in the methods.

      Reviewer #2 (Public Review):

      The authors succeed in generalizing the pre-alignment procedure for their cell identification method to allow it to work effectively on data with only small subsets of cells labeled. They convincingly show that their extension accurately identifies head angle, based on finding auto fluorescent tissue and looking for a symmetric l/r axis. They demonstrate that the method works to identify known subsets of neurons with varying accuracy depending on the nature of underlying atlas data. Their approach should be a useful one for researchers wishing to identify subsets of head neurons in C. elegans, for example in whole brain recording, and the ideas might be useful elsewhere.

      The authors also strive to give some general insights on what makes a good atlas. It is interesting and valuable to see (at least for this specific set of neurons) that 5-10 ideal examples are sufficient. However, some critical details would help in understanding how far their insights generalize. I believe the set of neurons in each atlas version are matched to the known set of cells in the sparse neuronal marker, however this critical detail isn't explicitly stated anywhere I can see.

      This is an important point. We have made text modifications to make it clear to the readers that for all atlases, the number of entities (candidate list) was kept consistent as listed in the methods. In the results section under “CRF_ID 2.0 for automatic cell annotation in multi-cell images,” we added the following sentence: “Note that a truncated candidate list can be used for subset-specific cell ID if the neuronal expression is known” (page 3). In the methods section, we added the following sentence: “For multi-cell neuron predictions on the glr-1 strain, a truncated atlas containing only the above 37 neurons was used to exclude neuron candidates that are irrelevant for prediction” (page 20).

      In addition, it is stated that some neuron positions are missing in the neuropal data and replaced with the (single) position available from the open worm atlas. It should be stated how many neurons are missing and replaced in this way (providing weaker information).

      We modified the text in the result section as follows: “Eight out of 37 candidate neurons are missing in the neuroPAL atlas, which means 40% of the pairwise relationships of neurons expressing the glr-1p::NLS-mcherry transgene were not augmented with the NeuroPAL data but were assigned the default values from the OpenWorm atlas” (page 10).
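
      The step from “eight out of 37 neurons” to “40% of the pairwise relationships” can be checked with a short calculation. The sketch below assumes that a pairwise relationship counts as not augmented whenever at least one of its two neurons is missing from the NeuroPAL data; that reading, and the function name, are assumptions made here rather than statements from the manuscript.

```python
from math import comb


def fraction_pairs_with_missing(n_total, n_missing):
    """Fraction of unordered neuron pairs that involve at least one
    neuron missing from the atlas."""
    total_pairs = comb(n_total, 2)
    # Pairs in which both neurons are present in the atlas.
    intact_pairs = comb(n_total - n_missing, 2)
    return (total_pairs - intact_pairs) / total_pairs


frac = fraction_pairs_with_missing(37, 8)
print(f"{frac:.1%}")  # prints 39.0%
```

      With 37 candidates there are 666 unordered pairs, of which 406 involve only the 29 neurons present in the NeuroPAL data. The remaining 260 pairs (about 39%, i.e. roughly 40%) fall back to the OpenWorm default values.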

      It also is not explicitly stated that the putative identities for the uncertain cells (designated with Greek letters) are used to sample the neuropal data. Large numbers of openworm single positions or, if uncertain cells are misidentified, forcing alignment against the positions of nearby but different cells would both handicap the neuropal atlas relative to the matched fluorescence atlas. This is an important question since sufficient performance from an ideal neuropal atlas (subsampled) would avoid the need for building custom atlases per strain.

      The putative identities are not used to sample the NeuroPAL data. They were used in the glr-1 multi-cell case to indicate low confidence in manual identification/annotation. For all steps of manual annotation and CRF_ID predictions, we used real neuron labels, and the Greek labels were used for reporting purposes only. It is true that the OpenWorm values (40% of the atlas) would be a handicap for the neuroPAL atlas. This is mainly due to the difficulty of obtaining NeuroPAL data as it requires 3-color fluorescence microscopy and significant time and labor to annotate the large set of neurons. This is one reason to take a complementary approach as we do in this paper.

      Reviewer #1 (Recommendations For The Authors):

      1. Figure 3, there is a confusion in the legend relating to panels c-e (e.g. panel c is neuron ID accuracy but it is described per panel e in the legend).

      We made the necessary changes.

      1. Figure 3, were statistical tests performed for panels d-e? if so, and the outcome was not significant, then it might be good to indicate this in the legend.

      We have added results of statistical tests in the legend as the following sentence: “All distributions in panel d and e had a p-value of less than 0.0001 for one sample t-test against zero.” One-sample t-tests were performed because what is plotted already represents each atlas’ differences to the glr-1 25 dataset atlas; we did not think that statistical analyses between the other atlases would add significant value.

      1. Figure 4, no asterisks are shown in the figure so it is possible to remove the sentence in the legend describing what the asterisk stands for.

      Thank you. We made the necessary changes.

      Reviewer #2 (Recommendations For The Authors):

      Comparison with deep learning approaches could be more nuanced and structured. The authors' (prior) approach extended here combines a specific set of comparative relationship measurements with a general optimization approach for matching based on comparative expectations. Other measurements could be used, whether explicit (like neighbor expectations) or learned differences in embeddings. These alternate measurements would both need to be extensively re-calibrated for different sets of cells but might provide significant performance gains. In addition, deep learning approaches don't solve the optimization part of the matching problem, so the authors' approach seems to bring something strong to the table even if one is committed to learned methods (necessary I suspect for human level performance in denser cell sets than the relatively small number here). A more complete discussion of these themes might better frame the impact of the work and help readers think about the advantages and disadvantages of different methods for their own data.

      We thank the reviewer for bringing up this point. We apologize for not making the point clearer in the original submission. This extension of the original work (Chaudhary et al) does not change the CRF-based framework, but only augments the approach with a better-defined set of axes (solely because in multi-cell and not whole-brain datasets, the sparsity of neurons degrades the axis definition and consequently the neuron ID predictions). We are not fundamentally changing the framework, and therefore all the advantages (over registration-based approaches, for example) also apply here. The other purpose of this paper is to demonstrate a couple of use-cases for gene expression analysis, which is common in studies in C. elegans (and other organisms). We hope that by showing a use-case others can see how this approach is useful for their own applications.

      We have clarified these points in the paper (page 18). “The fundamental framework has not been changed from CRF_ID 1.0, and therefore the advantages of CRF_ID outlined in the original work apply for CRF_ID 2.0 as well.”

      The attribution of anatomical differences to strain is interesting, but seems purely speculative, and somewhat unlikely. I would suspect the fundamentally more difficult nature of aligning N items to M>>N items in an atlas accounts for the differences in using the neuroPAL vs custom atlas here. If this is what is meant, it could be stated more clearly.

      It is important to note that the same neuron candidate list (listed in methods) was used for all atlases, so there is no difference among the atlases in terms of the number of cells in the query vs. candidate list. In other words, the same values for M and for N are used regardless of the reference atlas used.

      We have preliminary data indicating differences between the NeuroPAL and custom atlas. For instance, the NeuroPAL atlas scales smaller than the custom glr-1 atlas. Since direct comparisons of the different atlases are beyond the scope of this paper, we will leave the exact comparisons for future work. We suspect that the differences are from a combination of differences in anatomy and imaging conditions. While NeuroPAL atlas may not be exactly fitting for the custom dataset, it can serve as a good starting point for guesses when no custom atlases are available, as we have discussed earlier (response to Public Comments from Reviewer 1 Point 1). As explained earlier, we have added these discussions in the paper (see page 18).

      I was also left wondering if the random removal of landmarks had to be adjusted in this work given it is (potentially) helping cope with not just occasional weak cells but the systematic loss of most of the cells in the atlas. If the parameters of this part of the algorithm don't influence the success for N to M>>N alignment (here when the neuroPAL or OpenWorm atlas is used) this seems interesting in itself and worth discussing. Conversely, if these parameters were optimized for the matched atlas and used for the others, this would seem to bias performance results.

      We may have failed to make this clear in the main text. As we have stated in our responses in the public review section, we do systematically limit the neuron labels in the candidate list to neurons that are known to be expressed by the promoter. The candidate list, which is kept consistent for all atlases, has more neurons than cells in the query, so it is always an N-to-M matching where M>N. We did not use landmarks, but such usage is possible and will only improve the matching.

      We have attempted to clarify these points in the manuscript. In the results section under “CRF_ID 2.0 for automatic cell annotation in multi-cell images,” we added the following sentence: “Note that a truncated candidate list can be used for subset-specific cell ID if the neuronal expression is known” (page 3). In the methods section, we added the following sentence: “For multi-cell neuron predictions on the glr-1 strain, a truncated atlas containing only the above 37 neurons was used to exclude neuron candidates that are irrelevant for prediction” (Page 20).
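
      As a minimal illustration of the truncated candidate list described above, the sketch below restricts a reference atlas to the neurons expected to express the promoter, so that the query always faces an N-to-M matching with M at least N. The atlas layout, the function name `truncate_atlas`, the neuron names and the coordinates are hypothetical placeholders and are not taken from the CRF_ID code base.

```python
def truncate_atlas(atlas, candidates):
    """Keep only atlas entries whose neuron name is in the candidate list."""
    allowed = set(candidates)
    return {name: pos for name, pos in atlas.items() if name in allowed}


# Hypothetical atlas: neuron name -> (x, y, z) position in atlas coordinates.
atlas = {
    "AVAL": (0.0, 1.0, 2.0),
    "AVAR": (0.1, 1.1, 2.1),
    "RMDL": (3.0, 0.5, 1.0),
    "ASHL": (5.0, 2.0, 0.3),  # assumed here NOT to be in the candidate list
}
candidates = ["AVAL", "AVAR", "RMDL"]  # hypothetical candidate list (M = 3)
n_query_cells = 2                      # hypothetical number of segmented cells (N = 2)

truncated = truncate_atlas(atlas, candidates)

# The same truncated list would be reused for every reference atlas,
# so M and N do not change between atlases.
assert len(truncated) >= n_query_cells
```

      Keeping the candidate list fixed and filtering each atlas with it is what makes the comparison between atlases independent of M and N.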

    1. Author Response

      The following is the authors’ response to the previous reviews.

      We thank the reviewers for collectively highlighting our study as “interesting and timely” and as making significant advances regarding the functional role of Orai in the activity of central dopaminergic neurons underlying the development of Drosophila flight behaviour. We hope that based on the revisions detailed below the data supporting our findings will be considered complete.

      Reviewer 1:

      • In this revision, the authors have addressed most points using text changes but there is still one important issue that continues to be inadequately addressed. This relates to point 1.

      If Set2 is acting downstream of SOCE, it is not clear to me how STIM1 over expression rescues Set2-dependent downstream responses in flies that do not have Set2. It seems that if STIM1 over-expression, which would presumably enhance SOCE, largely rescues Set2-dependent effector responses in the Set2RNAi flies, then the proposed pathway cannot be true (because if Set2 is downstream of SOCE, it shouldn't matter whether SOCE is boosted in flies that lack Set2). This discrepancy is not explained. Does STIM1 over-expression somehow restore Set2 expression in the Set2RNAi flies?

      Ans: Based on the requirement of Orai-mediated Ca2+ entry for Set2 expression (THD’>OraiE180A neurons, Figure 2C) we had indeed proposed that rescue of flight in Set2RNAi flies by STIMOE is because Set2 expression in Set2RNAi flies is restored by STIMOE. However, we agree that this has not been tested experimentally. Since these data are supportive but not essential to our findings here, we have removed data demonstrating flight rescue of Set2RNAi by STIMOE from Figure 2 – supplement 5 and associated text from the revised manuscript. We plan to investigate the effect of STIMOE on Set2 in the context of Drosophila dopaminergic neurons in the future.

      Reviewer 2:

      The manuscript analyses the functional role of Orai in the excitability of central dopaminergic neurons in Drosophila. The authors answer the previous concerns, but several important issues have not been experimentally tested. Especially, the lack of characterization of SOCE or calcium release from the intracellular calcium stores limits considerably the impact of the study. They comment on a number of technical problems but, taking into account the nature of the study, based on Orai and SOCE, the lack of these experimental data reduces the relevance of the study. Below are some specific comments:

      1. The response to question 1 is unconvincing. The authors do not demonstrate experimentally that STIM over-expression enhances SOCE or how excess SOCE might overcome the loss of SET2.

      Ans: The reason we have not performed experiments in this manuscript to investigate SOCE under STIM overexpression is two-fold. Firstly, extensive characterisation of SOCE by STIM overexpression in Drosophila pupal neurons forms part of an earlier publication (Chakraborty and Hasan, Front. Mol. Neurosci, 2017). A graph from Chakraborty and Hasan, 2017 shows SOCE measured in primary cultures of pupal neurons from an IP3R mutant (S224F/G1891S) of Drosophila. Reduced SOCE in IP3R mutant neurons (red trace) was restored by overexpression of STIM (black trace). The green trace is of wild-type neurons with STIM overexpression and the grey trace with STIMRNAi. Similar experiments were performed with Orai+STIM overexpression and the rescue in SOCE was compared with STIM overexpression in pupal neurons of wild type and IP3R mutant S224F/G1891S. See Chakraborty and Hasan, 2017 (Front. Mol. Neurosci. 10:111. doi: 10.3389/fnmol.2017.00111).

      Secondly, rescue by STIMOE is supportive but not essential to the findings of this manuscript, which relate primarily to the analysis of an Orai-dependent transcriptional feedback mechanism acting via Trl and Set2 in flight-promoting dopaminergic neurons (see Fig 2C, where we demonstrate that OraiE180A expression in THD’ neurons brings down Set2 expression).

      We agree that we have not demonstrated how loss of Set2 can be compensated by STIM overexpression. Therefore, we have now removed the supplementary data relating to STIM rescue of Set2RNAi (THD’>Set2RNAi; STIMOE) flight phenotypes since as mentioned above it was supportive but not essential to the main theme of the manuscript. Consistent with this, we have also removed rescue of flight in TrlRNAi by STIMOE (Figure 4C).

      1. The authors do not present a characterization of SOCE in the cells investigated expressing native Orai or the dominant negative OraiE180A mutant yet. They comment on some technical problems for in situ determination or using culture cells but, apparently, in previous studies they have reported some results.

      Ans: We respectfully submit that characterisation of SOCE in cells expressing native Orai and OraiE180A from primary cultures of Drosophila pupal dopaminergic neurons forms part of an earlier publication (Pathak, T., et al., (2015). The Journal of Neuroscience, 35, 13784–13799. https://doi.org/10.1523/jneurosci.1680-15.2015). As mentioned in lines 80-84, the dopaminergic neurons studied here (THD’) are a subset of the dopaminergic neurons studied in the Pathak et al., 2015 publication (TH). As evident in Figure 2 panels B-D, expression of OraiE180A in dopaminergic neurons abrogates SOCE.

      In this study we have focused on identifying the molecular mechanism by which OraiE180A expression and concomitant loss of cellular Ca2+ signals (Figure 3B, 3C) affects dopaminergic neuron function. In lines 270-274 (page 10) we have stated the technical reason why Ca2+ measurements made in this study from ex-vivo brain preps measure a composite of ER-Ca2+ release and SOCE. Our observation that the measured Ca2+ response is significantly attenuated in cells expressing OraiE180A leads us to the conclusion that we are indeed measuring an SOCE component in the ex-vivo brain preps. This is also explained in ‘Limitations of the study’.

      1. Concerning the question about the STIM:Orai stoichiometry the authors answer that "We agree that STIM-Orai stoichiometry is essential for SOCE, and propose that the rescue backgrounds possess sufficient WT Orai, which is recruited by the excess STIM to mediate the rescue"; however, again, this is not experimentally tested.

      Ans: To address this point we have now measured relative stoichiometries of STIM and Orai mRNA by qPCR under WT conditions in Drosophila THD’ neurons at 72 hr APF. The observed stoichiometry as per these measurements is STIM:Orai =1.6:1 (~8:5). These data are in relative agreement with the normalised read counts of STIM and Orai in THD’ neurons in the RNAseq performed and described in Fig 1F. The qPCR (A) and RNAseq (B) measures of STIM and Orai are appended below.

      Author response image 1.

      In comparison to the numerous studies investigating structural, biophysical and cellular characterisation of Orai channels in heterologous systems, there are fewer studies which have traced systemic implications of Orai function through multiple tiers of investigation including organismal behaviour. Leveraging the wealth of genetic resources available in Drosophila, we have attempted this here. While we respectfully agree that questions pertaining to the stoichiometries of STIM/Orai proteins are indeed relevant to cellular regulation of SOCE, we submit they may be better suited for investigation in heterologous systems involving cell culture, or with in-vitro systems with purified recombinant proteins, or indeed using computational and modelling approaches. None of these methods fall within the scope of our current investigation, which is to understand how Orai-mediated Ca2+ entry regulates developmental maturation of Drosophila flight-promoting dopaminergic neurons.

    1. Author response:

      The following is the authors’ response to the previous reviews

      Public Reviews: 

      Reviewer #1 (Public review): 

      The authors investigated the role of the C. elegans Flower protein, FLWR-1, in synaptic transmission, vesicle recycling, and neuronal excitability. They confirmed that FLWR-1 localizes to synaptic vesicles and the plasma membrane and facilitates synaptic vesicle recycling at neuromuscular junctions. They observed that hyperstimulation results in endosome accumulation in flwr-1 mutant synapses, suggesting that FLWR-1 facilitates the breakdown of endocytic endosomes. Using tissue-specific rescue experiments, the authors showed that expressing FLWR-1 in GABAergic neurons restored the aldicarb-resistant phenotype of flwr-1 mutants to wild-type levels. By contrast, cholinergic neuron expression did not rescue aldicarb sensitivity at all. They also showed that FLWR-1 removal leads to increased Ca<sup>2+</sup> signaling in motor neurons upon photo-stimulation. From these findings, the authors conclude that FLWR-1 helps maintain the balance between excitation and inhibition (E/I) by preferentially regulating GABAergic neuronal excitability in a cell-autonomous manner. 

      Overall, the work presents solid data and interesting findings, however the proposed cell-autonomous model of GABAergic FLWR-1 function may be overly simplified in my opinion. 

      Most of my previous comments have been addressed; however, two issues remain. 

      (1) I appreciate the authors' efforts conducting additional aldicarb sensitivity assays that combine muscle-specific rescue with either cholinergic or GABAergic neuron-specific expression of FLWR-1. In the revised manuscript, they conclude, "This did not show any additive effects to the pure neuronal rescues, thus FLWR-1 effects on muscle cell responses to cholinergic agonists must be cell-autonomous." However, I find this interpretation confusing for the reasons outlined below. 

      Figure 1 - Figure Supplement 3B shows that muscle-specific FLWR-1 expression in flwr-1 mutants significantly restores aldicarb sensitivity. However, when FLWR-1 is co-expressed in both cholinergic neurons and muscle, the worms behave like flwr-1 mutants and no rescue is observed. Similarly, cholinergic FLWR-1 alone fails to restore aldicarb sensitivity (shown in the previous manuscript).

      These data are still shown in the manuscript, Fig. 3D. We interpreted our finding in the muscle/cholinergic co-rescue experiment as meaning that FLWR-1 in cholinergic neurons over-compensates, so worms should be resistant, and the rescuing effect of muscle FLWR-1 is therefore cancelled. But it is true that, if this were the case, one would have to ask why the pure cholinergic rescue does not show over-compensation. We added a sentence to acknowledge this inconsistency and we added a sentence in the discussion (see also below, comment (1) of reviewer #2).

      These observations indicate a non-cell-autonomous interaction between cholinergic neurons and muscle, rather than a strictly muscle cell-autonomous mechanism. In other words, FLWR-1 expressed in cholinergic neurons appears to negate or block the rescue effect of muscle-expressed FLWR-1. Therefore, FLWR-1 could play a more complex role in coordinating physiology across different tissues. This complexity may affect interpretations of Ca<sup>2+</sup> dynamics and/or functional data, particularly in relation to E/I balance, and thus warrants careful discussion or further investigation. 

      For the Ca<sup>2+</sup> dynamics, we think the effects of flwr-1 are likely very immediate, as the imaging assay relies on a sensor expressed directly in the neurons or muscles under study, and not on indirect phenotypes as muscle contraction and behavior, that depend on an interplay of several cell types influencing each other.

      (2) The revised manuscript includes new GCaMP analyses restricted to synaptic puncta. The authors mention that "we compared Ca<sup>2+</sup> signals in synaptic puncta versus axon shafts, and did not find any differences," concluding that "FLWR-1's impact is local, in synaptic boutons." This is puzzling: the similarity of Ca<sup>2+</sup> signals in synaptic regions and axon shafts seems to indicate a more global effect on Ca<sup>2+</sup> dynamics or may simply reflect limited temporal resolution in distinguishing local from global signals due to rapid Ca<sup>2+</sup> diffusion. The authors should clarify how they reached the conclusion that FLWR-1 has a localized impact at synaptic boutons, given that synaptic and axonal signals appear similar. Based on the presented data, the evidence supporting a local effect of FLWR-1 on Ca<sup>2+</sup> dynamics appears limited.

      We apologize, here we simply overlooked this misleading wording in our rebuttal letter. The data we mentioned, showing no obvious difference in axon vs. bouton, are shown below, including time constants for the onset and the offset of the stimulus (data is peak normalized for better visualization):

      Author response image 1.

      One can see that axonal Ca<sup>2+</sup> signals may rise a bit slower than synaptic Ca<sup>2+</sup> signals, as expected for Ca<sup>2+</sup> entering the boutons, and then diffusing out into the axon. The loss of FLWR-1 does not affect this. However, the temporal resolution of the GCaMP6f sensor used is ca. 200 ms to reach peak, and the decay time (to t1/2) is ca. 400 ms (PMID: 23868258). Thus, it would be difficult to see effects based on Ca<sup>2+</sup> diffusion using this assay. For the decay, this is similar for both axon and synapse, while flwr-1 mutants do not reduce Ca<sup>2+</sup> as much as wt. In the axon, there is a seemingly slightly slower reduction in flwr-1 mutants; however, given the kinetics of the sensor, this is likely not a meaningful difference. Therefore, we wrote that we did not find differences. The interpretation should not have been that the impact of FLWR-1 is local. It may be true if one could image this at faster time scales, i.e. if there is more FLWR-1 localized in boutons (as indicated by our data showing FLWR-1 enrichment in boutons; Fig. 3), and when considering its possible effect on MCA-3 localization (and assuming that MCA-3 is the active player in Ca<sup>2+</sup> removal), i.e. FLWR-1 recruiting MCA-3 to boutons (Fig. 9C, D).
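
      The peak normalization and decay comparison mentioned above can be sketched as follows. This is only an illustration under stated assumptions: the function names, the sampling grid and the example trace are hypothetical, the trace is assumed to be baseline-subtracted, and the half-decay estimate is a simple threshold crossing rather than the fitting procedure used for the time constants in the response image.

```python
def peak_normalize(trace):
    """Scale a baseline-subtracted trace so that its maximum equals 1."""
    peak = max(trace)
    if peak <= 0:
        raise ValueError("trace has no positive peak")
    return [v / peak for v in trace]


def half_decay_time(times_ms, trace):
    """Time from the peak until the trace first falls to half of its peak value.

    Returns None if the trace never decays to half within the recording.
    """
    norm = peak_normalize(trace)
    i_peak = trace.index(max(trace))
    for t, v in zip(times_ms[i_peak:], norm[i_peak:]):
        if v <= 0.5:
            return t - times_ms[i_peak]
    return None


# Hypothetical example: samples every 100 ms, arbitrary fluorescence units.
times_ms = [0, 100, 200, 300, 400, 500]
bouton_trace = [0, 2, 4, 3, 2, 1]

t_half = half_decay_time(times_ms, bouton_trace)

# Differences between compartments that are smaller than the sensor's own
# decay time (ca. 400 ms to t1/2 for GCaMP6f, as stated above) cannot be
# interpreted as differences in the underlying Ca2+ dynamics.
```

      Comparing such estimates between axon and bouton is only informative when the difference exceeds the kinetics of the sensor itself, which is the caveat raised in the response.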

      Reviewer #2 (Public review): 

      Summary: 

      The Flower protein is expressed in various cell types, including neurons. Previous studies in flies have proposed that Flower plays a role in neuronal endocytosis by functioning as a Ca<sup>2+</sup> channel. However, its precise physiological roles and molecular mechanisms in neurons remain largely unclear. This study employs C. elegans as a model to explore the function and mechanism of FLWR-1, the C. elegans homolog of Flower. This study offers intriguing observations that could potentially challenge or expand our current understanding of the Flower protein. Nevertheless, further clarification or additional experiments are required to substantiate the study's conclusions. 

      Strengths: 

      A range of approaches was employed, including the use of a flwr-1 knockout strain, assessment of cholinergic synaptic activity via analyzing aldicarb (a cholinesterase inhibitor) sensitivity, imaging Ca<sup>2+</sup> dynamics with GCaMP3, analyzing pHluorin fluorescence, examination of presynaptic ultrastructure by EM, and recording postsynaptic currents at the neuromuscular junction. The findings include notable observations on the effects of flwr-1 knockout, such as increased Ca<sup>2+</sup> levels in motor neurons, changes in endosome numbers in motor neurons, altered aldicarb sensitivity, and potential involvement of a Ca<sup>2+</sup>-ATPase and PIP2 binding in FLWR-1's function. 

      The authors have adequately addressed most of my previous concerns, however, I recommend minor revisions to further strengthen the study's rigor and interpretation: 

      Major suggestions 

      (1) This study relies heavily on aldicarb assays to support its conclusions. While these assays are valuable, their results may not fully align with direct assessment of neurotransmitter release from motor neurons. For instance, prior work has shown that two presynaptic modulators identified through aldicarb sensitivity assays exhibited no corresponding electrophysiological defects at the neuromuscular junction (Liu et al., J Neurosci 27: 10404-10413, 2007). Similarly, at least one study from the Kaplan lab has noted discrepancies between aldicarb assays and electrophysiological analyses. The authors should consider adding a few sentences in the Discussion to acknowledge this limitation and the potential caveats of using aldicarb assays, especially since some of the aldicarb assay results in this study are not easily interpretable. 

      Aldicarb assays have been used very successfully in identifying mutants with defects in chemical synaptic transmission, and entire genetic screens have been conducted this way. The reviewer is right, one needs to realize that it is the balance of excitation and inhibition at the NMJ of C. elegans, which underlies the effects on the rate of aldicarb-induced paralysis, not just cholinergic transmission. I.e. if a given mutant affects cholinergic and GABAergic transmission differently, things become difficult to interpret, particularly if also muscle physiology is affected. Therefore, we combined mutant analyses with cell-type specific rescue. We acknowledge that results are nonetheless difficult to interpret. We thus added a sentence in the first paragraph of the discussion.

      (2) The manuscript states, "Elevated Ca<sup>2+</sup> levels were not further enhanced in a flwr-1;mca-3 double mutant." (lines 549-550). However, Figure 7C does not include statistical comparisons between the single and double mutants of flwr-1 and mca-3. Please add the necessary statistical analysis to support this statement. 

      This is because we only marked significant differences in that figure; n.s. was not shown. This was stated in the figure legend.

      (3) The term "Ca<sup>2+</sup> influx" should be avoided, as this study does not provide direct evidence (e.g. voltage-clamp recordings of Ca<sup>2+</sup> inward currents in motor neurons) for an effect of the flwr-1 mutation on Ca<sup>2+</sup> influx. The observed increase in neuronal GCaMP signals in response to optogenetic activation of ChR2 may result from, or be influenced by, Ca<sup>2+</sup> mobilization from intracellular stores. For example, optogenetic stimulation could trigger ryanodine receptor-mediated Ca<sup>2+</sup> release from the ER via calcium-induced calcium release (CICR) or depolarization-induced calcium release (DICR). It would be more appropriate to describe the observed increase in Ca<sup>2+</sup> signal as "Ca<sup>2+</sup> elevation" rather than increased "Ca<sup>2+</sup> influx". 

      Ok, yes, we can do this, we referred by ‘influx’ to cytosolic Ca<sup>2+</sup>, that fluxes into the cytosol, be it from the internal stores or the extracellular. Extracellular influx, more or less, inevitably will trigger further influx from internal stores, to our understanding. We changed this to “elevated Ca<sup>2+</sup> levels” or “Ca<sup>2+</sup> level rise” or “Ca<sup>2+</sup> level increase”.

      Recommendations for the authors: 

      Reviewer #1 (Recommendations for the authors):

      A thorough discussion on the impact of cell-autonomous versus non-cell-autonomous effects is necessary. 

      Revise and clarify the distinction between local and global Ca²⁺ changes. 

      see above.

      Reviewer #2 (Recommendations for the authors): 

      Minor suggestions 

      (1) In "Fwe-Ubi was shown to facilitate recovery of neurons following intense synaptic activity (Yao et al.,....." (lines 283-284), please specify which aspects of neuronal recovery are influenced by the Flower protein. 

      We added “refilling of SV pools”.

      (2) The abbreviation "Fwe-Ubi" is used for the Drosophila Flower protein (e.g., line 283, Figure 1A, and Figure 8A). Please clarify what "Ubi" stands for and verify whether its inclusion in the protein name is appropriate.

      This is inconsistent across the literature; sometimes Fwe-Ubi is also referred to as FweA. We now added this term. Ubi refers to ubiquitous (“Therefore, we named this isoform fweubi because it is expressed ubiquitously in imaginal discs”) (Rhiner 2010).

      (3) The manuscript uses "pflwr-1" (line 303 and elsewhere) to denote the flwr-1 promoter. This notation could be misleading, as it may be interpreted as a gene name. Please consider using either "flwr-1p" or "Pflwr-1" instead. Additionally, ensure proper italicization of gene names throughout the manuscript. 

      We changed this throughout. We will italicize gene names at proof stage; it would be too time-consuming to spot these instances now.

      (4) The authors tagged the C-terminus of FLWR-1 by GFP (lines 321). The fusion protein is referred to as "GFP::FLWR-1" throughout the manuscript. Please verify whether "FLWR-1::GFP" would be the more appropriate designation.

      Thank you, yes, we changed this in the text, GFP is indeed N-terminal.

      (5) In "This did not show any additive effects...." (line 363), please clarify what "This" refers to. 

      Altered to “The combined rescues did not show any additive effects…”

      (6) In "..., supporting our previous finding of increased neurotransmitter release in GABAergic neurons" (lines 412-413), please provide a citation for the referenced previous study.

      This refers to our aldicarb data within this paper, just further up in the text. We removed “previous”.

      (7) Figure 4C, D examines the effect of flwr-1 mutation on body length in the genetic background of the unc-29 mutation, which selectively disrupts the levamisole-sensitive acetylcholine receptor. Please comment on the rationale for implicating only the levamisole receptor rather than the nicotinic acetylcholine receptor in muscle cells. 

      This was because we used a behavioral assay. Despite the fact that the homopentameric ACR-16/N-AChR mediates about 2/3 of the peak currents in response to acute ACh application to the NMJ (e.g. Almedom et al., EMBO J, 2009), the acr-16 mutant has virtually no behavioral / locomotion phenotype. Likely, this is because the heteropentameric, UNC-29-containing L-AChR, while only contributing 1/3 of the peak current, desensitizes much more slowly, and thus unc-29 mutants show a severe behavioral phenotype (uncoordinated locomotion, etc.). We therefore did not expect a major effect when performing the behavioral assay in acr-16 mutants and chose the unc-29 mutant background.

      (8) In "we found no evidence ....insertion into the PM (Yao et al., 2009)", It appears that the cited paper was not authored by any of the current manuscript. Please confirm whether this citation is correctly attributed. 

      This sentence was arranged in a misleading way; we did not mean that we authored this paper. It was changed in the text: “While a facilitating role of Flower in endocytosis appears to be conserved in C. elegans, in contrast to previous findings from Drosophila (Yao et al., 2009), we found no evidence that FLWR-1 conducts Ca<sup>2+</sup> upon insertion into the PM.”

    1. Author response:

      The following is the authors’ response to the previous reviews

      Reviewer #1 (Public review):

      Summary:

      This study examines to what extent this phenomenon varies based on the visibility of the saccade target. Visibility is defined as the contrast level of the target with respect to the noise background, and it is related to the signal-to-noise ratio of the target. A more visible target facilitates oculomotor planning and execution; however, as speculated by the authors, it can also benefit foveal prediction even if the foveal stimulus visibility is kept constant. Remarkably, the authors show that presenting a highly visible saccade target is beneficial for foveal vision, as detection of stimuli with an orientation similar to that of the saccade target is improved; the lower the saccade target visibility, the less prominent this effect.

      Strengths:

      The results are convincing and the research methodology is technically sound.

      Weaknesses:

      It is still unclear why the pre-saccadic enhancement would oscillate for targets with higher opacity levels, and what would be the benefit of this oscillatory pattern. The authors do not speculate too much on this and loosely relate it to feedback processes, which are characterized by neural oscillations in a similar range.

      We thank the reviewer for their assessment. We intentionally decided to describe the oscillatory pattern without claiming to be able to pinpoint its origin. The finding was incidental and, based on psychophysical data alone, we would not feel comfortable doing anything but loosely relating it to potential mechanisms on an explicitly speculative basis. In the potential explanation we provide in the manuscript, the oscillatory pattern would likely not serve a benefit–rather, it would constitute an innate consequence and, thus, a coincidental perceptual signature of potential feedback processes.

      Reviewer #2 (Public review):

      Summary:

      In this manuscript, the authors ran a dual task. Subjects monitored a peripheral location for a target onset (to generate a saccade to), and they also monitored a foveal location for a foveal probe. The foveal probe could be congruent or incongruent with the orientation of the peripheral target. In this study, the authors manipulated the conspicuity of the peripheral target, and they saw changes in performance in the foveal task. However, the changes were somewhat counterintuitive.

      We regret that our findings remain counterintuitive to the reviewer even after our extensive explanations in the previous revision round and the corresponding changes in the manuscript. We repeat that both the decrease in foveal Hit Rates and the increase in foveal enhancement with increasing target contrast were expected and preregistered prior to data collection.

      Strengths:

      The authors use solid analysis methods and careful experimental design.

      Comments on revisions:

      The authors have addressed my previous comments.

      One minor thing is that I am confused by their assertion that there was no smoothing in the manuscript (other than the newly added time course analysis). Figure 3A and Figure 6 seem to have smoothing to me.

      When the reviewer suggested that the “data appear too excessively smoothed” in the first revision, we assumed that they were referring to pre-saccadic foveal Hit and False Alarm rates, not to fitted distributions. As we state in the legend of Figure 3A (as well as in Figures 6 and S1), the “smoothed” curves constitute the probability density distributions of our raw data. Concerning the energy maps resulting from reverse correlation analyses, we described our proceeding in detail in our initial article (Kroell & Rolfs, 2022): 

      “Using this method, we obtained filter responses for 260 SF*ori combinations per noise image (Figure 6 in Materials and methods, ‘Stimulus analysis’). SFs ranged from 0.33 to 1.39 cpd (in 20 equal increments). Orientations ranged from –90–90° (in 13 equal increments). To normalize the resulting energy maps, we z-transformed filter responses using the mean and standard deviation of filter responses from the set of images presented in a certain session. To obtain more fine-grained maps, we applied 2D linear interpolations by iteratively halving the interval between adjacent values 4 times in each dimension. To facilitate interpretability, we flipped the energy maps of trials in which the target was oriented to the left. In all analyses and plots,+45° thus corresponds to the target’s orientation while –45° corresponds to the other potential probe orientation. Filter responses for all response types are provided at https://osf.io/v9gsq/.”

      We have added a pointer to this explanation to the current manuscript (see line 836).
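
      The processing steps quoted above (z-transform, repeated interval halving, and flipping for left-oriented targets) can be sketched as follows. This is a minimal sketch, not the authors' implementation: the function names are hypothetical, the session mean and standard deviation are assumed to be computed elsewhere, and the example map is synthetic. Rows are taken to be spatial frequencies (20 values) and columns orientations (13 values).

```python
def z_transform(energy_map, session_mean, session_sd):
    """Normalize filter responses with session-wide mean and SD (supplied by the caller)."""
    return [[(v - session_mean) / session_sd for v in row] for row in energy_map]


def halve_intervals_1d(values):
    """Insert the linear midpoint between every pair of adjacent values."""
    out = []
    for a, b in zip(values, values[1:]):
        out.extend([a, (a + b) / 2])
    out.append(values[-1])
    return out


def refine(energy_map, n_iter=4):
    """2D linear interpolation: halve the grid spacing n_iter times in each dimension."""
    grid = [list(row) for row in energy_map]
    for _ in range(n_iter):
        grid = [halve_intervals_1d(row) for row in grid]              # orientation axis
        cols = [halve_intervals_1d(list(col)) for col in zip(*grid)]  # SF axis
        grid = [list(row) for row in zip(*cols)]
    return grid


def flip_orientation(energy_map):
    """Mirror the orientation axis so that +45 deg always denotes the target orientation."""
    return [row[::-1] for row in energy_map]


# Synthetic 20 x 13 map (20 SFs x 13 orientations = 260 combinations).
raw = [[float(i + j) for j in range(13)] for i in range(20)]
fine = refine(raw, n_iter=4)

# Each halving turns n samples into 2n - 1, so four halvings give
# (n - 1) * 16 + 1 samples per dimension.
n_sf, n_ori = len(fine), len(fine[0])
```

      Because the orientation axis is symmetric around 0°, reversing the columns is enough to map a left-oriented target onto the +45° convention.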

      Another minor comment is related to the comment of Reviewer 1 about oscillations. Another possible reason for what looks like oscillations is saccadic inhibition. When the foveal probe appears, it can reset the saccade generation process. When aligned to saccade onset, this appears like a characteristic change in different parameters that is time-locked to saccade onset (about 100 ms earlier). So, maybe the apparent oscillation is a manifestation of such resetting and it's not really an oscillation. So, I agree with Reviewer 1 about removing the oscillation sentence from the abstract.

      While we understand that a visible probe will result in saccadic inhibition (White & Rolfs, 2016), we are unsure how a resetting of the saccade generation process should manifest in increased perceptual enhancement of a specific, peripheral target orientation in the presaccadic fovea. Moreover, as we describe in our initial article (Kroell & Rolfs, 2022), we updated the background noise image every 50 ms and embedded our probe stimulus into the surrounding noise using smooth orientation filters and raised cosine masks to avoid a disruptive influence of probe appearance on movement planning and execution (Hanning, Deubel, & Szinte, 2019). And indeed, we demonstrated that the appearance of the foveal probe did not disrupt saccade preparation, that is, did not increase saccade latencies compared to ‘probe absent’ trials in which no foveal probe was presented (see Kroell & Rolfs, 2022; sections “Parameters of included saccades in Experiment 1” and “Parameters of included saccades in Experiment 2”). In the current submission, saccade latencies in ‘probe present’ trials exceeded saccade latencies in ‘probe absent’ trials by a mere 4.7±2.3 ms. Additionally, to inspect the variation of saccade execution frequency directly, we aligned the number of saccade generation instances to the onset of the foveal probe stimulus (see Author response image 1). In line with what we described in a previous paradigm employing flickering bandpass filtered noise patches (Kroell & Rolfs, 2021; 10.1016/j.cortex.2021.02.021), we observed a regular variation in saccade execution frequency that reflected the duration of an individual background noise image (50 ms in this investigation). In other words, the repeated dips in saccadic frequency are likely caused by the flickering background noise and not the onset of the foveal probe which would produce a single dip ~100 ms after probe onset. 
      Given these results, we do not see a straightforward explanation for how a variation of saccade execution frequency at 20 Hz would boost peripheral-to-foveal feature prediction before the saccade at ~10 Hz. Nonetheless, we removed the sentence referencing oscillations from the Abstract.

      Author response image 1.

       

      Recommendations for the authors:

      Reviewer #1 (Recommendations for the authors):

      Overall, the authors did a good job in addressing the points I raised. Two new sections were added to the manuscript, one to address how the mechanisms of foveal predictions would play out in natural viewing conditions, and another one examining in more depth the potential neural mechanisms implicated in foveal predictions. I found these two sections to be quite speculative and, at points, a bit convoluted, but they could help the reader get the bigger picture. I still do not have a clear sense of why the pre-saccadic enhancement would oscillate for targets with higher opacity levels, and what would be the benefit of this oscillatory pattern. The authors do not speculate too much on this and loosely relate it to feedback processes, which are characterized by neural oscillations in a similar range.

      Please see our response to ‘Weaknesses’.

      I still find this a loose connection and would suggest removing the following phrase from the abstract "Interestingly, the temporal frequency of these oscillations corresponded to the frequency range typically associated with neural feedback signaling". 

      We have removed this phrase.

      Finally, the authors should specify how much of this oscillation is due to oscillations in HR of cong vs. oscillations in HR of incongruent trials or both.

      We fitted separate polynomials to congruent and incongruent Hit Rates instead of their difference. Peaks in enhancement relied on both, oscillatory increases in congruent Hit Rates and simultaneous decreases in incongruent Hit Rates. In other words, enhancement peaks appear to reflect a foveal enhancement of target-congruent feature information along with a concurrent suppression of target-incongruent features. We added this paragraph and Figure 4 to the Results section.

      Additional changes:

      Two figures had accidentally been labeled as Figure 5 in our first revision. We corrected the figure legends and all corresponding figure references in the text.

    1. Author response:

      The following is the authors’ response to the previous reviews.

      As to the exceptionally minor issue, namely, correction for multiple statistical tests (minor because the data and the error are presented in the text). We have now conducted one-way ANOVA to back the data displayed in Fig 4A., and Supp. Figs 19 and 21. In each case ANOVA revealed a highly significant difference among means: Dunnett’s post hoc test was then used to test each result against SBW25, with the multiple comparisons corrected for in the analysis.

      This resulted in changes to the description of the statistical analysis in the following captions:

      To Figure 4.

      Where we previously referred to paired t-tests we now state:  ANOVA revealed a highly significant difference among means [F<sub>7,16</sub> = 8.19, p < 0.001] with Dunnett’s post-hoc test adjusted for multiple comparisons showing that five genotypes (*) differ significantly (p < 0.05) from SBW25.

      To Supplementary Figure 19.

      Where we previously referred to paired t-tests we now state: ANOVA revealed a highly significant difference among means [F<sub>7,16</sub> = 16.74, p < 0.001] with Dunnett’s post-hoc test adjusted for multiple comparisons showing that three genotypes (*) differ significantly (p < 0.05) from SBW25.

      To Supplementary Figure 21.

      Where we previously referred to paired t-tests we now state:  ANOVA revealed a highly significant difference among means [F<sub>7,89</sub> = 9.97, p < 0.0001] with Dunnett’s post-hoc test adjusted for multiple comparisons showing that SBW25 ∆mreB and SBW25 ∆PFLU4921-4925 are significantly different (*) from SBW25 (p < 0.05).
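      The degrees of freedom reported above follow directly from the design: with eight genotypes and three biological replicates each, a one-way ANOVA has 7 between-group and 16 within-group degrees of freedom. The following is a minimal sketch of how the F statistic is obtained, using only the Python standard library. The measurements are invented placeholders, not the data behind Fig. 4A, and the function name is hypothetical. The p-value and Dunnett's post-hoc comparisons are not reproduced here; recent SciPy versions provide `scipy.stats.f_oneway` and `scipy.stats.dunnett` for those steps.

```python
from statistics import mean


def one_way_anova(groups):
    """Return (F, df_between, df_within) for a list of measurement groups."""
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    grand_mean = mean([x for g in groups for x in g])

    # Variation of the group means around the grand mean.
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    # Variation of individual measurements around their own group mean.
    ss_within = sum((x - mean(g)) ** 2 for g in groups for x in g)

    df_between = k - 1
    df_within = n_total - k
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return f_stat, df_between, df_within


# Hypothetical data: 8 genotypes (control first), 3 replicates each.
groups = [
    [1.00, 1.05, 0.95],
    [0.80, 0.85, 0.78],
    [0.90, 0.88, 0.93],
    [0.70, 0.75, 0.72],
    [1.02, 0.98, 1.01],
    [0.60, 0.66, 0.63],
    [0.85, 0.80, 0.83],
    [0.95, 0.97, 0.92],
]

f_stat, df_between, df_within = one_way_anova(groups)
print(f"F({df_between},{df_within}) = {f_stat:.2f}")
```

      Dunnett's test then compares each mutant with the single control (SBW25) while controlling the family-wise error rate, which is why it replaces the uncorrected paired t-tests.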


      The following is the authors’ response to the original reviews.

      Public Reviews: 

      Reviewer #1 (Public Review): 

      Summary: 

      The authors performed experimental evolution of MreB mutants that have a slow-growing round phenotype and studied the subsequent evolutionary trajectory using analysis tools from molecular biology. It was remarkable and interesting that they found that the original phenotype was not restored (most common in these studies) but that the round phenotype was maintained. 

      Strengths: 

      The finding that the round phenotype was maintained during evolution rather than that the original phenotype, rod-shaped cells, was recovered is interesting. The paper extensively investigates what happens during adaptation with various different techniques. Also, the extensive discussion of the findings at the end of the paper is well thought through and insightful. 

      Weaknesses: 

      I find there are three general weaknesses: 

      (1) Although the paper states in the abstract that it emphasizes "new knowledge to be gained" it remains unclear what this concretely is. On page 4 they state three research questions; these could be more extensively discussed in the abstract. Also, these questions read more like genetics questions while the paper is a lot about cell biological findings. 

      Thank you for drawing attention to the unnecessary and gratuitous nature of the last sentence of the Abstract. We are in agreement. It has been modified, and we have taken  advantage of additional word space to draw attention to the importance of the two competing (testable) hypotheses laid out in the Discussion. 

      As to new knowledge, please see the Results and particularly the Discussion. But beyond this, and as recognised by others, there is real value for cell biology in seeing how (and whether) selection can compensate for effects that are deleterious to fitness. The results will very often depart from those delivered from, for example, suppressor analyses, or bottom up engineering. 

      In the work recounted in our paper, we chose to focus – by way of proof-of principle – on the most commonly observed mutations, namely, those within pbp1A.  But beyond this gene, we detected mutations  in other components of the cell shape / division machinery whose connections are not yet understood and which are the focus of on-going investigation.  

      As to the three questions posed at the end of the Introduction, the first concerns whether selection can compensate for deleterious effects of deleting mreB (a question that pertains to evolutionary aspects); the second seeks understanding of genetic factors; the third aims to shed light on the genotype-to-phenotype map (which is where the cell biology comes into play).  Given space restrictions, we cannot see how we could usefully expand, let alone discuss, these three questions in the restrictive space available in the Abstract.   

      (2) It is not clear to me from the text what we already know about the restoration of MreB loss from suppressors studies (in the literature). Are there suppressor screens in the literature and which part of the findings is consistent with suppressor screens and which parts are new knowledge?  

      As stated in the Introduction, a previous study with B. subtilis (which harbours three MreB isoforms and where the isoform named “MreB” is essential for growth under normal conditions), suppressors of MreB lethality were found to occur in ponA, a class A penicillin binding protein (Kawai et al., 2009). This led to recognition that MreB plays a role in recruiting Pbp1A to the lateral cell wall. On the other hand, Patel et al. (2020) have shown that deletion of classA PBPs leads to an up-regulation of rod complex activity. Although there is a connection between rod complex and class A PBPs, a further study has shown that the two systems work semi-autonomously (Cho et al., 2016). 

      Our work confirms a connection between MreB and Pbp1A, and has shed new light on how this interaction is established by means of natural selection, which targets the integrity of cell wall. Indeed, the Rod complex and class A PBPs have complementary activities in the building of the cell wall with each of the two systems able to compensate for the other in order to maintain cell wall integrity. Please see the major part of the Discussion. In terms of specifics, the connection between mreB and pbp1A (shown by Kawai et al (2009)) is indirect because it is based on extragenic transposon insertions. In our study, the genetic connection is mechanistically demonstrated.  In addition, we capture that the evolutionary dynamics is rapid and we finally enriched understanding of the genotype-to-phenotype map.

      (3) The clarity of the figures, captions, and data quantification need to be improved.  

      Modifications have been implemented. Please see responses to specific queries listed below.

      Reviewer #2 (Public Review): 

      Yulo et al. show that deletion of MreB causes reduced fitness in P. fluorescens SBW25 and that this reduction in fitness may be primarily caused by alterations in cell volume. To understand the effect of cell volume on proliferation, they performed an evolution experiment through which they predominantly obtained mutations in pbp1A that decreased cell volume and increased viability. Furthermore, they provide evidence to propose that the pbp1A mutants may have decreased PG cross-linking which might have helped in restoring the fitness by rectifying the disorganised PG synthesis caused by the absence of MreB. Overall this is an interesting study. 

      Queries: 

      Do the small cells of mreB null background indeed have no DNA? It is not apparent from the DAPI images presented in Supplementary Figure 17. A more detailed analysis will help to support this claim. 

      It is entirely possible that small cells have no DNA, because if cell division is aberrant then division can occur prior to DNA segregation resulting in cells with no DNA. It is clear from microscopic observation that both small and large cells do not divide. It is, however, true, that we are unable to state – given our measures of DNA content – that small cells have no DNA. We have made this clear on page 13, paragraph 2.

      What happens to viability and cell morphology when pbp1A is removed in the mreB null background? If it is actually a decrease in pbp1A activity that leads to the rescue, then pbp1A- mreB- cells should have better viability, reduced cell volume and organised PG synthesis. Especially as the PG cross-linking is almost at the same level as the T362 or D484 mutant.  

      Please see fitness data in Supp. Fig. 13. Fitness of ∆mreBpbp1A is no different to that caused by a point mutation. Cells remain round.  

      What is the status of PG cross-linking in ΔmreB Δpflu4921-4925 (Line 7)? 

      This was not analysed as the focus of this experiment was PBPs. A priori, there is no obvious reason to suspect that ∆4921-25 (which lacks oprD) would be affected in PBP activity.

      What is the morphology of the cells in Line 2 and Line 5? It may be interesting to see if PG cross-linking and cell wall synthesis is also altered in the cells from these lines. 

      The focus of investigation was restricted to L1, L4 and L7. Indeed, it would be interesting to look at the mutants harbouring mutations in ftsZ, but this is beyond the scope of the present investigation (but is on-going). The morphology of L2 and L5 is shown in Supp. Fig. 9.

      The data presented in 4B should be quantified with appropriate input controls. 

      Band intensity has now been quantified (see new Supp. Fig. 20). The controls are SBW25, SBW25 ∆pbp1A, SBW25 ∆mreB and SBW25 ∆mreBpbp1A as explained in the paper.

      What are the statistical analyses used in 4A and what is the significance value? 

      Our oversight. These were reported in Supp. Fig. 19, but should also have been presented in Fig. 4A. Data are means of three biological replicates. The statistical tests are comparisons between each mutant and SBW25, and assessed by paired t-tests.  

      A more rigorous statistical analysis indicating the number of replicates should be done throughout. 

      We have checked and made additions where necessary and where previously lacking. In particular, details are provided in Fig. 1E, Fig. 4A and Fig. 4B. For Fig. 4C we have produced quantitative measures of heterogeneity in new cell wall insertion. These are reported in Supp. Fig. 21 (and referred to in the text and figure caption) and show that patterns of cell wall insertion in ∆mreB are highly heterogeneous.

      Reviewer #3 (Public Review): 

      This paper addresses an understudied problem in microbiology: the evolution of bacterial cell shape. Bacterial cells can take a range of forms, among the most common being rods and spheres. The consensus view is that rods are the ancestral form and spheres the derived form. The molecular machinery governing these different shapes is fairly well understood but the evolutionary drivers responsible for the transition between rods and spheres are not. Enter Yulo et al.'s work. The authors start by noting that deletion of a highly conserved gene called MreB in the Gram-negative bacterium Pseudomonas fluorescens reduces fitness but does not kill the cell (as happens in other species like E. coli and B. subtilis) and causes cells to become spherical rather than their normal rod shape. They then ask whether evolution for 1000 generations restores the rod shape of these cells when propagated in a rich, benign medium. 

      The answer is no. The evolved lineages recovered fitness by the end of the experiment, growing just as well as the unevolved rod-shaped ancestor, but remained spherical. The authors provide an impressively detailed investigation of the genetic and molecular changes that evolved. Their leading results are: 

      (1) The loss of fitness associated with MreB deletion causes high variation in cell volume among sibling cells after cell division. 

      (2) Fitness recovery is largely driven by a single, loss-of-function point mutation that evolves within the first ~250 generations that reduces the variability in cell volume among siblings. 

      (3) The main route to restoring fitness and reducing variability involves loss of function mutations causing a reduction of TPase and peptidoglycan cross-linking, leading to a disorganized cell wall architecture characteristic of spherical cells. 

      The inferences made in this paper are on the whole well supported by the data. The authors provide a uniquely comprehensive account of how a key genetic change leads to gains in fitness and the spectrum of phenotypes that are impacted and provide insight into the molecular mechanisms underlying models of cell shape. 

      Suggested improvements and clarifications include: 

      (1) A schematic of the molecular interactions governing cell wall formation could be useful in the introduction to help orient readers less familiar with the current state of knowledge and key molecular players. 

      We understand that this would be desirable, but there are numerous recent reviews with detailed schematics that we think the interested reader would be better consulting. These are referenced in the text.

      (2) More detail on the bioinformatics approaches to assembling genomes and identifying the key compensatory mutations are needed, particularly in the methods section. This whole subject remains something of an art, with many different tools used. Specifying these tools, and the parameter settings used, will improve transparency and reproducibility, should it be needed. 

      We overlooked providing this detail, which has now been corrected by provision of more information in the Materials and Methods. In short, we used breseq in clonal mode with default parameters. Additional analyses were conducted using Geneious. The breseq output files (which include all read data) are provided at https://doi.org/10.17617/3.CU5SX1.

      (3) Corrections for multiple comparisons should be used and reported whenever more than one construct or strain is compared to the common ancestor, as in Supplementary Figure 19A (relative PG density of different constructs versus the SBW25 ancestor). 

      The data presented in Supp Fig 19A (and Fig 4A) do not involve multiple comparisons. In each instance the comparison is between SBW25 and each of the different mutants. A paired t-test is thus appropriate.

      (4) The authors refrain from making strong claims about the nature of selection on cell shape, perhaps because their main interest is the molecular mechanisms responsible. However, I think more can be said on the evolutionary side, along two lines. First, they have good evidence that cell volume is a trait under strong stabilizing selection, with cells of intermediate volume having the highest fitness. This is notable because there are rather few examples of stabilizing selection where the underlying mechanisms responsible are so well characterized. Second, this paper succeeds in providing an explanation for how spherical cells can readily evolve from a rod-shaped ancestor but leaves open how rods evolved in the first place. Can the authors speculate as to how the complex, coordinated system leading to rods first evolved? Or why not all cells have lost rod shape and become spherical, if it is so easy to achieve? These are important evolutionary questions that remain unaddressed. The manuscript could be improved by at least flagging these as unanswered questions deserving of further attention. 

      These are interesting points, but our capacity to comment is entirely speculative. Nonetheless, we have added an additional paragraph to the Discussion that expresses an opinion that has yet to receive attention:

      “Given the complexity of the cell wall synthesis machinery that defines rod-shape in bacteria, it is hard to imagine how rods could have evolved prior to cocci. However, the cylindrical shape offers a number of advantages. For a given biomass (or cell volume), shape determines surface area of the cell envelope, which is the smallest surface area associated with the spherical shape. As shape sets the surface/volume ratio, it also determines the ratio between supply (proportional to the surface) and demand (proportional to cell volume). From this point of view, it is more efficient to be cylindrical (Young 2006). This also holds for surface attachment and biofilm formation (Young 2006). But above all, for growing cells, the ratio between supply and demand is constant in rod shaped bacteria, whereas it decreases for cocci. This requires that spherical cells evolve complex regulatory networks capable of maintaining the correct concentration of cellular proteins despite changes in surface/volume ratio. From this point of view, rod-shaped bacteria offer opportunities to develop unsophisticated regulatory networks.”
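      The supply/demand argument in the paragraph above can be illustrated numerically. The sketch below is an idealisation under stated assumptions: the rod is treated as a cylinder of fixed radius that grows only in length, with end caps ignored, and the coccus as a sphere that grows in radius. The function names and the example volumes are hypothetical.

```python
import math


def sphere_surface_to_volume(volume):
    """Surface/volume ratio of a sphere with the given volume (equals 3/r)."""
    radius = (3.0 * volume / (4.0 * math.pi)) ** (1.0 / 3.0)
    return 3.0 / radius


def rod_surface_to_volume(volume, radius):
    """Surface/volume ratio of an idealised cylinder (end caps ignored).

    The cylinder elongates at fixed radius, so the ratio reduces to 2/radius
    and does not depend on the volume.
    """
    length = volume / (math.pi * radius ** 2)
    lateral_surface = 2.0 * math.pi * radius * length
    return lateral_surface / volume


# Hypothetical growth from birth volume to twice that volume (arbitrary units).
for volume in (1.0, 1.5, 2.0):
    print(
        f"V={volume:.1f}  sphere S/V={sphere_surface_to_volume(volume):.3f}  "
        f"rod S/V={rod_surface_to_volume(volume, radius=0.5):.3f}"
    )
```

      Under these assumptions the rod's ratio stays constant throughout the cell cycle, whereas the sphere's ratio falls as the cell grows, which is the asymmetry the paragraph relies on.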

      why not all cells have lost rod shape and become spherical.

      Please see Kevin Young’s 2006 review on the adaptive significance of cell shape

      The value of this paper stems both from the insight it provides on the underlying molecular model for cell shape and from what it reveals about some key features of the evolutionary process. The paper, as it currently stands, provides more on which to chew for the molecular side than the evolutionary side. It provides valuable insights into the molecular architecture of how cells grow and what governs their shape. The evolutionary phenomena emphasized by the authors - the importance of loss-of-function mutations in driving rapid compensatory fitness gains and that multiple genetic and molecular routes to high fitness are often available, even in the relatively short time frame of a few hundred generations - are well understood phenomena and so arguably of less broad interest. The more compelling evolutionary questions concern the nature and cause of stabilizing selection (in this case cell volume) and the evolution of complexity. The paper misses an opportunity to highlight the former and, while claiming to shed light on the latter, provides rather little useful insight. 

      Thank you for these thoughts and comments. However, we disagree that the experimental results are an overlooked opportunity to discuss stabilising selection. Stabilising selection occurs when selection favours a particular phenotype causing a reduction in underpinning population-level genetic diversity. This is not happening when selection acts on SBW25 ∆mreB leading to a restoration of fitness. Driving the response are biophysical factors, primarily the critical need to balance elongation rate with rate of septation. This occurs without any change in underlying genetic diversity.  

      Recommendations for the authors:  

      Reviewer 1 (Recommendations for the Authors): 

      Hereby my suggestion for improvement of the quantification of the data, the figures, and the text. 

      -  p 14, what is the unit of elongation rate?  

      At first mention we have made clear that the unit is given in minutes^-1

      -  p 14, please give an error bar for both p=0.85 and f=0.77, to be able to conclude they are different 

      The error on the probability p is estimated at the 95% confidence level by the formula $1.96\sqrt{p(1-p)/N}$, where N is the total number of cells. This has been added to the paragraph on the probability p in the Image Analysis section of the Materials and Methods. 

      We also added errors on p measurement in the main text.
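      As a worked illustration of this error estimate, the sketch below assumes the standard normal-approximation interval for a proportion, with half-width 1.96 times the square root of p(1 − p)/N. The cell count N = 1000 is a hypothetical value, not taken from the paper, and the two proportions are used only as example inputs.

```python
import math


def proportion_ci_halfwidth(p, n, z=1.96):
    """Half-width of the normal-approximation confidence interval for a proportion.

    p: observed proportion, n: number of cells, z: 1.96 for a 95% interval.
    """
    return z * math.sqrt(p * (1.0 - p) / n)


# Hypothetical example: two proportions, each estimated from 1000 cells.
n_cells = 1000
for p in (0.85, 0.77):
    half = proportion_ci_halfwidth(p, n_cells)
    print(f"p = {p:.2f} +/- {half:.3f}  (95% CI: {p - half:.3f} to {p + half:.3f})")
```

      With counts of this order the two intervals do not overlap; with far fewer cells they would, which is why the reviewer asked for the error to be reported.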

      -  p 14, all the % differences need an errorbar 

      The error bars and means are given in Fig 3C and 3D.

      -  Figure 1B adds units to compactness, and what does it represent? Is the cell size the estimated volume (that is mentioned in the caption)? Shouldn't the datapoints have error bars? 

      Compactness is defined in the “Image Analysis” section of the Material and Methods. It is a dimensionless parameter. The distribution of individual cell shapes / sizes are depicted in Fig 1B. Error does arise from segmentation, but the degree of variance (few pixels) is much smaller than the representations of individual cells shown.

      -  Figure 1C caption, are the 50.000 cells? 

      Correct. Figure caption has been altered.

      -  Figure 1D, first the elongation rate is described as a volume per minute, but now, looking at the units it is a rate, how is it normalized? 

      Elongation rate is explained in the Materials and Methods (see the image analysis section) and is not volume per minute. It is dV/dt = r*V (the unit of r is min^-1). Page 9 includes specific mention of the unit of r.
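      Because dV/dt = r*V implies V(t) = V0 * exp(r*t), the rate r can be read off as the slope of ln(V) against time, which is why its unit is min^-1 regardless of the unit of volume. The following is a minimal sketch of that estimate by ordinary least squares. The time points, the starting volume and the rate used to generate the example are hypothetical, and this is not necessarily the fitting procedure used in the paper.

```python
import math


def elongation_rate(times_min, volumes):
    """Least-squares slope of ln(volume) versus time, in min^-1."""
    logs = [math.log(v) for v in volumes]
    n = len(times_min)
    t_bar = sum(times_min) / n
    y_bar = sum(logs) / n
    numerator = sum((t - t_bar) * (y - y_bar) for t, y in zip(times_min, logs))
    denominator = sum((t - t_bar) ** 2 for t in times_min)
    return numerator / denominator


# Hypothetical single-cell trace: noise-free exponential growth.
true_rate = 0.01  # min^-1, illustrative value only
times = [0, 5, 10, 15, 20, 25, 30]
volumes = [3.0 * math.exp(true_rate * t) for t in times]

estimated_rate = elongation_rate(times, volumes)
print(f"estimated r = {estimated_rate:.4f} min^-1")
```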

      -  Figure 1E, how many cells (n) per replicate? 

      Our apologies. We have corrected the figure caption that now reads:

      “Proportion of live cells in ancestral SBW25 (black bar) and ΔmreB (grey bar) based on LIVE/DEAD BacLight Bacterial Viability Kit protocol. Cells were pelleted at 2,000 x g for 2 minutes to preserve ΔmreB cell integrity. Error bars are means and standard deviation of three biological replicates (n>100).”

      -  Figure 1G, how does this compare to the wildtype 

      The volume for wild type SBW25 is 3.27µm^3 (within the “white zone”). This is mentioned in the text.

      -  Figure 2B, is this really volume, not size? And can you add microscopy images? 

      The x-axis is volume (see Materials and Methods, subsection image analysis). Images are available in Supp. Fig. 9.

      -  Figure 3A what does L1, L4 and L7 refer to? Is it correct that these same lines are picked for WT and delta_mreB 

      Thank you for pointing this out. This was an earlier nomenclature. It was shorthand for the mutants that are specified everywhere else by genotype and has now been corrected. 

      -  Figure 3c: either way write out p, so which probability, or you need a simple cartoon that is plotted. 

      The value p is the probability to proceed to the next generation and is explained in Materials and Methods  subsection image analysis.  We feel this is intuitive and does not require a cartoon. We nonetheless added a sentence to the Materials and Methods to aid clarity.

      -  Figure 4B can you add a ladder to the gel? 

      No ladder was included, but the controls provide all the necessary information. The band corresponding to PBP1A is defined by presence in SBW25, but absence in SBW25 ∆pbp1A.

      -  Figure 4c, can you improve the quantification of these images? How were these selected and how well do they represent the community? 

      We apologise for the lack of quantitative description for data presented in Fig 4C. This has now been corrected. In brief, we measured the intensity of fluorescent signal from between 10 and 14 cells and computed the mean and standard deviation of pixel intensity for each cell. To rule out possible artifacts associated with variation of the mean intensity, we calculated the ratio of the standard deviation divided by the square root of the mean. These data reveal heterogeneity in cell wall synthesis and provide strong statistical support for the claim that cell wall synthesis in ∆mreB is significantly more heterogeneous than the control. The data are provided in new Supp. Fig. 21. 
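      The heterogeneity measure described above can be sketched as follows: for each cell, the standard deviation of pixel intensity is divided by the square root of the mean intensity. The pixel values are invented for illustration, the function name is hypothetical, and the use of the population standard deviation is an assumption, since the text does not say which estimator was used.

```python
import math
from statistics import mean, pstdev


def heterogeneity_index(pixel_intensities):
    """Standard deviation of pixel intensity divided by sqrt(mean intensity)."""
    return pstdev(pixel_intensities) / math.sqrt(mean(pixel_intensities))


# Hypothetical cells: one with uniform labelling, one with patchy labelling.
uniform_cell = [100, 100, 100, 100]
patchy_cell = [50, 150, 50, 150]

print(heterogeneity_index(uniform_cell))
print(heterogeneity_index(patchy_cell))
```

      Dividing by the square root of the mean is a common way to discount the variance expected from photon-counting noise alone, which grows with the mean, so that brighter cells do not appear more heterogeneous simply because they are brighter.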

      Minor comments: 

      -  It would be interesting if the findings of this experimental evolution study could be related to comparative studies (if these have ever been executed).  

      Little is possible, but Hendrickson and Yulo published a portion of the originally posted preprint separately. We include a citation to that paper. 

      -  p 13, halfway through the page, the second paragraph lacks a conclusion, why do we care about DNA content? 

      It is a minor observation that was included by way of providing a complete description of cell phenotype.  

      -  p 17, "suggesting that ... loss-of-function", I do not understand what this is based upon. 

      We show that the fitness of a pbp1A deletion is indistinguishable from the fitness of one of the pbp1A point mutants. This fact establishes that the point mutation had the same effects as a gene deletion thus supporting the claim that the point mutations identified during the course of the selection experiment decrease (or destroy) PBP1A function.

      -  p 25, at the top of the page: do you have a reference for the statement that a disorganized cell wall architecture is suited to the topology of spherical cells? 

      The statement is a conclusion that comes from our reasoning. It stems from the fact that it is impossible to entirely map the surface of a sphere with parallel strands.

    1. Author Response

      The following is the authors’ response to the previous reviews.

      To the Senior Editor and the Reviewing Editor:

      We sincerely appreciate the valuable comments provided by the reviewers, the reviewing editor, and the senior editor. After carefully reviewing and considering the comments, we have addressed the key concerns raised by the reviewers and made appropriate modifications to the article in the revised manuscript.

      The main revisions made to the manuscript are as follows:

      1) We have added comparison experiments with TNDM (see Fig. 2 and Fig. S2).

      2) We conducted new synthetic experiments to demonstrate that our conclusions are not a by-product of d-VAE (see Fig. S2 and Fig. S11).

      3) We have provided a detailed explanation of how our proposed criteria, especially the second criterion, can effectively exclude the selection of unsuitable signals.

      4) We have included a semantic overview figure of d-VAE (Fig. S1) and a visualization plot of latent variables (Fig. S13).

      5) We have elaborated on the model details of d-VAE, as well as the hyperparameter selection and experimental settings of other comparison models.

      We believe these revisions have significantly improved the clarity and comprehensibility of the manuscript. Thank you for the opportunity to address these important points.

      Reviewer #1

      Q1: “First, the model in the paper is almost identical to an existing VAE model (TNDM) that makes use of weak supervision with behaviour in the same way [1]. This paper should at least be referenced. If the authors wish they could compare their model to TNDM, which combines a state space model with smoothing similar to LFADS. Given that TNDM achieves very good behaviour reconstructions, it may be on par with this model without the need for a Kalman filter (and hence may achieve better separation of behaviour-related and unrelated dynamics).”

      Our model significantly differs from TNDM in several aspects. While TNDM also constrains latent variables to decode behavioral information, it does not impose constraints to maximize behavioral information in the generated relevant signals. The trade-off between the decoding and reconstruction capabilities of generated relevant signals is the most significant contribution of our approach, which is not reflected in TNDM. In addition, the backbone network of signal extraction and the prior distribution of the two models are also different.

      It's worth noting that our method does not require a Kalman filter. Kalman filter is used for post hoc assessment of the linear decoding ability of the generated signals. Please note that extracting and evaluating relevant signals are two distinct stages.

      Heeding your suggestion, we have incorporated comparison experiments involving TNDM into the revised manuscript. Detailed information on model hyperparameters and training settings can be found in the Methods section in the revised manuscripts.

      Thank you for your valuable feedback.

      Q2: “Second, in my opinion, the claims regarding identifiability are overstated - this matters as the results depend on this to some extent. Recent work shows that VAEs generally suffer from identifiability problems due to the Gaussian latent space [2]. This paper also hints that weak supervision may help to resolve such issues, so this model as well as TNDM and CEBRA may indeed benefit from this. In addition however, it appears that the relative weight of the KL Divergence in the VAE objective is chosen very small compared to the likelihood (0.1%), so the influence of the prior is weak and the model may essentially learn the average neural trajectories while underestimating the noise in the latent variables. This, in turn, could mean that the model will not autoencode neural activity as well as it should, note that an average R2 in this case will still be high (I could not see how this is actually computed). At the same time, the behaviour R2 will be large simply because the different movement trajectories are very distinct. Since the paper makes claims about the roles of different neurons, it would be important to understand how well their single trial activities are reconstructed, which can perhaps best be investigated by comparing the Poisson likelihood (LFADS is a good baseline model). Taken together, while it certainly makes sense that well-tuned neurons contribute more to behaviour decoding, I worry that the very interesting claim that neurons with weak tuning contain behavioural signals is not well supported.”

      We don’t think our distilled signals are average neural trajectories without variability. The quality of reconstructing single trial activities can be observed in Figure 3i and Figure S4. Neural trajectories in Fig. 3i and Fig. S4 show that our distilled signals are not average neural trajectories. Furthermore, if each trial activity closely matched the average neural trajectory, the Fano Factor (FF) should theoretically approach 0. However, our distilled signals exhibit a notable departure from this expectation, as evident in Figure 3c, d, g, and f. Regarding the diminished influence of the KL divergence: Given that the ground truth of the latent variable distribution is unknown, even a learned prior distribution might not accurately reflect the true distribution. We found that a strongly weighted KL divergence term is detrimental to decoding and reconstruction performance. As a result, we opted to reduce the weight of the KL divergence term. Even so, the KL divergence can still effectively align the distribution of latent variables with the prior distribution, as illustrated in Fig. S13. Notably, our goal is to extract behaviorally-relevant signals from given raw signals rather than to generate diverse samples from the prior distribution. When the aim is to separate relevant signals, we recommend reducing the influence of the KL divergence. Regarding comparing the Poisson likelihood: We compared the Poisson log-likelihood among different methods (except PSID, since the signals it produces take negative values), and the results show that d-VAE outperforms other methods.

      Author response image 1.
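      For illustration only, the following is a minimal sketch of a mean Poisson log-likelihood, not the evaluation code behind the comparison above. The function name `poisson_log_likelihood` and the `eps` guard are hypothetical, and we assume the likelihood is averaged over all samples and neurons. The explicit check on negative rates shows why a method whose output contains negative values cannot be scored this way.

```python
import math

import numpy as np


def poisson_log_likelihood(counts, rates, eps=1e-9):
    """Mean Poisson log-likelihood of `counts` under predicted `rates`.

    Both arguments are arrays of the same shape (e.g. samples x neurons).
    `eps` is an assumed guard against log(0); it is not from the paper.
    """
    counts = np.asarray(counts, dtype=float)
    rates = np.asarray(rates, dtype=float)
    if np.any(rates < 0):
        # A Poisson rate must be non-negative, so negative outputs
        # cannot be evaluated with this likelihood.
        raise ValueError("Poisson rates must be non-negative")
    # log(k!) computed as lgamma(k + 1), which also accepts non-integer k
    log_factorial = np.vectorize(math.lgamma)(counts + 1.0)
    log_lik = counts * np.log(rates + eps) - rates - log_factorial
    return float(log_lik.mean())
```

      A higher (less negative) value indicates that the predicted rates explain the observed activity better.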

      Regarding how R2 is computed: $R^2 = 1 - \frac{\sum_i (x_i - \hat{x}_i)^2}{\sum_i (x_i - \bar{x})^2}$, where $x_i$, $\hat{x}_i$, and $\bar{x}$ denote the ith sample of the raw signals, the ith sample of the distilled relevant signals, and the mean of the raw signals, respectively. If the distilled signals exactly match the raw signals, the sum of squared errors is zero and thus R2=1. If the distilled signals always equal the mean of the raw signals, R2=0. If the distilled signals are worse than the mean estimate, R2 is negative; negative R2 values are set to zero.
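      The R2 definition above, including the clipping of negative values, can be sketched as follows. This is a minimal illustration rather than the analysis code; the function name `clipped_r2` is hypothetical, and we assume the score is computed over a single one-dimensional signal (whether scores are pooled across neurons is not specified here).

```python
import numpy as np


def clipped_r2(raw, distilled):
    """R2 between raw and distilled signals, with negative values set to 0."""
    raw = np.asarray(raw, dtype=float)
    distilled = np.asarray(distilled, dtype=float)
    ss_res = np.sum((raw - distilled) ** 2)   # sum of squared errors
    ss_tot = np.sum((raw - raw.mean()) ** 2)  # error of the mean estimate
    r2 = 1.0 - ss_res / ss_tot
    return max(0.0, float(r2))                # negative R2 is set to zero
```

      The three cases described in the text correspond to a perfect match (1), the mean estimate (0), and anything worse than the mean (clipped to 0).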

      Thank you for your valuable feedback.

      Q3: “Third, and relating to this issue, I could not entirely follow the reasoning in the section arguing that behavioural information can be inferred from neurons with weak selectivity, but that it is not linearly decodable. It is right to test if weak supervision signals bleed into the irrelevant subspace, but I could not follow the explanations. Why, for instance, is the ANN decoder on raw data (I assume this is a decoder trained fully supervised) not equal in performance to the relevant distilled signals? Should a well-trained non-linear decoder not simply yield a performance ceiling? Next, if I understand correctly, distilled signals were obtained from the full model. How does a model perform trained only on the weakly tuned neurons? Is it possible that the subspaces obtained with the model are just not optimally aligned for decoding? This could be a result of limited identifiability or model specifics that bias reconstruction to averages (a well-known problem of VAEs). I, therefore, think this analysis should be complemented with tests that do not depend on the model.”

      Regarding “Why, for instance, is the ANN decoder on raw data (I assume this is a decoder trained fully supervised) not equal in performance to the relevant distilled signals? Should a well-trained non-linear decoder not simply yield a performance ceiling?”: In fact, the decoding performance of raw signals with ANN is quite close to the ceiling. However, due to the presence of significant irrelevant signals in raw signals, decoding models like deep neural networks are more prone to overfitting when trained on noisy raw signals compared to behaviorally-relevant signals. Consequently, we anticipate that the distilled signals will demonstrate superior decoding generalization. This phenomenon is evident in Fig. 2 and Fig. S1, where the decoding performance of the distilled signals surpasses that of the raw signals, albeit not by a substantial margin.

      Regarding “Next, if I understand correctly, distilled signals were obtained from the full model. How does a model perform trained only on the weakly tuned neurons? Is it possible that the subspaces obtained with the model are just not optimally aligned for decoding?”: Distilled signals (involving all neurons) are obtained by d-VAE. Subsequently, we use an ANN to evaluate the decoding performance of the smaller and larger R2 neurons. Please note that separating and evaluating relevant signals are two distinct stages.

      Regarding the reasoning in the section arguing that smaller R2 neurons encode rich information, we would like to provide a detailed explanation:

      1) After extracting relevant signals through d-VAE, we specifically selected neurons characterized by smaller R2 values (Here, R2 signifies the proportion of neuronal activity variance explained by the linear encoding model, calculated using raw signals). Subsequently, we employed both KF and ANN to assess the decoding performance of these neurons. Remarkably, our findings revealed that smaller R2 neurons, previously believed to carry limited behavioral information, indeed encode rich information.

      2) In a subsequent step, we employed d-VAE to exclusively distill the raw signals of these smaller R2 neurons (distinct from the earlier experiment where d-VAE processed signals from all neurons). We then employed KF and ANN to evaluate the distilled smaller R2 neurons. Interestingly, we observed that we could not attain the same richness of information solely through the use of these smaller R2 neurons.

      3) Consequently, we put forth and tested two hypotheses: First, that larger R2 neurons introduce additional signals into the smaller R2 neurons that do not exist in the real smaller R2 neurons. Second, that larger R2 neurons aid in restoring the original appearance of impaired smaller R2 neurons. Our proposed criteria and synthetic experiments substantiate the latter scenario.

      Thank you for your valuable feedback.

      Q4: “Finally, a more technical issue to note is related to the choice to learn a non-parametric prior instead of using a conventional Gaussian prior. How is this implemented? Is just a single sample taken during a forward pass? I worry this may be insufficient as this would not sample the prior well, and some other strategy such as importance sampling may be required (unless the prior is not relevant as it weakly contributed to the ELBO, in which case this choice seems not very relevant). Generally, it would be useful to see visualisations of the latent variables to see how information about behaviour is represented by the model.”

      Regarding "how to implement the prior?": Please refer to Equation 7 in the revised manuscript; we have added detailed descriptions in the revised manuscript.

      Regarding "Generally, it would be useful to see visualizations of the latent variables to see how information about behavior is represented by the model.": Note that our focus is not on latent variables but on distilled relevant signals. Nonetheless, at your request, we have added the visualization of latent variables in the revised manuscript. Please see Fig. S13 for details.

      Thank you for your valuable feedback.

      Recommendations: “A minor point: the word 'distill' in the name of the model may be a little misleading - in machine learning the term refers to the construction of smaller models with the same capabilities.

      It should be useful to add a schematic picture of the model to ease comparison with related approaches.”

      In the context of our model's functions, it operates as a distillation process, eliminating irrelevant signals and retaining the relevant ones. Although the name of our model may be a little misleading, it faithfully reflects what our model does.

      We have added a schematic picture of d-VAE in the revised manuscript. Please see Fig. S1 for details.

      Thank you for your valuable feedback.

      Reviewer #2

      Q1: “Is the apparently increased complexity of encoding vs decoding so unexpected given the entropy, sparseness, and high dimensionality of neural signals (the "encoding") compared to the smoothness and low dimensionality of typical behavioural signals (the "decoding") recorded in neuroscience experiments? This is the title of the paper so it seems to be the main result on which the authors expect readers to focus. ”

      We use the term "unexpected" due to the disparity between our findings and the prior understanding concerning neural encoding and decoding. For neural encoding, as we said in the Introduction, in previous studies, weakly-tuned neurons are considered useless, and smaller variance PCs are considered noise, but we found they encode rich behavioral information. For neural decoding, the nonlinear decoding performance of raw signals is significantly superior to linear decoding. However, after eliminating the interference of irrelevant signals, we found the linear decoding performance is comparable to nonlinear decoding. Rooted in these findings, which counter previous thought, we employ the term "unexpected" to characterize our observations.

      Thank you for your valuable feedback.

      Q2: “I take issue with the premise that signals in the brain are "irrelevant" simply because they do not correlate with a fixed temporal lag with a particular behavioural feature hand-chosen by the experimenter. As an example, the presence of a reward signal in motor cortex [1] after the movement is likely to be of little use from the perspective of predicting kinematics from time-bin to time-bin using a fixed model across trials (the apparent definition of "relevant" for behaviour here), but an entire sub-field of neuroscience is dedicated to understanding the impact of these reward-related signals on future behaviour. Is there method sophisticated enough to see the behavioural "relevance" of this brief, transient, post-movement signal? This may just be an issue of semantics, and perhaps I read too much into the choice of words here. Perhaps the authors truly treat "irrelevant" and "without a fixed temporal correlation" as synonymous phrases and the issue is easily resolved with a clarifying parenthetical the first time the word "irrelevant" is used. But I remain troubled by some claims in the paper which lead me to believe that they read more deeply into the "irrelevancy" of these components.”

      In this paper, we employ terms like ‘behaviorally-relevant’ and ‘behaviorally-irrelevant’ only regarding behavioral variables of interest measured within a given task, such as arm kinematics during a motor control task. A similar definition can be found in the PSID[1].

      Thank you for your valuable feedback.

      [1] Sani, Omid G., et al. "Modeling behaviorally relevant neural dynamics enabled by preferential subspace identification." Nature Neuroscience 24.1 (2021): 140-149.

      Q3: “The authors claim the "irrelevant" responses underpin an unprecedented neuronal redundancy and reveal that movement behaviors are distributed in a higher-dimensional neural space than previously thought." Perhaps I just missed the logic, but I fail to see the evidence for this. The neural space is a fixed dimensionality based on the number of neurons. A more sparse and nonlinear distribution across this set of neurons may mean that linear methods such as PCA are not effective ways to approximate the dimensionality. But ultimately the behaviourally relevant signals seem quite low-dimensional in this paper even if they show some nonlinearity may help.”

      The evidence that the “useless” responses underpin an unprecedented neuronal redundancy is shown in Fig. 5a, d and Fig. S9a. Specifically, for relevant signals (red bar), the sum of the decoding performance of the smaller R2 neurons and the larger R2 neurons is significantly greater than the decoding performance of all neurons, demonstrating that movement parameters are encoded very redundantly in the neuronal population. In contrast, we cannot find this degree of neural redundancy in raw signals (purple bar).

      The evidence that the “useless” responses reveal that movement behaviors are distributed in a higher-dimensional neural space than previously thought is shown in the left plot (KF decoding) of Fig. 6c, f and Fig. S9f. Specifically, the improvement in KF decoding obtained with secondary signals is significantly higher than that obtained with raw signals composed of the same number of dimensions as the secondary signals. These results demonstrate that these dimensions, spanning roughly from ten to thirty, encode much information, suggesting that behavioral information exists in a higher-dimensional subspace than anticipated from raw signals.

      Thank you for your valuable feedback.

      Q5: “there is an apparent logical fallacy that begins in the abstract and persists in the paper: "Surprisingly, when incorporating often-ignored neural dimensions, behavioral information can be decoded linearly as accurately as nonlinear decoding, suggesting linear readout is performed in motor cortex." Don't get me wrong: the equivalency of linear and nonlinear decoding approaches on this dataset is interesting, and useful for neuroscientists in a practical sense. However, the paper expends much effort trying to make fundamental scientific claims that do not feel very strongly supported. This reviewer fails to see what we can learn about a set of neurons in the brain which are presumed to "read out" from motor cortex. These neurons will not have access to the data analyzed here. That a linear model can be conceived by an experimenter does not imply that the brain must use a linear model. The claim may be true, and it may well be that a linear readout is implemented in the brain. Other work [2,3] has shown that linear readouts of nonlinear neural activity patterns can explain some behavioural features. The claim in this paper, however, is not given enough”

      Due to the limitations of current observational methods and our incomplete understanding of brain mechanisms, it is indeed challenging to ascertain the specific data the brain acquires to generate behavior and whether it employs a linear readout. Conventionally, the neural data recorded in the motor cortex do encode movement behaviors and can be used to analyze neural encoding and decoding. Based on these data, we found that the linear decoder KF achieves comparable performance to that of the nonlinear decoder ANN on distilled relevant signals. This finding has undergone validation across three widely used datasets, providing substantial evidence. Furthermore, we conducted experiments on synthetic data to show that this conclusion is not a by-product of our model. In the revised manuscript, we added a more detailed description of this conclusion.

      Thank you for your valuable feedback.

      Q6: “Relatedly, I would like to note that the exercise of arbitrarily dividing a continuous distribution of a statistic (the "R2") based on an arbitrary threshold is a conceptually flawed exercise. The authors read too much into the fact that neurons which have a low R2 w.r.t. PDs have behavioural information w.r.t. other methods. To this reviewer, it speaks more about the irrelevance, so to speak, of the preferred direction metric than anything fundamental about the brain.”

      We chose the R2 threshold in accordance with the guidelines provided in reference [1]. It's worth mentioning that this threshold does not exert any significant influence on the overall conclusions.

      Thank you for your valuable feedback.

      [1] Inoue, Y., Mao, H., Suway, S.B., Orellana, J. and Schwartz, A.B., 2018. Decoding arm speed during reaching. Nature communications, 9(1), p.5243.

      Q7: “I am afraid I may be missing something, as I did not understand the fano factor analysis of Figure 3. In a sense the behaviourally relevant signals must have lower FF given they are in effect tied to the temporally smooth (and consistent on average across trials) behavioural covariates. The point of the original Churchland paper was to show that producing a behaviour squelches the variance; naturally these must appear in the behaviourally relevant components. A control distribution or reference of some type would possibly help here.”

      We agree that including reference signals could provide more context. The Churchland paper showed that stimulus onset can lead to a reduction in neural variability. However, our experiment focuses specifically on the reaching process, and thus we do not have comparative experiments involving different types of signals.
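      As a point of reference for the Fano factor discussion, the sketch below computes the FF per time bin as the across-trial variance divided by the across-trial mean. It is an illustration on synthetic data, not the analysis used for Figure 3; the function name `fano_factor`, the rate profile, and the trial count are all assumptions. It contrasts trials that are exact copies of an average trajectory (FF of 0) with Poisson-distributed trials (FF near 1).

```python
import numpy as np


def fano_factor(counts):
    """Fano factor per time bin for an array of shape (n_trials, n_bins)."""
    counts = np.asarray(counts, dtype=float)
    mean = counts.mean(axis=0)
    var = counts.var(axis=0, ddof=1)          # unbiased across-trial variance
    safe_mean = np.where(mean > 0, mean, 1.0)  # avoid division by zero
    return np.where(mean > 0, var / safe_mean, np.nan)


rng = np.random.default_rng(0)
rate = np.array([2.0, 5.0, 10.0, 5.0])         # hypothetical mean trajectory

identical_trials = np.tile(rate, (200, 1))      # every trial equals the average
poisson_trials = rng.poisson(rate, size=(200, 4))

ff_identical = fano_factor(identical_trials)    # all zeros
ff_poisson = fano_factor(poisson_trials)        # close to one in every bin
```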

      Thank you for your valuable feedback.

      Q8: “The authors compare the method to LFADS. While this is a reasonable benchmark as a prominent method in the field, LFADS does not attempt to solve the same problem as d-VAE. A better and much more fair comparison would be TNDM [4], an extension of LFADS which is designed to identify behaviourally relevant dimensions.”

      We have added the comparison experiments with TNDM in the revised manuscript (see Fig. 2 and Fig. S2). The details of model hyperparameters and training settings can be found in the Methods section of the revised manuscript.

      Thank you for your valuable feedback.

      Reviewer #3

      Q1.1: “TNDM: LFADS is not the best baseline for comparison. The authors should have compared with TNDM (Hurwitz et al. 2021), which is an extension of LFADS that (unlike LFADS) actually attempts to extract behaviorally relevant factors by adding a behavior term to the loss. The code for TNDM is also available on Github. LFADS is not even supervised by behavior and does not aim to address the problem that d-VAE aims to address, so it is not the most appropriate comparison. ”

      We have added the comparison experiments with TNDM in the revised manuscript (see Fig. 2 and Fig. S2). The details of model hyperparameters and training settings can be found in the Methods section of the revised manuscript.

      Thank you for your valuable feedback.

      Q1.2: “LFADS: LFADS is a sequential autoencoder that processes sections of data (e.g. trials). No explanation is given in Methods for how the data was passed to LFADS. Was the moving averaged smoothed data passed to LFADS or the raw spiking data (at what bin size)? Was a gaussian loss used or a poisson loss? What are the trial lengths used in each dataset, from which part of trials? For dataset C that has back-to-back reaches, was data chopped into segments? How long were these segments? Were the edges of segments overlapped and averaged as in (Keshtkaran et al. 2022) to avoid noisy segment edges or not? These are all critical details that are not explained. The same details would also be needed for a TNDM comparison (comment 1.1) since it has largely the same architecture as LFADS.

      It is also critical to briefly discuss these fundamental differences between the inputs of methods in the main text. LFADS uses a segment of data whereas VAE methods just use one sample at a time. What does this imply in the results? I guess as long as VAEs outperform LFADS it is ok, but if LFADS outperforms VAEs in a given metric, could it be because it received more data as input (a whole segment)? Why was the factor dimension set to 50? I presume it was to match the latent dimension of the VAE methods, but is the LFADS factor dimension the correct match for that to make things comparable?

      I am also surprised by the results. How do the authors justify LFADS having lower neural similarity (fig 2d) than VAE methods that operate on single time steps? LFADS is not supervised by behavior, so of course I don't expect it to necessarily outperform methods on behavior decoding. But all LFADS aims to do is to reconstruct the neural data so at least in this metric it should be able to outperform VAEs that just operate on single time steps? Is it because LFADS smooths the data too much? This is important to discuss and show examples of. These are all critical nuances that need to be discussed to validate the results and interpret them.”

      Regarding “Was the moving averaged smoothed data passed to LFADS or the raw spiking data (at what bin size)? Was a gaussian loss used or a poisson loss?”: All models received data that underwent the same preprocessing procedure, namely moving-average smoothing over three bins with a bin size of 100 ms. For all models except PSID, we used a Poisson loss.
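      A three-bin moving average of this kind can be sketched as below. This is only an illustration: we assume a trailing (causal) window that averages the current bin with the two preceding bins and treats bins before the start of the recording as zero, which is not specified in the text. The function name `moving_average` is hypothetical.

```python
import numpy as np


def moving_average(binned, window=3):
    """Trailing moving average over time.

    binned: array of shape (n_bins, n_neurons) of binned spike counts.
    Returns an array of the same shape.
    """
    binned = np.asarray(binned, dtype=float)
    kernel = np.ones(window) / window
    # Prepend zeros so that each output bin averages itself and the
    # (window - 1) preceding bins, and the output keeps the input length.
    padding = np.zeros((window - 1, binned.shape[1]))
    padded = np.vstack([padding, binned])
    smoothed = [
        np.convolve(padded[:, j], kernel, mode="valid")
        for j in range(binned.shape[1])
    ]
    return np.stack(smoothed, axis=1)
```

      With 100 ms bins, each smoothed sample under this assumption summarizes 300 ms of activity.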

      Regarding “What are the trial lengths used in each dataset, from which part of trials? For dataset C that has back-to-back reaches, was data chopped into segments? How long were these segments? Were the edges of segments overlapped and averaged as in (Keshtkaran et al. 2022) to avoid noisy segment edges or not?”:

      For datasets A and B, we set the trial length to eighteen. Trials shorter than this threshold are zero-padded, while longer trials are truncated to the threshold length from their starting point. In dataset A, several trials are considerably longer than most. We found that zero-padding all trials to the maximum length (32) led to poor performance. Consequently, we chose a trial length of eighteen, which covers the durations of most trials and removes approximately 9% of samples. For dataset B (center-out), the trial lengths vary little, and the maximum length across all trials is eighteen. For dataset C, we set the trial length to ten because we watched the video of this paradigm and found that completing a single trial took approximately one second. The segments do not overlap.
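      The padding and truncation rule described above can be sketched as follows. This is a minimal illustration, not the preprocessing code; the function name `fix_trial_length` is hypothetical, and we assume that zeros are appended at the end of short trials.

```python
import numpy as np


def fix_trial_length(trial, length=18):
    """Force a trial of shape (n_bins, n_neurons) to exactly `length` bins."""
    trial = np.asarray(trial, dtype=float)
    n_bins, n_neurons = trial.shape
    if n_bins >= length:
        # Keep the first `length` bins, counted from the start of the trial
        return trial[:length]
    # Append zero rows until the trial reaches the target length
    pad = np.zeros((length - n_bins, n_neurons))
    return np.vstack([trial, pad])
```

      For example, a 32-bin trial from dataset A would be cut to its first eighteen bins, and a 10-bin trial would receive eight rows of zeros.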

      Regarding “Why was the factor dimension set to 50? I presume it was to match the latent dimension of the VAE methods, but is the LFADS factor dimension the correct match for that to make things comparable?”: We performed a grid search for latent dimensions in {10,20,50} and found 50 is the best.

      Regarding “I am also surprised by the results. How do the authors justify LFADS having lower neural similarity (fig 2d) than VAE methods that operate on single time steps? LFADS is not supervised by behavior, so of course I don't expect it to necessarily outperform methods on behavior decoding. But all LFADS aims to do is to reconstruct the neural data so at least in this metric it should be able to outperform VAEs that just operate on single time steps? Is it because LFADS smooths the data too much?”: As you pointed out, we found that LFADS tends to produce excessively smooth and consistent data, which can lead to a reduction in neural similarity.

      Thank you for your valuable feedback.

      Q1.3: “PSID: PSID is linear and uses past input samples to predict the next sample in the output. Again, some setup choices are not well justified, and some details are left out in the 1-line explanation given in Methods.

      Why was a latent dimension of 6 chosen? Is this the behaviorally relevant latent dimension or the total latent dimension (for the use case here it would make sense to set all latent states to be behaviorally relevant)? Why was a horizon hyperparameter of 3 chosen? First, it is important to mention fundamental parameters such as latent dimension for each method in the main text (not just in methods) to make the results interpretable. Second, these hyperparameters should be chosen with a grid search in each dataset (within the training data, based on performance on the validation part of the training data), just as the authors do for their method (line 779). Given that PSID isn't a deep learning method, doing a thorough grid search in each fold should be quite feasible. It is important that high values for latent dimension and a wider range of other hyperparmeters are included in the search, because based on how well the residuals (x_i) for this method are shown predict behavior in Fig 2, the method seems to not have been used appropriately. I would expect ANN to improve decoding for PSID versus its KF decoding since PSID is fully linear, but I don't expect KF to be able to decode so well using the residuals of PSID if the method is used correctly to extract all behaviorally relevant information from neural data. The low neural reconstruction in Fid 2d could also partly be due to using too small of a latent dimension.

      Again, another import nuance is the input to this method and how differs with the input to VAE methods. The learned PSID model is a filter that operates on all past samples of input to predict the output in the "next" time step. To enable a fair comparison with VAE methods, the authors should make sure that the last sample "seen" by PSID is the same as then input sample seen by VAE methods. This is absolutely critical given how large the time steps are, otherwise PSID might underperform simply because it stopped receiving input 300ms earlier than the input received by VAE methods. To fix this, I think the authors can just shift the training and testing neural time series of PSID by 1 sample into the past (relative to the behavior), so that PSID's input would include the input of VAE methods. Otherwise, VAEs outperforming PSID is confounded by PSID's input not including the time step that was provided to VAE.”

      Thank you for suggesting that PSID be allowed to see the current neural observations. We implemented this and then performed a grid search over the PSID hyperparameters. Specifically, we searched the horizon hyperparameter over {2,3,4,5,6,7}. Since the relevant latent dimension cannot exceed the horizon times the dimension of the behavior variables (two-dimensional velocity in this paper), and since performance saturates as the dimension increases, we directly set the relevant latent dimension to this maximum. The selected horizons for datasets A, B, C, and the synthetic dataset are 7, 6, 6, and 5, respectively.

      Thus, the latent dimensions for datasets A, B, C, and the synthetic dataset are 14, 12, 12, and 10, respectively.
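      The arithmetic linking the selected horizons to the latent dimensions is simply the horizon multiplied by the behavior dimension (two, for two-dimensional velocity). The short sketch below reproduces the reported values; the variable names are illustrative and do not refer to the PSID library's API.

```python
# Behavior is two-dimensional velocity, as stated in the text
BEHAVIOR_DIM = 2

# Horizons selected by the grid search over {2, 3, 4, 5, 6, 7}
best_horizon = {"A": 7, "B": 6, "C": 6, "synthetic": 5}

# Relevant latent dimension set to its maximum: horizon x behavior dimension
latent_dim = {name: h * BEHAVIOR_DIM for name, h in best_horizon.items()}
```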

      Our experiments show that KF can decode information from irrelevant signals obtained by PSID. Although PSID extracts the linear part of raw signals, KF can still use the linear part of the residuals for decoding. The low reconstruction performance of PSID may be because the relationship between latent variables and neural signals is linear, and the relationship between latent variables and behaviors is also linear; this is equivalent to the linear relationship between behaviors and neural signals, and linear models can only explain a small fraction of neural signals.

      Thank you for your valuable feedback.

      Q1.4: “CEBRA: results for CEBRA are incomplete. Similarity to raw signals is not shown. Decoding of behaviorally irrelevant residuals for CEBRA is not shown. Per Fig. S2, CEBRA does better or similar ANN decoding in datasets A and C, is only slightly worse in Dataset B, so it is important to show the other key metrics otherwise it is unclear whether d-VAE has some tangible advantage over CEBRA in those 2 datasets or if they are similar in every metric. Finally, it would be better if the authors show the results for CEBRA on Fig. 2, just as is done for other methods because otherwise it is hard to compare all methods.”

      CEBRA is a non-generative model and therefore cannot generate behaviorally-relevant signals. For this reason, we compared only the decoding performance of CEBRA's latent embeddings with that of the signals from d-VAE.

      Thank you for your valuable feedback.

      Q2: “Given the fact that d-VAE infers the latent (z) based on the population activity (x), claims about properties of the inferred behaviorally relevant signals (x_r) that attribute properties to individual neurons are confounded.

      The authors contrast their approach to population level approaches in that it infers behaviorally relevant signals for individual neurons. However, d-VAE is also a population method as it aggregates population information to infer the latent (z), from which behaviorally relevant part of the activity of each neuron (x_r) is inferred. The authors note this population level aggregation of information as a benefit of d-VAE, but only acknowledge it as a confound briefly in the context of one of their analyses (line 340): "The first is that the larger R2 neurons leak their information to the smaller R2 neurons, causing them contain too much behavioral information". They go on to dismiss this confounding possibility by showing that the inferred behaviorally relevant signal of each neuron is often most similar to its own raw signals (line 348-352) compared with all other neurons. They also provide another argument specific to that result section (i.e., residuals are not very behavior predictive), which is not general so I won't discuss it in depth here. These arguments however do not change the basic fact that d-VAE aggregates information from other neurons when extracting the behaviorally relevant activity of any given neuron, something that the authors note as a benefit of d-VAE in many instances. The fact that d-VAE aggregates population level info to give the inferred behaviorally relevant signal for each neuron confounds several key conclusions. For example, because information is aggregated across neurons, when trial to trial variability looks smoother after applying d-VAE (Fig 3i), or reveals better cosine tuning (Fig 3b), or when neurons that were not very predictive of behavior become more predictive of behavior (Fig 5), one cannot really attribute the new smoother single trial activity or the improved decoding to the same single neurons; rather these new signals/performances include information from other neurons. 
Unless the connections of the encoder network (z=f(x)) is zero for all other neurons, one cannot claim that the inferred rates for the neuron are truly solely associated with that neuron. I believe this a fundamental property of a population level VAE, and simply makes the architecture unsuitable for claims regarding inherent properties of single neurons. This confound is partly why the first claim in the abstract are not supported by data: observing that neurons that don't predict behavior very well would predict it much better after applying d-VAE does not prove that these neurons themselves "encode rich[er] behavioral information in complex nonlinear ways" (i.e., the first conclusion highlighted in the abstract) because information was also aggregated from other neurons. The other reason why this claim is not supported by data is the characterization of the encoding for smaller R2 neurons as "complex nonlinear", which the method is not well equipped to tease apart from linear mappings as I explain in my comment 3.”

      We acknowledge that we cannot obtain exact single-neuron activity that is free of any information from other neurons. However, we believe our model can extract accurate approximations of the ground-truth relevant signals. These signals preserve the inherent properties of single neuronal activity to some extent and can be used for analysis at the single-neuron level.

      We believe d-VAE is a reasonable approach to extract effective relevant signals that preserve inherent properties of single neuronal activity for four key reasons:

      1) d-VAE is a latent variable model that adheres to the neural population doctrine. The neural population doctrine posits that information is encoded within interconnected groups of neurons, with the existence of latent variables (neural modes) responsible for generating observable neuronal activity [1, 2]. If we can perfectly obtain the true generative model from latent variables to neuronal activity, then we can generate the activity of each neuron from hidden variables without containing any information from other neurons. However, without a complete understanding of the brain’s encoding strategies (or generative model), we can only get the approximation signals of the ground truth signals.

      2) After the generative model is established, we need to infer its parameters and the distribution of the latent variables. This inference relies on algorithms such as variational inference or the EM algorithm. Generally, the inferred latent variables are themselves approximations of the true latent variables. Inferring the latent variables inevitably aggregates information across the neural population, since latent variables are derived through weighted combinations of neuronal populations [3].

      This inference process is consistent with that of d-VAE (or VAE-based models).

      3) Latent variables are derived from raw neural signals and used to explain raw neural signals. Considering the unknown ground truth of latent variables and behaviorally-relevant signals, it becomes evident that the only reliable reference at the signal level is the raw signals. A crucial criterion for evaluating the reliability of latent variable models (including latent variables and generated relevant signals) is their capability to effectively explain the raw signals [3]. Consequently, we firmly maintain the belief that if the generated signals closely resemble the raw signals to the greatest extent possible, in accordance with an equivalence principle, we can claim that these obtained signals faithfully retain the inherent properties of single neurons. d-VAE explicitly constrains the generated signal to closely resemble the raw signals. These results demonstrate that d-VAE can extract effective relevant signals that preserve inherent properties of single neuronal activity.

      Based on the above reasons, we hold that generating single neuronal activities with the VAE framework is a reasonable approach. The remaining question is whether our model can obtain accurate relevant signals in the absence of ground truth. To our knowledge, in cases where the ground truth of relevant signals is unknown, there are typically two approaches to verifying the reliability of extracted signals:

      1) Conducting synthetic experiments where the ground truth is known.

      2) Validation based on expert knowledge (Three criteria were proposed in this paper). Both our extracted signals and key conclusions have been validated using these two approaches.

      Next, we will provide a detailed response to the concerns regarding our first key conclusion that smaller R2 neurons encode rich information.

      We acknowledge that larger R2 neurons play a role in aiding the reconstruction of signals in smaller R2 neurons through their neural activity. However, considering that neurons are correlated rather than independent entities, we maintain the belief that larger R2 neurons assist damaged smaller R2 neurons in restoring their original appearance. Taking image denoising as an example, when restoring noisy pixels to their original appearance, relying solely on the noisy pixels themselves is often impractical. Assistance from their correlated, clean neighboring pixels becomes necessary.

      The case we need to be cautious of is that the larger R2 neurons introduce additional signals (m) that contain substantial information to smaller R2 neurons, which they do not inherently possess. We believe this case does not hold for two reasons. Firstly, logically, adding extra signals decreases the reconstruction performance, and the information carried by these additional signals is redundant for larger R2 neurons, thus they do not introduce new information that can enhance the decoding performance of the neural population. Therefore, it seems unlikely and unnecessary for neural networks to engage in such counterproductive actions. Secondly, even if this occurs, our second criterion can effectively exclude the selection of these signals. To clarify, if we assume that x, y, and z denote the raw, relevant, and irrelevant signals of smaller R2 neurons, with x=y+z, and the extracted relevant signals become y+m, the irrelevant signals become z-m in this case. Consequently, the irrelevant signals contain a significant amount of information. It's essential to emphasize that this criterion holds significant importance in excluding undesirable signals.
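      The x=y+z argument above can be illustrated with a small numerical sketch. This is a toy under stated assumptions, not the analysis used in the paper: the behavior is a one-dimensional Gaussian variable, the scaling factors 0.2 and 0.8 are arbitrary, and the helper linear_r2 is a hypothetical in-sample least-squares decoder. It only shows that if an extra behavior-carrying signal m is added to the extracted relevant signal, the residual z-m becomes informative about behavior, which is what the second criterion is designed to detect.

```python
import numpy as np


def linear_r2(signal, behavior):
    """In-sample R^2 of a least-squares linear decoder from one signal to behavior."""
    design = np.column_stack([signal, np.ones_like(signal)])
    coef, *_ = np.linalg.lstsq(design, behavior, rcond=None)
    resid = behavior - design @ coef
    return 1.0 - resid.var() / behavior.var()


rng = np.random.default_rng(0)
T = 5000
behavior = rng.standard_normal(T)

# Toy smaller-R2 neuron (all values are illustrative assumptions).
y = 0.2 * behavior           # true relevant signal
z = rng.standard_normal(T)   # true irrelevant signal, independent of behavior
x = y + z                    # raw signal

# Hypothetical extra signal m borrowed from larger-R2 neurons.
m = 0.8 * behavior
extracted_relevant = y + m

residual_clean = x - y                   # equals z
residual_leaky = x - extracted_relevant  # equals z - m

r2_clean = linear_r2(residual_clean, behavior)  # close to 0
r2_leaky = linear_r2(residual_leaky, behavior)  # clearly above 0

print(f"R^2 from clean residual: {r2_clean:.3f}")
print(f"R^2 from leaky residual: {r2_leaky:.3f}")
```

      In this toy setting the clean residual carries essentially no behavioral information, whereas the leaky residual does, so a check on the decodability of the residual would flag the second case.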

      Furthermore, we conducted a synthetic experiment to show that d-VAE can indeed restore the damaged information of smaller R2 neurons with the help of larger R2 neurons, and the restored neuronal activities are more similar to ground truth compared to damaged raw signals. Please see Fig. S11a,b for details.

      Thank you for your valuable feedback.

      [1] Saxena, S. and Cunningham, J.P., 2019. Towards the neural population doctrine. Current opinion in neurobiology, 55, pp.103-111.

      [2] Gallego, J.A., Perich, M.G., Miller, L.E. and Solla, S.A., 2017. Neural manifolds for the control of movement. Neuron, 94(5), pp.978-984.

      [3] Cunningham, J.P. and Yu, B.M., 2014. Dimensionality reduction for large-scale neural recordings. Nature neuroscience, 17(11), pp.1500-1509.

      Q3: “Given the nonlinear architecture of the VAE, claims about the linearity or nonlinearity of cortical readout are confounded and not supported by the results.

      The inference of behaviorally relevant signals from raw signals is a nonlinear operation, that is x_r=g(f(x)) is nonlinear function of x. So even when a linear KF is used to decode behavior from the inferred behaviorally relevant signals, the overall decoding from raw signals to predicted behavior (i.e., KF applied to g(f(x))) is nonlinear. Thus, the result that decoding of behavior from inferred behaviorally relevant signals (x_r) using a linear KF and a nonlinear ANN reaches similar accuracy (Fig 2), does not suggest that a "linear readout is performed in the motor cortex", as the authors claim (line 471). The authors acknowledge this confound (line 472) but fail to address it adequately. They perform a simulation analysis where the decoding gap between KF and ANN remains unchanged even when d-VAE is used to infer behaviorally relevant signals in the simulation. However, this analysis is not enough for "eliminating the doubt" regarding the confound. I'm sure the authors can also design simulations where the opposite happens and just like in the data, d-VAE can improve linear decoding to match ANN decoding. An adequate way to address this concern would be to use a fully linear version of the autoencoder where the f(.) and g(.) mappings are fully linear. They can simply replace these two networks in their model with affine mappings, redo the modeling and see if the model still helps the KF decoding accuracy reach that of the ANN decoding. In such a scenario, because the overall KF decoding from original raw signals to predicted behavior (linear d-VAE + KF) is linear, then they could move toward the claim that the readout is linear. Even though such a conclusion would still be impaired by the nonlinear reference (d-VAE + ANN decoding) because the achieved nonlinear decoding performance could always be limited by network design and fitting issues. 
      Overall, the third conclusion highlighted in the abstract is a very difficult claim to prove and is unfortunately not supported by the results.”

      We aim to explore the readout mechanism of behaviorally-relevant signals, rather than raw signals. Theoretically, the process of removing irrelevant signals should not be considered part of the inherent decoding mechanisms of the relevant signals. Assuming that the relevant signals we extracted are accurate, the conclusion of linear readout is established. On the synthetic data where the ground truth is known, our distilled signals show a significant improvement in neural similarity to the ground truth when compared to raw signals (refer to Fig. S2l). This observation demonstrates that our distilled signals are accurate approximations of the ground truth. Furthermore, on the three widely-used real datasets, our distilled signals meet the stringent criteria we have proposed (see Fig. 2), also providing strong evidence for their accuracy.

      Regarding the assertion that we could create simulations in which d-VAE turns signals that are inherently nonlinearly decodable into linearly decodable ones: in practice we cannot achieve this, because the second criterion rules out the selection of such signals. Specifically, let z=x+y=n^2+y, where z, x, y, and n denote the raw signals, relevant signals, irrelevant signals, and latent variables, respectively. If the relevant signals obtained by d-VAE were n, these signals could be linearly decoded accurately. However, the corresponding irrelevant signals would be z-n=n^2-n+y; these irrelevant signals would therefore contain a large amount of information, and the extracted relevant signals would not be selected. Furthermore, our synthetic experiments offer additional evidence supporting the conclusion that d-VAE does not make inherently nonlinearly decodable signals become linearly decodable ones. As depicted in Fig. S11c, there exists a significant performance gap between KF and ANN when decoding the ground truth signals of smaller R2 neurons. KF exhibits notably low performance, leaving substantial room for compensation by d-VAE. However, following processing by d-VAE, KF's performance on distilled signals fails to surpass its already low ground truth performance and remains significantly inferior to ANN's performance. These results collectively confirm that our approach does not convert signals that are inherently nonlinearly decodable into linearly decodable ones, and the conclusion of linear readout is not a by-product of d-VAE.
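      The z=n^2+y example can be made concrete with a short simulation. This is a minimal sketch under assumptions that are not in the original response: the behavior is taken to be the latent variable n itself, n is standard Gaussian, the noise scale 0.5 is arbitrary, and linear_r2 is a hypothetical in-sample least-squares decoder. It shows that if a model returned n as the "relevant" signal, the residual z-n would still be linearly informative about behavior.

```python
import numpy as np


def linear_r2(signal, target):
    """In-sample R^2 of a least-squares linear decoder from one signal to target."""
    design = np.column_stack([signal, np.ones_like(signal)])
    coef, *_ = np.linalg.lstsq(design, target, rcond=None)
    resid = target - design @ coef
    return 1.0 - resid.var() / target.var()


rng = np.random.default_rng(0)
T = 20000
n = rng.standard_normal(T)        # latent variable; behavior assumed equal to n
x = n ** 2                        # true relevant signal (nonlinear in n)
y = 0.5 * rng.standard_normal(T)  # true irrelevant signal
z = x + y                         # raw signal

# Candidate output that would make the signal linearly decodable.
candidate_relevant = n
candidate_residual = z - candidate_relevant  # n^2 - n + y

r2_true_relevant = linear_r2(x, n)               # close to 0: x is not linearly decodable
r2_candidate = linear_r2(candidate_relevant, n)  # close to 1
r2_residual = linear_r2(candidate_residual, n)   # clearly above 0

print(f"linear R^2 from true relevant signal x: {r2_true_relevant:.3f}")
print(f"linear R^2 from candidate signal n:     {r2_candidate:.3f}")
print(f"linear R^2 from candidate residual:     {r2_residual:.3f}")
```

      Under these assumptions the candidate residual retains behavioral information, so a criterion requiring uninformative residuals would reject the candidate.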

      Regarding the suggestion of using linear d-VAE + KF: as discussed in the Discussion section, removing the irrelevant signals requires a nonlinear operation, and a linear d-VAE cannot effectively separate relevant and irrelevant signals.

      Thank you for your valuable feedback.

      Q4: “The authors interpret several results as indications that "behavioral information is distributed in a higher-dimensional subspace than expected from raw signals", which is the second main conclusion highlighted in the abstract. However, several of these arguments do not convincingly support that conclusion.

      4.1) The authors observe that behaviorally relevant signals for neurons with small principal components (referred to as secondary) have worse decoding with KF but better decoding with ANN (Fig. 6b,e), which also outperforms ANN decoding from raw signals. This observation is taken to suggest that these secondary behaviorally relevant signals encode behavior information in highly nonlinear ways and in a higher dimensions neural space than expected (lines 424 and 428). These conclusions however are confounded by the fact that A) d-VAE uses nonlinear encoding, so one cannot conclude from ANN outperforming KF that behavior is encoded nonlinearly in the motor cortex (see comment 3 above), and B) d-VAE aggregates information across the population so one cannot conclude that these secondary neurons themselves had as much behavior information (see comment 2 above).

      4.2) The authors observe that the addition of the inferred behaviorally relevant signals for neurons with small principal components (referred to as secondary) improves the decoding of KF more than it improves the decoding of ANN (red curves in Fig 6c,f). This again is interpreted similarly as in 4.1, and is confounded for similar reasons (line 439): "These results demonstrate that irrelevant signals conceal the smaller variance PC signals, making their encoded information difficult to be linearly decoded, suggesting that behavioral information exists in a higher-dimensional subspace than anticipated from raw signals". This is confounded by because of the two reasons explained in 4.1. To conclude nonlinear encoding based on the difference in KF and ANN decoding, the authors would need to make the encoding/decoding in their VAE linear to have a fully linear decoder on one hand (with linear d-VAE + KF) and a nonlinear decoder on the other hand (with linear d-VAE + ANN), as explained in comment 3.

      4.3) From S Fig 8, where the authors compare cumulative variance of PCs for raw and inferred behaviorally relevant signals, the authors conclude that (line 554): "behaviorally-irrelevant signals can cause an overestimation of the neural dimensionality of behaviorally-relevant responses (Supplementary Fig. S8)." However, this analysis does not really say anything about overestimation of "behaviorally relevant" neural dimensionality since the comparison is done with the dimensionality of "raw" signals. The next sentence is ok though: "These findings highlight the need to filter out relevant signals when estimating the neural dimensionality.", because they use the phrase "neural dimensionality" not "neural dimensionality of behaviorally-relevant responses".”

      Questions 4.1 and 4.2 are a combination of Q2 and Q3. Please refer to our responses to Q2 and Q3.

      Regarding question 4.3 about “behaviorally-irrelevant signals can cause an overestimation of the neural dimensionality of behaviorally-relevant responses”: Previous studies usually used raw signals to estimate the neural dimensionality of specific behaviors. We mean that using raw signals, which include many irrelevant signals, will cause an overestimation of this neural dimensionality. We have modified this sentence in the revised manuscript.

      Thank you for your valuable feedback.

      Q5: “Imprecise use of language in many places leads to inaccurate statements. I will list some of these statements”

      5.1) In the abstract: "One solution is to accurately separate behaviorally-relevant and irrelevant signals, but this approach remains elusive due to the unknown ground truth of behaviorally-relevant signals". This statement is not accurate because it implies no prior work does this. The authors should make their statement more specific and also refer to some goal that existing linear (e.g., PSID) and nonlinear (e.g., TNDM) methods for extracting behaviorally relevant signals fail to achieve.

      5.2) In the abstract: "we found neural responses previously considered useless encode rich behavioral information" => what does "useless" mean operationally? Low behavior tuning? More precise use of language would be better.

      5.3) "... recent studies (Glaser 58 et al., 2020; Willsey et al., 2022) demonstrate nonlinear readout outperforms linear readout." => do these studies show that nonlinear "readout" outperforms linear "readout", or just that nonlinear models outperform linear models?

      5.4) Line 144: "The first criterion is that the decoding performance of the behaviorally-relevant signals (red bar, Fig.1) should surpass that of raw signals (the red dotted line, Fig.1).". Do the authors mean linear decoding here or decoding in general? If the latter, how can something extracted from neural surpass decoding of neural data, when the extraction itself can be thought of as part of decoding? The operational definition for this "decoding performance" should be clarified.

      5.5) Line 311: "we found that the dimensionality of primary subspace of raw signals (26, 64, and 45 for datasets A, B, and C) is significantly higher than that of behaviorally-relevant signals (7, 13, and 9), indicating that behaviorally-irrelevant signals lead to an overestimation of the neural dimensionality of behaviorally-relevant signals." => here the dimensionality of the total PC space (i.e., primary subspace of raw signals) is being compared with that of inferred behaviorally-relevant signals, so the former being higher does not indicate that neural dimensionality of behaviorally-relevant signals was overestimated. The former is simply not behavioral so this conclusion is not accurate.

      5.6) Section "Distilled behaviorally-relevant signals uncover that smaller R2 neurons encode rich behavioral information in complex nonlinear ways". Based on what kind of R2 are the neurons grouped? Behavior decoding R2 from raw signals? Using what mapping? Using KF? If KF is used, the result that small R2 neurons benefit a lot from d-VAE could be somewhat expected, given the nonlinearity of d-VAE: because only ANN would have the capacity to unwrap the nonlinear encoding of d-VAE as needed. If decoding performance that is used to group neurons is based on data, regression to the mean could also partially explain the result: the neurons with worst raw decoding are most likely to benefit from a change in decoder, than neurons that already had good decoding. In any case, the R2 used to partition and sort neurons should be more clearly stated and reminded throughout the text and I Fig 3.

      5.7) Line 346 "...it is impossible for our model to add the activity of larger R2 neurons to that of smaller R2 neurons" => Is it really impossible? The optimization can definitely add small-scale copies of behaviorally relevant information to all neurons with minimal increase in the overall optimization loss, so this statement seems inaccurate.

      5.8) Line 490: "we found that linear decoders can achieve comparable performance to that of nonlinear decoders, providing compelling evidence for the presence of linear readout in the motor cortex." => inaccurate because no d-VAE decoding is really linear, as explained in comment 3 above.

      5.9) Line 578: ". However, our results challenge this idea by showing that signals composed of smaller variance PCs nonlinearly encode a significant amount of behavioral information." => inaccurate as results are confounded by nonlinearity of d-VAE as explained in comment 3 above.

      5.10) Line 592: "By filtering out behaviorally-irrelevant signals, our study found that accurate decoding performance can be achieved through linear readout, suggesting that the motor cortex may perform linear readout to generate movement behaviors." => inaccurate because it us confounded by the nonlinearity of d-VAE as explained in comment 3 above.”

      Regarding “5.1) In the abstract: "One solution is to accurately separate behaviorally-relevant and irrelevant signals, but this approach remains elusive due to the unknown ground truth of behaviorally-relevant signals". This statement is not accurate because it implies no prior work does this. The authors should make their statement more specific and also refer to some goal that existing linear (e.g., PSID) and nonlinear (e.g., TNDM) methods for extracting behaviorally relevant signals fail to achieve”:

      We believe our statement is accurate. Our primary objective is to extract accurate behaviorally-relevant signals that closely approximate the ground truth relevant signals. To achieve this, we strike a balance between the reconstruction and decoding performance of the generated signals, aiming to effectively capture the relevant signals. This crucial aspect of our approach sets it apart from other methods. In contrast, other methods tend to emphasize the extraction of valuable latent neural dynamics. We have provided elaboration on the distinctions between d-VAE and other approaches in the Introduction and Discussion sections.

      Thank you for your valuable feedback.

      Regarding “5.2) In the abstract: "we found neural responses previously considered useless encode rich behavioral information" => what does "useless" mean operationally? Low behavior tuning? More precise use of language would be better.”:

      In the analysis of neural signals, smaller variance PC signals are typically seen as noise and are often discarded. Similarly, smaller R2 neurons are commonly thought to be dominated by noise and are not further analyzed. Given these considerations, we believe that the term "considered useless" is appropriate in this context. Thank you for your valuable feedback.

      Regarding “5.3) "... recent studies (Glaser 58 et al., 2020; Willsey et al., 2022) demonstrate nonlinear readout outperforms linear readout." => do these studies show that nonlinear "readout" outperforms linear "readout", or just that nonlinear models outperform linear models?”:

      In this paper, we consider the two statements to be equivalent. Thank you for your valuable feedback.

      Regarding “5.4) Line 144: "The first criterion is that the decoding performance of the behaviorally-relevant signals (red bar, Fig.1) should surpass that of raw signals (the red dotted line, Fig.1).". Do the authors mean linear decoding here or decoding in general? If the latter, how can something extracted from neural surpass decoding of neural data, when the extraction itself can be thought of as part of decoding? The operational definition for this "decoding performance" should be clarified.”:

      We mean the latter. As stated in the section “Framework for defining, extracting, and separating behaviorally-relevant signals”, raw signals contain many behaviorally-irrelevant signals, so deep neural networks are more prone to overfit raw signals than relevant signals. Therefore, the decoding performance of relevant signals should surpass that of raw signals. Thank you for your valuable feedback.

      Regarding “5.5) Line 311: "we found that the dimensionality of primary subspace of raw signals (26, 64, and 45 for datasets A, B, and C) is significantly higher than that of behaviorally-relevant signals (7, 13, and 9), indicating that behaviorally-irrelevant signals lead to an overestimation of the neural dimensionality of behaviorally-relevant signals." => here the dimensionality of the total PC space (i.e., primary subspace of raw signals) is being compared with that of inferred behaviorally-relevant signals, so the former being higher does not indicate that neural dimensionality of behaviorally-relevant signals was overestimated. The former is simply not behavioral so this conclusion is not accurate.”:

      In practice, researchers usually use raw signals to estimate the neural dimensionality. We mean that using raw signals to do this would overestimate the neural dimensionality. Thank you for your valuable feedback.

      Regarding “5.6) Section "Distilled behaviorally-relevant signals uncover that smaller R2 neurons encode rich behavioral information in complex nonlinear ways". Based on what kind of R2 are the neurons grouped? Behavior decoding R2 from raw signals? Using what mapping? Using KF? If KF is used, the result that small R2 neurons benefit a lot from d-VAE could be somewhat expected, given the nonlinearity of d-VAE: because only ANN would have the capacity to unwrap the nonlinear encoding of d-VAE as needed. If decoding performance that is used to group neurons is based on data, regression to the mean could also partially explain the result: the neurons with worst raw decoding are most likely to benefit from a change in decoder, than neurons that already had good decoding. In any case, the R2 used to partition and sort neurons should be more clearly stated and reminded throughout the text and I Fig 3.”:

      When employing R2 to characterize neurons, it indicates the extent to which neuronal activity is explained by the linear encoding model [1-3]. Smaller R2 neurons have a lower capacity for linearly tuning (encoding) behaviors, while larger R2 neurons have a higher capacity for linearly tuning (encoding) behaviors. Specifically, the approach involves first establishing an encoding relationship from velocity to neural signal using a linear model, i.e., y=f(x), where f represents a linear regression model, x denotes velocity, and y denotes the neural signal. Subsequently, R2 is utilized to quantify the effectiveness of the linear encoding model in explaining neural activity. We have provided a comprehensive explanation in the revised manuscript. Thank you for your valuable feedback.

      [1] Collinger, J.L., Wodlinger, B., Downey, J.E., Wang, W., Tyler-Kabara, E.C., Weber, D.J., McMorland, A.J., Velliste, M., Boninger, M.L. and Schwartz, A.B., 2013. High-performance neuroprosthetic control by an individual with tetraplegia. The Lancet, 381(9866), pp.557-564.

      [2] Wodlinger, B., et al. "Ten-dimensional anthropomorphic arm control in a human brain− machine interface: difficulties, solutions, and limitations." Journal of neural engineering 12.1 (2014): 016011.

      [3] Inoue, Y., Mao, H., Suway, S.B., Orellana, J. and Schwartz, A.B., 2018. Decoding arm speed during reaching. Nature communications, 9(1), p.5243.
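      The linear-encoding R2 described above (regress each neuron's activity on velocity, then measure explained variance) can be sketched as follows. This is an illustrative sketch, not the authors' code: the function name encoding_r2, the inclusion of an intercept term, the array shapes, and the synthetic data are all assumptions made here.

```python
import numpy as np


def encoding_r2(velocity, neural):
    """Per-neuron R^2 of a linear encoding model, neural ~ [velocity, 1] @ W.

    velocity: array of shape (T, d), e.g. d = 2 for x/y velocity.
    neural:   array of shape (T, N), one column per neuron.
    Returns an array of shape (N,).
    """
    design = np.column_stack([velocity, np.ones(len(velocity))])
    weights, *_ = np.linalg.lstsq(design, neural, rcond=None)
    resid = neural - design @ weights
    ss_res = (resid ** 2).sum(axis=0)
    ss_tot = ((neural - neural.mean(axis=0)) ** 2).sum(axis=0)
    return 1.0 - ss_res / ss_tot


# Synthetic demonstration (values are arbitrary, for illustration only).
rng = np.random.default_rng(0)
T = 2000
velocity = rng.standard_normal((T, 2))
strongly_tuned = velocity @ np.array([1.0, -0.5]) + 0.3 * rng.standard_normal(T)
weakly_tuned = 0.1 * velocity[:, 0] + rng.standard_normal(T)
neural = np.column_stack([strongly_tuned, weakly_tuned])

r2 = encoding_r2(velocity, neural)
order = np.argsort(r2)[::-1]  # neurons sorted from larger to smaller R^2
print(r2, order)
```

      Sorting neurons by this quantity gives the "larger R2" and "smaller R2" groups in the sense used above; in the toy data the first neuron falls in the larger-R2 group.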

      Regarding Questions 5.7, 5.8, 5.9, and 5.10:

      We believe our conclusions are solid. The reasons can be found in our replies to Q2 and Q3. Thank you for your valuable feedback.

      Q6: “Imprecise use of language also sometimes is not inaccurate but just makes the text hard to follow.

      6.1) Line 41: "about neural encoding and decoding mechanisms" => what is the definition of encoding/decoding and how do these differ? The definitions given much later in line 77-79 is also not clear.

      6.2) Line 323: remind the reader about what R2 is being discussed, e.g., R2 of decoding behavior using KF. It is critical to know if linear or nonlinear decoding is being discussed.

      6.3) Line 488: "we found that neural responses previously considered trivial encode rich behavioral information in complex nonlinear ways" => "trivial" in what sense? These phrases would benefit from more precision, for example: "neurons that may seem to have little or no behavior information encoded". The same imprecise word ("trivial") is also used in many other places, for example in the caption of Fig S9.

      6.4) Line 611: "The same should be true for the brain." => Too strong of a statement for an unsupported claim suggesting the brain does something along the lines of nonlin VAE + linear readout.

      6.5) In Fig 1, legend: what is the operational definition of "generating performance"? Generating what? Neural reconstruction?”

      Regarding “6.1) Line 41: "about neural encoding and decoding mechanisms" => what is the definition of encoding/decoding and how do these differ? The definitions given much later in line 77-79 is also not clear.”:

      We would like to provide a detailed explanation of neural encoding and decoding. Neural encoding refers to how neuronal activity encodes behaviors, that is, y=f(x), where y denotes neural activity, x denotes behaviors, and f is the encoding model. Neural decoding refers to how the brain decodes behaviors from neural activity, that is, x=g(y), where g is the decoding model. For further elaboration, please refer to [1]. We have included references that discuss the concepts of encoding and decoding in the revised manuscript. Thank you for your valuable feedback.

      [1] Kriegeskorte, Nikolaus, and Pamela K. Douglas. "Interpreting encoding and decoding models." Current opinion in neurobiology 55 (2019): 167-179.

      Regarding “6.2) Line 323: remind the reader about what R2 is being discussed, e.g., R2 of decoding behavior using KF. It is critical to know if linear or nonlinear decoding is being discussed.”:

      This question is the same as Q5.6. Please refer to the response to Q5.6. Thank you for your valuable feedback.

      Regarding “6.3) Line 488: "we found that neural responses previously considered trivial encode rich behavioral information in complex nonlinear ways" => "trivial" in what sense? These phrases would benefit from more precision, for example: "neurons that may seem to have little or no behavior information encoded". The same imprecise word ("trivial") is also used in many other places, for example in the caption of Fig S9.”:

      We have revised this statement in the revised manuscript. Thanks for your recommendation.

      Regarding “6.4) Line 611: "The same should be true for the brain." => Too strong of a statement for an unsupported claim suggesting the brain does something along the lines of nonlin VAE + linear readout.”:

      We mean that removing the interference of irrelevant signals and decoding the relevant signals should logically be two stages. We have revised this statement in the revised manuscript. Thank you for your valuable feedback.

      Regarding “6.5) In Fig 1, legend: what is the operational definition of "generating performance"? Generating what? Neural reconstruction?”:

      We have replaced “generating performance” with “reconstruction performance” in the revised manuscript. Thanks for your recommendation.

      Q7: “In the analysis presented starting in line 449, the authors compare improvement gained for decoding various speed ranges by adding secondary (small PC) neurons to the KF decoder (Fig S11). Why is this done using the KF decoder, when earlier results suggest an ANN decoder is needed for accurate decoding from these small PC neurons? It makes sense to use the more accurate nonlinear ANN decoder to support the fundamental claim made here, that smaller variance PCs are involved in regulating precise control”

      We used the KF because the enhancement in KF performance is substantial when the secondary signal is superimposed on the primary signal, and we wanted to explore which aspect of the behavior this improvement mainly reflects. In comparison, the improvement of the ANN by the secondary signal is very small, so the same analysis would be uninformative for the ANN. Thank you for your valuable feedback.

      Q8: “A key limitation of the VAE architecture is that it doesn't aggregate information over multiple time samples. This may be why the authors decided to use a very large bin size of 100ms and beyond that smooth the data with a moving average. This limitation should be clearly stated somewhere in contrast with methods that can aggregate information over time (e.g., TNDM, LFADS, PSID) ”

      We have added this limitation in the Discussion in the revised manuscript. Thanks for your recommendation.

      Q9: “Fig 5c and parts of the text explore the decoding when some neurons are dropped. These results should come with a reminder that dropping neurons from behaviorally relevant signals is not technically possible since the extraction of behaviorally relevant signals with d-VAE is a population level aggregation that requires the raw signal from all neurons as an input. This is also important to remind in some places in the text for example:

      • Line 498: "...when one of the neurons is destroyed."

      • Line 572: "In contrast, our results show that decoders maintain high performance on distilled signals even when many neurons drop out."”

      We want to explore the robustness of real relevant signals in the face of neuron drop-out. The signals our model extracted are an approximation of the ground truth relevant signals and thus serve as a substitute for ground truth to study this problem. Thank you for your valuable feedback.

      Q10: “Besides the confounded conclusions regarding the readout being linear (see comment 3 and items related to it in comment 5), the authors also don't adequately discuss prior works that suggest nonlinearity helps decoding of behavior from the motor cortex. Around line 594, a few works are discussed as support for the idea of a linear readout. This should be accompanied by a discussion of works that support a nonlinear encoding of behavior in the motor cortex, for example (Naufel et al. 2019; Glaser et al. 2020), some of which the authors cite elsewhere but don't discuss here.”

      We have added this discussion in the revised manuscript. Thanks for your recommendation.

      Q11: “Selection of hyperparameters is not clearly explained. Starting line 791, the authors give some explanation for one hyperparameter, but not others. How are the other hyperparameters determined? What is the search space for the grid search of each hyperparameter? Importantly, if hyperparameters are determined only based on the training data of each fold, why is only one value given for the hyperparameter selected in each dataset (line 814)? Did all 5 folds for each dataset happen to select exactly the same hyperparameter based on their 5 different training/validation data splits? That seems unlikely.”

      We performed a grid search over {0.001, 0.01, 0.1, 1} for the hyperparameter beta and found that 0.001 was the best value for all datasets. As for the other model parameters, such as the number of hidden neurons, the model capacity we used has already reached saturating decoding performance, so these parameters do not influence the results.

      Regarding “Importantly, if hyperparameters are determined only based on the training data of each fold, why is only one value given for the hyperparameter selected in each dataset (line 814)? Did all 5 folds for each dataset happen to select exactly the same hyperparameter based on their 5 different training/validation data splits”: We selected the hyperparameter based on performance on the validation sets averaged across the 5 folds. The reported value is the one that yields the highest average validation performance.
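      The selection rule described above (one value per dataset, chosen by validation performance averaged over folds) can be sketched as follows. This is a minimal illustration, not the authors' pipeline: the function select_by_mean_validation and the scoring function toy_evaluate are hypothetical, and toy_evaluate is only a stand-in for training d-VAE with a given beta on one fold and scoring it on that fold's validation set. Its values are not real results.

```python
from statistics import mean


def select_by_mean_validation(grid, n_folds, evaluate):
    """Pick the grid value with the highest validation score averaged over folds.

    evaluate(value, fold) must return the validation score for one fold.
    Returns (best_value, {value: mean_score}).
    """
    mean_scores = {
        value: mean(evaluate(value, fold) for fold in range(n_folds))
        for value in grid
    }
    best_value = max(mean_scores, key=mean_scores.get)
    return best_value, mean_scores


def toy_evaluate(beta, fold):
    """Stand-in scorer (hypothetical); replace with real per-fold training and validation."""
    return 1.0 / (1.0 + beta) - 0.01 * fold


beta_grid = [0.001, 0.01, 0.1, 1]
best_beta, scores = select_by_mean_validation(beta_grid, n_folds=5, evaluate=toy_evaluate)
print(best_beta, scores)
```

      Because the scores are averaged before the maximum is taken, this procedure returns a single value per dataset even if individual folds would have preferred different values.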

      Thank you for your valuable feedback.

      Q12: “d-VAE itself should also be explained more clearly in the main text. Currently, only the high-level idea of the objective is explained. The explanation should be more precise and include the idea of encoding to latent state, explain the relation to pip-VAE, explain inputs and outputs, linearity/nonlinearity of various mappings, etc. Also see comment 1 above, where I suggest adding more details about other methods in the main text.”

      Our primary objective is to delve into the encoding and decoding mechanisms using the separated relevant signals. Therefore, providing an excessive amount of model details could potentially distract from the main focus of the paper. In response to your suggestion, we have included a visual representation of d-VAE's structure, input, and output (see Fig. S1) in the revised manuscript, which offers a comprehensive and intuitive overview. Additionally, we have expanded on the details of d-VAE and other methods in the Methods section.

      Thank you for your valuable feedback.

      Q13: “In Fig 1f and g, shouldn't the performance plots be swapped? The current plots seem counterintuitive. If there is bias toward decoding (panel g), why is the irrelevant residual so good at decoding?”

      The placement of the performance plots in Fig. 1f and 1g is accurate. When the model exhibits a bias toward decoding, it prioritizes extracting the most relevant features (latent variables) for decoding purposes. As a consequence, the model predominantly generates signals that are closely associated with these extracted features. This selective signal extraction and generation process may result in the exclusion of other potentially useful information, which will be left in the residuals. To illustrate this concept, consider the example of face recognition: if a model can accurately identify an individual using only the person's eyes (assuming these are the most useful features), other valuable information, such as details of the nose or mouth, will be left in the residuals, which could also be used to identify the individual.

      Thank you for your valuable feedback.

    1. Author Response:

      The following is the authors’ response to the previous reviews.

      We carefully read through the second-round reviews and the additional reviews. To us, the review process is somewhat unusual and very much dominated by referee 2, who aggressively insists that we mixed up the trigeminal nucleus and inferior olive and that as a consequence our results are meaningless. We think the stance of referee 2 and the focus on one single issue (the alleged mix-up of trigeminal nucleus and inferior olive) is somewhat unfortunate and leaves out much of our findings, and we debated at length how to deal with further revisions. In the end, we decided to again give priority to addressing the criticism of referee 2, because it is hard to go on with a heavily attacked paper without resolving the matter at stake. The following is a summary of what we did:

      Additional experimental work:

      (1) We checked if the peripherin-antibody indeed reliably identifies climbing fibers.

      To this end, we sectioned the elephant cerebellum and stained sections with the peripherin-antibody. We find: (i) The cerebellar white matter is strongly reactive for peripherin-antibodies. (ii) Cerebellar peripherin-antibody staining has an axonal appearance. (iii) Cerebellar Purkinje cell somata appear to be ensheathed by peripherin-antibody staining. (iv) The peripherin-antibody reactivity gradually decreases from Purkinje cell somata to the pia in the cerebellar molecular layer. This work is shown in our revised Figure 2. All four features align with the distribution of climbing fibers (which arrive through the white matter, are axons, ensheathe Purkinje cell somata, and innervate Purkinje cells proximally, not reaching the pia). In line with previous work, which showed similar cerebellar staining patterns in several species (Errante et al. 1998), we conclude that elephant climbing fibers are strongly reactive for peripherin-antibodies.

      (2) We delineated the elephant olivo-cerebellar tract.

      The strong peripherin-antibody reactivity of elephant climbing fibers enabled us to delineate the elephant olivo-cerebellar tract. We find the elephant olivo-cerebellar tract is a strongly peripherin-antibody reactive, well-delineated fiber tract several millimeters wide and about a centimeter in height. The unstained olivo-cerebellar tract has a greyish appearance. In the anterior regions of the olivo-cerebellar tract, we find that peripherin-antibody reactive fibers run in the dorsolateral brainstem and approach the cerebellar peduncle, where the tract gradually diminishes in size, presumably because climbing fibers discharge into the peduncle. Indeed, peripherin-antibody reactive fibers can be seen entering the cerebellar peduncle. Towards the posterior end of the peduncle, the olivo-cerebellar tract disappears (in the dorsal brainstem directly below the peduncle). We note that the olivo-cerebellar tract was referred to as the spinal trigeminal tract by Maseko et al. 2013. We think the tract in question cannot be the spinal trigeminal tract for two reasons: (i) This tract is the sole brainstem source of peripherin-positive climbing fibers entering the peduncle/the cerebellum; this is the defining characteristic of the olivo-cerebellar tract. (ii) The tract in question is much smaller than the trigeminal nerve, disappears posterior to where the trigeminal nerve enters the brainstem (see below), and has no continuity with the trigeminal nerve; continuity with the trigeminal nerve, however, is the defining characteristic of the spinal trigeminal tract.

      The anterior regions of the elephant olivo-cerebellar tract are similar to the anterior regions of olivo-cerebellar tract of other mammals in its dorsolateral position and the relation to the cerebellar peduncle. In its more posterior parts, the elephant olivo-cerebellar tract continues for a long distance (~1.5 cm) in roughly the same dorsolateral position and enters the serrated nucleus that we previously identified as the elephant inferior olive. The more posterior parts of the elephant olivo-cerebellar tract therefore differ from the more posterior parts of the olivo-cerebellar tract of other mammals, which follows a ventromedial trajectory towards a ventromedially situated inferior olive. The implication of our delineation of the elephant olivo-cerebellar tract is that we correctly identified the elephant inferior olive.

      (3) An in-depth analysis of peripherin-antibody reactivity also indicates that the trigeminal nucleus receives no climbing fiber input.

      We also studied the peripherin-antibody reactivity in and around the trigeminal nucleus. We had already noted in the previous submission that the trigeminal nucleus is weakly positive for peripherin, but that the staining pattern is uniform and not the type of axon bundle pattern that is seen in the inferior olive of other mammals. To us, this observation already argued against the presence of climbing fibers in the trigeminal nucleus. We also noted that the myelin stripes of the trigeminal nucleus were peripherin-antibody-negative. In the context of our olivo-cerebellar tract tracing we now also scrutinized the surroundings of the trigeminal nucleus for peripherin-antibody reactivity. We find that the ventral brainstem surrounding the trigeminal nucleus is devoid of peripherin-antibody reactivity. Accordingly, no climbing fibers (which we have shown to be strongly peripherin-antibody-positive, see our point 1) arrive at the trigeminal nucleus. The absence of climbing fiber input indicates that previous work that identified the (trigeminal) nucleus as the inferior olive (Maseko et al. 2013) is unlikely to be correct.

      (4) We characterized the entry of the trigeminal nerve into the elephant brain.

      To better understand how trigeminal information enters the elephant’s brain, we characterized the entry of the trigeminal nerve. This analysis indicated to us that the trigeminal nerve is not continuous with the olivo-cerebellar tract (the spinal trigeminal tract of Maseko et al. 2013) as previously claimed by Maseko et al. 2013. We show some of this evidence in Referee-Figure 1 below. The reason we think the trigeminal nerve is discontinuous with the olivo-cerebellar tract is the size discrepancy between the two structures. We first show this for the tracing data of Maseko et al. 2013. In the Maseko et al. 2013 data the trigeminal nerve (Referee-Figure 1A, their plate Y) has 3-4 times the diameter of the olivo-cerebellar tract (the alleged spinal trigeminal tract, Referee-Figure 1B, their plate Z). Note that most if not all trigeminal fibers are thought to continue from the nerve into the trigeminal tract (see our rat data below). We plotted the diameter of the trigeminal nerve and the diameter of the olivo-cerebellar tract (the spinal trigeminal tract according to Maseko et al. 2013) from the Maseko et al. 2013 data (Referee-Figure 1C) and found that the olivo-cerebellar tract has a fairly consistent diameter (46 ± 9 mm², mean ± SD). Statistical considerations and anatomical evidence suggest that the tracing of the trigeminal nerve into the olivo-cerebellar tract (the spinal trigeminal tract according to Maseko et al. 2013) is almost certainly wrong. The most anterior point of the alleged spinal trigeminal tract has a diameter of 51 mm², which is more than 15 standard deviations different from the most posterior diameter (194 mm²) of the trigeminal tract. For this assignment to be correct, three-quarters of trigeminal nerve fibers would have to spontaneously disappear, something that does not happen in the brain. 
We also made similar observations in the African elephant Bibi, where the trigeminal nerve (Referee-Figure 1D) is much larger in diameter than the olivo-cerebellar tract (Referee-Figure 1E). We could also show that the olivo-cerebellar tract disappears into the peduncle posterior to where the trigeminal nerve enters (Referee-Figure 1F). Our data are very similar to those of Maseko et al., indicating that their outlining of structures was done correctly. What appears to have been oversimplified is the assignment of structures as continuous. We also quantified the diameter of the trigeminal nerve and the spinal trigeminal tract in rats (from the Paxinos & Watson atlas; Referee-Figure 1G); as expected, we found that the trigeminal nerve and spinal trigeminal tract diameters are essentially continuous.
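
      The size argument above can be checked with a few lines of arithmetic. The sketch below uses only the values quoted in this response (46 ± 9, 51, and 194, in the units reported there); the variable names are illustrative, and treating the size ratio as a proxy for the fraction of fibers that would have to vanish is an assumption made for the sake of the example.

```python
# Values quoted in the response (replotted from Maseko et al. 2013).
TRACT_MEAN = 46.0        # mean size of the olivo-cerebellar tract
TRACT_SD = 9.0           # standard deviation of the tract size
TRACT_ANTERIOR = 51.0    # most anterior point of the alleged spinal trigeminal tract
NERVE_POSTERIOR = 194.0  # most posterior measurement of the trigeminal nerve

# Distance between the two adjacent measurements, in units of the tract SD.
sd_distance = (NERVE_POSTERIOR - TRACT_ANTERIOR) / TRACT_SD

# Assumption: fiber count scales with the measured size, so the relative
# drop in size approximates the fraction of fibers that would have to vanish.
fraction_lost = 1.0 - TRACT_ANTERIOR / NERVE_POSTERIOR

print(f"difference: {sd_distance:.1f} SD")
print(f"implied fiber loss: {fraction_lost:.0%}")
```

      Under these assumptions the jump between the two measurements is close to 16 standard deviations of the tract size, and roughly three-quarters of the nerve's size would have to be lost at the transition, consistent with the figures stated in the text.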

      In our hands, the trigeminal nerve does not continue into a well-defined tract that could be traced after its entry. In this regard, it differs both from the olivo-cerebellar tract of the elephant or the spinal trigeminal tract of the rodent, both of which are well delineated. We think the absence of a well-delineated spinal trigeminal tract in elephants might have contributed to the putative tracing error highlighted in our Referee-Figure 1A-C.

      We conclude that a size mismatch indicates trigeminal fibers do not run in the olivo-cerebellar tract (the spinal trigeminal tract according to Maseko et al. 2013).

      Author response image 1.

      The trigeminal nerve is discontinuous with the olivo-cerebellar tract (the spinal trigeminal tract according to Maseko et al. 2013). A, Trigeminal nerve (orange) in the brain of African elephant LAX as delineated by Maseko et al. 2013 (coronal section; their plate Y). B, Most anterior appearance of the spinal trigeminal tract of Maseko et al. 2013 (blue; coronal section; their plate Z). Note the much smaller diameter of the spinal trigeminal tract compared to the trigeminal nerve shown in A, which argues against the continuity of the two structures. Indeed, our peripherin-antibody staining showed that the spinal trigeminal tract of Maseko corresponds to the olivo-cerebellar tract and is discontinuous with the trigeminal nerve. C, Plot of the diameter of the trigeminal nerve and the olivo-cerebellar tract (the spinal trigeminal tract according to Maseko et al. 2013) along the anterior-posterior axis. The trigeminal nerve is much larger in diameter than the olivo-cerebellar tract (the spinal trigeminal tract according to Maseko et al. 2013). A, B: measurements for which sections are shown in panels A and B, respectively. The olivo-cerebellar tract (the spinal trigeminal tract according to Maseko et al. 2013) has a consistent diameter; data replotted from Maseko et al. 2013. At mm 25 the inferior olive appears. D, Trigeminal nerve entry in the brain of African elephant Bibi; our data, coronal section; the trigeminal nerve is outlined in orange, note the large diameter. E, Most anterior appearance of the olivo-cerebellar tract in the brain of African elephant Bibi; our data, coronal section, approximately 3 mm posterior to the section shown in D; the olivo-cerebellar tract is outlined in blue. Note the smaller diameter of the olivo-cerebellar tract compared to the trigeminal nerve, which argues against the continuity of the two structures. F, Plot of the trigeminal nerve and olivo-cerebellar tract diameter along the anterior-posterior axis. 
The nerve and olivo-cerebellar tract are discontinuous and the trigeminal nerve is much larger in diameter than the olivocerebellar tract (the spinal trigeminal tract according to Maseko et al. 2013); our data. D, E measurements, for which sections are shown in panels D and E respectively. At mm 27 the inferior olive appears. G, In the rat the trigeminal nerve is continuous in size with the spinal trigeminal tract. Data replotted from Paxinos and Watson.

      Reviewer 2 (Public Review):

      As indicated in my previous review of this manuscript (see above), it is my opinion that the authors have misidentified, and indeed switched, the inferior olivary nuclear complex (IO) and the trigeminal nuclear complex (Vsens). It is this specific point only that I will address in this second review, as this is the crucial aspect of this paper - if the identification of these nuclear complexes in the elephant brainstem by the authors is incorrect, the remainder of the paper does not have any scientific validity.

      Comment: We agree with the referee that it is most important to sort out the identities of the inferior olivary nuclear complex (IO) and the trigeminal nuclear complex.

      Change: We did additional experimental work to resolve this matter as detailed at the beginning of our response. Specifically, we ascertained that elephant climbing fibers are strongly peripherin-positive. Based on elephant climbing fiber peripherin-reactivity we delineated the elephant olivo-cerebellar tract. We find that the olivo-cerebellar tract connects the structure we refer to as the inferior olive to the cerebellum (the referee refers to this structure as the trigeminal nuclear complex). We also found that the trigeminal nucleus (the structure the referee refers to as the inferior olive) appears to receive no climbing fibers. We provide indications that the tracing of the trigeminal nerve into the olivo-cerebellar tract by Maseko et al. 2013 was erroneous (Author response image 1). These novel findings support our ideas but are very difficult to reconcile with the referee’s partitioning scheme.

      The authors, in their response to my initial review, claim that I "bend" the comparative evidence against them. They further claim that as all other mammalian species exhibit a "serrated" appearance of the inferior olive, and as the elephant does not exhibit this appearance, that what was previously identified as the inferior olive is actually the trigeminal nucleus and vice versa. 

      For convenience, I will refer to IOM and VsensM as the identification of these structures according to Maseko et al (2013) and other authors and will use IOR and VsensR to refer to the identification forwarded in the study under review.

      The IOM/VsensR certainly does not have a serrated appearance in elephants. Indeed, from the plates supplied by the authors in response (Referee Fig. 2), the cytochrome oxidase image supplied and the image from Maseko et al (2013) show a very similar appearance. There is no doubt that the authors are identifying structures that closely correspond to those provided by Maseko et al (2013). It is solely a contrast in what these nuclear complexes are called and the functional sequelae of the identification of these complexes (are they related to the trunk sensation or movement controlled by the cerebellum?) that is under debate.

      Elephants are part of the Afrotheria, thus the most relevant comparative data to resolve this issue will be the identification of these nuclei in other Afrotherian species. Below I provide images of these nuclear complexes, labelled in the standard nomenclature, across several Afrotherian species. 

      (A) Lesser hedgehog tenrec (Echinops telfairi) 

      Tenrec brains are the most intensively studied of the Afrotherian brains, these extensive neuroanatomical studies undertaken primarily by Heinz Künzle. Below I append images (coronal sections stained with cresyl violet) of the IO and Vsens (labelled in the standard mammalian manner) in the lesser hedgehog tenrec. It should be clear that the inferior olive is located in the ventral midline of the rostral medulla oblongata (just like the rat) and that this nucleus is not distinctly serrated. The Vsens is located in the lateral aspect of the medulla skirted laterally by the spinal trigeminal tract (Sp5). These images and the labels indicating structures correlate precisely with those provided by Künzle (1997, 10.1016, see his Figure 1K,L). Thus, in the first case of a related species, there is no serrated appearance of the inferior olive, the location of the inferior olive is confirmed through connectivity with the superior colliculus (a standard connection in mammals) by Künzle (1997), and the location of Vsens is what is considered to be typical for mammals. This is in agreement with the authors, as they propose that ONLY the elephants show the variations they report. 

      (B) Giant otter shrew (Potamogale velox) 

      The otter shrews are close relatives of the Tenrecs. Below I append images of cresyl violet (left column) and myelin (right column) stained coronal sections through the brainstem with the IO, Vsens and Sp5 labelled as per standard mammalian anatomy. Here we see hints of the serration of the IO as defined by the authors, but we also see many myelin stripes across the IO. Vsens is located laterally and skirted by the Sp5. This is in agreement with the authors, as they propose that ONLY the elephants show the variations they report.

      (C) Four-toed sengi (Petrodromus tetradactylus) 

      The sengis are close relatives of the Tenrecs and otter shrews, these three groups being part of the Afroinsectiphilia, a distinct branch of the Afrotheria. Below I append images of cresyl violet (left column) and myelin (right column) stained coronal sections through the brainstem with the IO, Vsens and Sp5 labelled as per standard mammalian anatomy. Here we see vague hints of the serration of the IO (as defined by the authors), and we also see many myelin stripes across the IO. Vsens is located laterally and skirted by the Sp5. This is in agreement with the authors, as they propose that ONLY the elephants show the variations they report. 

      (D) Rock hyrax (Procavia capensis) 

      The hyraxes, along with the sirens and elephants form the Paenungulata branch of the Afrotheria. Below I append images of cresyl violet (left column) and myelin (right column) stained coronal sections through the brainstem with the IO, Vsens and Sp5 labelled as per the standard mammalian anatomy. Here we see hints of the serration of the IO (as defined by the authors), but we also see evidence of a more "bulbous" appearance of subnuclei of the IO (particularly the principal nucleus), and we also see many myelin stripes across the IO. Vsens is located laterally and skirted by the Sp5. This is in agreement with the authors, as they propose that ONLY the elephants show the variations they report. 

      (E) West Indian manatee (Trichechus manatus) 

      The sirens are the closest extant relatives of the elephants in the Afrotheria. Below I append images of cresyl violet (top) and myelin (bottom) stained coronal sections (taken from the University of Wisconsin-Madison Brain Collection, https://brainmuseum.org, and while quite low in magnification they do reveal the structures under debate) through the brainstem with the IO, Vsens and Sp5 labelled as per standard mammalian anatomy. Here we see the serration of the IO (as defined by the authors). Vsens is located laterally and skirted by the Sp5. This is in agreement with the authors, as they propose that ONLY the elephants show the variations they report.

      These comparisons and the structural identification, with which the authors agree as they only distinguish the elephants from the other Afrotheria, demonstrate that the appearance of the IO can be quite variable across mammalian species, including those with a close phylogenetic affinity to the elephants. Not all mammal species possess a "serrated" appearance of the IO. Thus, it is more than just theoretically possible that the IO of the elephant appears as described prior to this study. 

      So what about elephants? Below I append a series of images from coronal sections through the African elephant brainstem stained for Nissl, myelin, and immunostained for calretinin. These sections are labelled according to standard mammalian nomenclature. In these complete sections of the elephant brainstem, we do not see a serrated appearance of the IOM (as described previously and in the current study by the authors). Rather the principal nucleus of the IOM appears to be bulbous in nature. In the current study, no image of myelin staining in the IOM/VsensR is provided by the authors. However, in the images I provide, we do see the reported myelin stripes in all stains - agreement between the authors and reviewer on this point. The higher magnification image to the bottom left of the plate shows one of the IOM/VsensR myelin stripes immunostained for calretinin, and within the myelin stripes axons immunopositive for calretinin are seen (labelled with an arrow). The climbing fibres of the elephant cerebellar cortex are similarly calretinin immunopositive (10.1159/000345565). In contrast, although not shown at high magnification, the fibres forming the Sp5 in the elephant (in the Maseko description, unnamed in the description of the authors) show no immunoreactivity to calretinin. 

      Comment: We appreciate the referee’s additional comments. We concede the possibility that some relatives of elephants have a less serrated inferior olive than most other mammals. We maintain, however, that the elephant inferior olive (our Figure 1J) has the serrated appearance seen in the vast majority of mammals.

      Change: None.

      Peripherin Immunostaining 

      In their revised manuscript the authors present immunostaining of peripherin in the elephant brainstem. This is an important addition (although it does replace the only staining of myelin provided by the authors, which is unusual as the word myelin is in the title of the paper) as peripherin is known to specifically label peripheral nerves. In addition, as pointed out by the authors, peripherin also immunostains climbing fibres (Errante et al., 1998). The understanding of this staining is important in determining the identification of the IO and Vsens in the elephant, although it is not ideal for this task as there is some ambiguity. Errante and colleagues (1998; Fig. 1) show that climbing fibres are peripherin-immunopositive in the rat. But what the authors do not evaluate is the extensive peripherin staining in the rat Sp5 in the same paper (Errante et al, 1998, Fig. 2). The image provided by the authors of their peripherin immunostaining (their new Figure 2) shows what I would call the Sp5 of the elephant to be strongly peripherin immunoreactive, just like the rat shown in Errante et al (1998), and moreover in the precise position of the rat Sp5! This makes sense as this is where the axons subserving the "extraordinary" tactile sensitivity of the elephant trunk would be found (in the standard model of mammalian brainstem anatomy). Interestingly, the peripherin immunostaining in the elephant is clearly lamellated...this coincides precisely with the description of the trigeminal sensory nuclei in the elephant by Maseko et al (2013) as pointed out by the authors in their rebuttal. Errante et al (1998) also point out peripherin immunostaining in the inferior olive, but according to the authors this is only "weakly present" in the elephant IOM/VsensR. This latter point is crucial. 
Surely if the elephant has an extraordinary sensory innervation from the trunk, with 400 000 axons entering the brain, the VsensR/IOM should be highly peripherin-immunopositive, including the myelinated axon bundles?! In this sense, the authors argue against their own interpretation - either the elephant trunk is not a highly sensitive tactile organ, or the VsensR is not the trigeminal nuclei it is supposed to be. 

      Comment: We made sure that elephant climbing fibers are strongly peripherin-positive (our revised Figure 2). As we already noted in our previous ms, we see weak diffuse peripherin-reactivity in the trigeminal nucleus (the inferior olive according to the referee), but no peripherin-reactive axon bundles (i.e. climbing fibers) of the kind seen in the inferior olive of other species. We also see no peripherin-reactive axon bundles (i.e. the olivo-cerebellar tract) arriving in the trigeminal nucleus, as the tissue surrounding the trigeminal nucleus is devoid of peripherin-reactivity. Again, this finding is incompatible with the referee’s ideas. As far as we can tell, the trigeminal fibers are not reactive for peripherin in the elephant, i.e. we did not observe peripherin-reactivity very close to the nerve entry, but unfortunately, we did not stain for peripherin-reactivity into the nerve. As the referee alludes to, the absence of peripherin-reactivity in the trigeminal tract is a difference between rodents and elephants.

      Change: Our novel Figure 2.

      Summary: 

      (1) Comparative data of species closely related to elephants (Afrotherians) demonstrates that not all mammals exhibit the "serrated" appearance of the principal nucleus of the inferior olive. 

      (2) The location of the IO and Vsens as reported in the current study (IOR and VsensR) would require a significant, and unprecedented, rearrangement of the brainstem in the elephants independently. I argue that the underlying molecular and genetic changes required to achieve this would be so extreme that it would lead to lethal phenotypes. Arguing that the "switcheroo" of the IO and Vsens does occur in the elephant (and no other mammals) and thus doesn't lead to lethal phenotypes is a circular argument that cannot be substantiated. 

      (3) Myelin stripes in the subnuclei of the inferior olivary nuclear complex are seen across all related mammals as shown above. Thus, the observation made in the elephant by the authors in what they call the VsensR, is similar to that seen in the IO of related mammals, especially when the IO takes on a more bulbous appearance. These myelin stripes are the origin of the olivocerebellar pathway, and are indeed calretinin immunopositive in the elephant as I show. 

      (4) What the authors see aligns perfectly with what has been described previously, the only difference being the names that nuclear complexes are being called. But identifying these nuclei is important, as any functional sequelae, as extensively discussed by the authors, is entirely dependent upon accurately identifying these nuclei. 

      (5) The peripherin immunostaining scores an own goal - if peripherin is marking peripheral nerves (as the authors and I believe it is), then why is the VsensR/IOM only "weakly positive" for this stain? This either means that the "extraordinary" tactile sensitivity of the elephant trunk is non-existent, or that the authors have misinterpreted this staining. That there is extensive staining in the fibre pathway dorsal and lateral to the IOR (which I call the spinal trigeminal tract), supports the idea that the authors have misinterpreted their peripherin immunostaining.

      (6) Evolutionary expediency. The authors argue that what they report is an expedient way in which to modify the organisation of the brainstem in the elephant to accommodate the "extraordinary" tactile sensitivity. I disagree. As pointed out in my first review, the elephant cerebellum is very large and comprised of huge numbers of morphologically complex neurons. The inferior olivary nuclei in all mammals studied in detail to date, give rise to the climbing fibres that terminate on the Purkinje cells of the cerebellar cortex. It is more parsimonious to argue that, in alignment with the expansion of the elephant cerebellum (for motor control of the trunk), the inferior olivary nuclei (specifically the principal nucleus) have had additional neurons added to accommodate this cerebellar expansion. Such an addition of neurons to the principal nucleus of the inferior olive could readily lead to the loss of the serrated appearance of the principal nucleus of the inferior olive, and would require far fewer modifications in the developmental genetic program that forms these nuclei. This type of quantitative change appears to be the primary way in which structures are altered in the mammalian brainstem. 

      Comment: We still disagree with the referee. We note that our conclusions rest on the analysis of 8 elephant brainstems, which we sectioned in three planes and stained with a variety of metabolic and antibody stains and in which we assigned two structures (the inferior olive and the trigeminal nucleus). Most of the evidence cited by the referee stems from a single paper, in which 147 structures were identified based on the analysis of a single brainstem sectioned in one plane and stained with a limited set of antibodies. Our synopsis of the evidence is the following.

      (1) We agree with the referee that concerning brainstem position our scheme of a ventromedial trigeminal nucleus and a dorsolateral inferior olive deviates from the usual mammalian position of these nuclei (i.e. a dorsolateral trigeminal nucleus and a ventromedial inferior olive).

      (2) Cytoarchitectonics support our partitioning scheme. The compact cellular appearance of our ventromedial trigeminal nucleus is characteristic of trigeminal nuclei. The serrated appearance of our dorsolateral inferior olive is characteristic of the mammalian inferior olive; we acknowledge that the referee claims exceptions here. To our knowledge, nobody has described a mammalian trigeminal nucleus with a serrated appearance (which would apply to the elephant in case the trigeminal nucleus is situated dorsolaterally).

      (3) Metabolic staining (cytochrome oxidase reactivity) supports our partitioning scheme. Specifically, our ventromedial trigeminal nucleus shows intense cytochrome oxidase reactivity, as is seen in the trigeminal nuclei of trigeminal tactile experts.

      (4) Isomorphism. The myelin stripes on our ventromedial trigeminal nucleus are isomorphic to trunk wrinkles. Isomorphism is a characteristic of somatosensory brain structures (barrels, barrelettes, nose-stripes, etc.) and we know of no case where such isomorphism was misleading.

      (5) The large-scale organization of our ventromedial trigeminal nuclei in anterior-posterior repeats is characteristic of the mammalian trigeminal nuclei. To our knowledge, no such organization has ever been reported for the inferior olive.

      (6) Connectivity analysis supports our partitioning scheme. According to our delineation of the elephant olivo-cerebellar tract, our dorsolateral inferior olive is connected via peripherin-positive climbing fibers to the cerebellum. In contrast, our ventromedial trigeminal nucleus (the referee’s inferior olive) is not connected via climbing fibers to the cerebellum.

      Change: As discussed, we advanced further evidence in this revision. Our partitioning scheme (a ventromedial trigeminal nucleus and a dorsolateral inferior olive) is better supported by data and makes more sense than the referee’s suggestion (a dorsolateral trigeminal nucleus and a ventromedial inferior olive). It should be published.

      Reviewer #3 (Public Review):

      Summary: 

      The study claims to investigate trunk representations in elephant trigeminal nuclei located in the brainstem. The researchers identify large protrusions visible from the ventral surface of the brainstem, which they examined using a range of histological methods. However, this ventral location is usually where the inferior olivary complex is found, which challenges the authors' assertions about the nucleus under analysis. They find that this brainstem nucleus of elephants contains repeating modules, with a focus on the anterior and largest unit which they define as the putative nucleus principalis trunk module of the trigeminal. The nucleus exhibits low neuron density, with glia outnumbering neurons significantly. The study also utilizes synchrotron X-ray phase contrast tomography to suggest that myelin-stripe-axons traverse this module. The analysis maps myelin-rich stripes in several specimens and concludes that, based on their number and patterning, they likely correspond with trunk folds; however, this conclusion is not well supported if the nucleus has been misidentified. 

      Comment: The referee provides a summary of our work. The referee also notes that the correct identification of the trigeminal nucleus is critical to the message of our paper.

      Change: In line with these assessments we focused our revision efforts on the issue of trigeminal nucleus identification; please see our introductory comments and our response to Referee 2.

      Strengths: 

      The strength of this research lies in its comprehensive use of various anatomical methods, including Nissl staining, myelin staining, Golgi staining, cytochrome oxidase labeling, and synchrotron X-ray phase contrast tomography. The inclusion of quantitative data on cell numbers and sizes, dendritic orientation and morphology, and blood vessel density across the nucleus adds a quantitative dimension. Furthermore, the research is commendable for its high-quality and abundant images and figures, effectively illustrating the anatomy under investigation.

      Comment: We appreciate this positive assessment.

      Change: None

      Weaknesses: 

      While the research provides potentially valuable insights if revised to focus on the structure that appears to be inferior olivary nucleus, there are certain additional weaknesses that warrant further consideration. First, the suggestion that myelin stripes solely serve to separate sensory or motor modules rather than functioning as an "axonal supply system" lacks substantial support due to the absence of information about the neuronal origins and the termination targets of the axons. Postmortem fixed brain tissue limits the ability to trace full axon projections. While the study acknowledges these limitations, it is important to exercise caution in drawing conclusions about the precise role of myelin stripes without a more comprehensive understanding of their neural connections. 

      Comment: We understand these criticisms and the need for cautious interpretation. As we noted previously, we think that the eLife publishing scheme, where critical referee commentary is published along with our ms, will make this contribution particularly valuable.

      Change: Our additional efforts to secure the correct identification of the trigeminal nucleus.

      Second, the quantification presented in the study lacks comparison to other species or other relevant variables within the elephant specimens (i.e., whole brain or brainstem volume). The absence of comparative data to different species limits the ability to fully evaluate the significance of the findings. Comparative analyses could provide a broader context for understanding whether the observed features are unique to elephants or more common across species. This limitation in comparative data hinders a more comprehensive assessment of the implications of the research within the broader field of neuroanatomy. Furthermore, the quantitative comparisons between African and Asian elephant specimens should include some measure of overall brain size as a covariate in the analyses. Addressing these weaknesses would enable a richer interpretation of the study's findings. 

      Comment: We understand why the referee asks for additional comparative data, which would make our study more meaningful. We note that we already published a quantitative comparison of African and Asian elephant facial nuclei (Kaufmann et al. 2022). The quantitative differences between African and Asian elephant facial nuclei are similar in magnitude to what we observed here for the trigeminal nucleus, i.e. African elephants have about 10-15% more facial nucleus neurons than Asian elephants. The referee also notes that data on overall elephant brain size might be important for interpreting our data. We agree with this sentiment and we are preparing a ms on African and Asian elephant brain size. We find – unexpectedly given the larger body size of African elephants – that African elephants have smaller brains than Asian elephants. The finding might imply that African elephants, which have more facial nucleus neurons and more trigeminal nucleus trunk module neurons, are neurally more specialized in trunk control than Asian elephants.

      Change: We are preparing a further ms on African and Asian elephant brain size; a first version of this work has been submitted.

      Reviewer #4 (Public Review): 

      Summary: 

      The authors report a novel isomorphism in which the folds of the elephant trunk are recognizably mapped onto the principal sensory trigeminal nucleus in the brainstem. Further, they identify the enlarged nucleus as being situated in this species in an unusual ventral midline position.

      Comment: The referee summarizes our work.

      Change: None.

      Strengths: 

      The identity of the purported trigeminal nucleus and the isomorphic mapping with the trunk folds is supported by multiple lines of evidence: enhanced staining for cytochrome oxidase, an enzyme associated with high metabolic activity; dense vascularization, consistent with high metabolic activity; prominent myelinated bundles that partition the nucleus in a 1:1 mapping of the cutaneous folds in the trunk periphery; near absence of labeling for the anti-peripherin antibody, specific for climbing fibers, which can be seen as expected in the inferior olive; and a high density of glia.

      Comment: The referee again reviews some of our key findings.

      Change: None. 

      Weaknesses: 

      Despite the supporting evidence listed above, the identification of the gross anatomical bumps, conspicuous in the ventral midline, is problematic. This would be the standard location of the inferior olive, with the principal trigeminal nucleus occupying a more dorsal position. This presents an apparent contradiction which at a minimum needs further discussion. Major species-specific specializations and positional shifts are well-documented for cortical areas, but nuclear layouts in the brainstem have been considered as less malleable. 

      Comment: The referee notes that our discrepancy with referee 2 needs to be addressed with further evidence and discussion, given the unusual position of both inferior olive and trigeminal nucleus in the partitioning scheme and that the mammalian brainstem tends to be positionally conservative. We agree with the referee. We note that – based on the immense size of the elephant trigeminal ganglion (50 g), half the size of a monkey brain – it was expected that the elephant trigeminal nucleus ought to be exceptionally large.

      Change: We did additional experimental work to resolve this matter: (i) We ascertained that elephant climbing fibers are strongly peripherin-positive. (ii) Based on elephant climbing fiber peripherin-reactivity we delineated the elephant olivo-cerebellar tract. We find that the olivo-cerebellar tract connects the structure we refer to as inferior olive to the cerebellum. (iii) We also found that the trigeminal nucleus (the structure the referee refers to as inferior olive) appears to receive no climbing fibers. (iv) We provide indications that the tracing of the trigeminal nerve into the olivo-cerebellar tract by Maseko et al. 2023 was erroneous (Referee-Figure 1). These novel findings support our ideas.

      Reviewer #5 (Public Review): 

      After reading the manuscript and the concerns raised by reviewer 2 I see both sides of the argument - the relative location of trigeminal nucleus versus the inferior olive is quite different in elephants (and different from previous studies in elephants), but when there is a large disproportionate magnification of a behaviorally relevant body part at most levels of the nervous system (certainly in the cortex and thalamus), you can get major shifting in location of different structures. In the case of the elephant, it looks like there may be a lot of shifting. Something that is compelling is that the number of modules separated by the myelin bands corresponds to the number of trunk folds, which is different in the different elephants. This sort of modular division based on body parts is a general principle of mammalian brain organization (demonstrated beautifully for the cuneate and gracile nucleus in primates, VP in most species, S1 in a variety of mammals such as the star-nosed mole and duck-billed platypus). I don't think these relative changes in the brainstem would require major genetic programming - although some surely exists. Rodents and elephants have been independently evolving for over 60 million years so there is a substantial amount of time for changes in each lineage to occur.

      I agree that the authors have identified the trigeminal nucleus correctly, although comparisons with more out groups would be needed to confirm this (although I'm not suggesting that the authors do this). I also think the new figure (which shows previous divisions of the brainstem versus their own) allows the reader to consider these issues for themselves. When reviewing this paper, I actually took the time to go through atlases of other species and even look at some of my own data from highly derived species. Establishing homology across groups based only on relative location is tough especially when there appears to be large shifts in relative location of structures. My thoughts are that the authors did an extraordinary amount of work on obtaining, processing and analyzing this extremely valuable tissue. They document their work with images of the tissue and their arguments for their divisions are solid. I feel that they have earned the right to speculate - with qualifications - which they provide. 

      Comment: The referee summarizes our work and appears to be convinced by the line of our arguments. We are most grateful for this assessment. We add, again, that the skeptical assessment of referee 2 will be published as well and will give the interested reader the possibility to view another perspective on our work.

      Change: None. 

      Recommendations for the authors: 

      Reviewer #1 (Recommendations For The Authors):

      With this manuscript being virtually identical to the previous version, it is possible that some of the definitive conclusions about having identified the elephant trigeminal nucleus and trunk representation should be moderated in a more nuanced manner, especially given the careful and experienced perspective from reviewers with first-hand knowledge of elephant neuroanatomy.

      Comment: We agree that both our first and second revisions were very much centered on the debate of the correct identification of the trigeminal nucleus and that our ms did not evolve as much in other regards. This being said we agree with Referee 2 that we needed to have this debate. We also think we advanced important novel data in this context (the delineation of elephant olivo-cerebellar tract through the peripherin-antibody).

      Changes: Our revised Figure 2. 

      The peripherin staining adds another level of argument to the authors having identified the trigeminal brainstem instead of the inferior olive, if differential expression of peripherin is strong enough to distinguish one structure from the other.

      Comment: We think we showed too little peripherin-antibody staining in our previous revision. We have now addressed this problem.

      Changes: Our revised Figure 2, i.e. the delineation of the elephant olivo-cerebellar tract through the peripherin-antibody.

      There are some minor corrections to be made with the addition of Fig. 2., including renumbering the figures in the manuscript (e.g., 406, 521). 

      I continue to appreciate this novel investigation of the elephant brainstem and find it an interesting and thorough study, with the use of classical and modern neuroanatomical methods.

      Comment: We are thankful for this positive assessment.

      Reviewer #2 (Recommendations For The Authors):

      I do realise the authors are very unhappy with me and the reviews I have submitted. I do apologise if feelings have been hurt, and I do understand the authors put in a lot of hard work and thought to develop what they have; however, it is unfortunate that the work and thoughts are not correct. Science is about the search for the truth and sometimes we get it wrong. This is part of the scientific process and why most journals adhere to strict review processes of scientific manuscripts. As I said previously, the authors can use their data to write a paper describing and quantifying Golgi staining of neurons in the principal olivary nucleus of the elephant that should be published in a specialised journal and contextualised in terms of the motor control of the trunk and the large cerebellum of the elephant. 

      Comment: We appreciate the referee’s kind words. Also, no hard feelings from our side, this is just a scientific debate. In our experience, neuroanatomical debates are resolved by evidence and we note that we provide evidence strengthening our identification of the trigeminal nucleus and inferior olive. As far as we can tell from this effort and the substantial evidence accumulated, the referee is wrong.

      Reviewer #4 (Recommendations For The Authors):

      As a new reviewer, I have benefited from reading the previous reviews and Author response, even while having several new comments to add. 

      (1) The identification of the inferior olive and trigeminal nuclei is obviously center stage. An enlargement of the trigeminal nuclei is not necessarily problematic, given the published reports on the dramatic enlargement of the trigeminal nerve (Purkart et al., 2022). At issue is the conspicuous relocation of the trigeminal nuclei that is being promoted by Reveyaz et al. Conspicuous rearrangements are not uncommon; for example, primary sensory cortical fields in different species (fig. 1 in H.H.A. Oelschlager for dolphins; S. De Vreese et al. (2023) for cetaceans, L. Krubitzer on various species, in the context of evolution). The difficult point here concerns what looks like a rather conspicuous gross anatomical rearrangement, in BRAINSTEM - the assumption being that the brainstem bauplan is going to be specifically conservative and refractory to gross anatomical rearrangement. 

      Comment: We agree with the referee that the brainstem rearrangements are unexpected. We also think that the correct identification of nuclei needs to be at the center of our revision efforts.

      Change: Our revision provided further evidence (delineation of the olivo-cerebellar tract, characterization of the trigeminal nerve entry) about the identity of the nuclei we studied.

      Why would a major nucleus shift to such a different location? and how? Can ex vivo DTI provide further support of the correct identification? Is there other "disruption" in the brainstem? What occupies the traditional position of the trigeminal nuclei? An atlas-equivalent coronal view of the entire brainstem would be informative. The Authors have assembled multiple criteria to support their argument that the ventral "bumps" are in fact a translocated trigeminal principal nucleus: enhanced CO staining, enhanced vascularization, enhanced myelination (via Golgi stains and tomography), very scant labeling for a climbing fiber specific antibody (anti-peripherin), vs. dense staining of this in the alternative structure that they identify as IO; and a high density of glia. Admittedly, this should be sufficient, but the proposed translocation (in the BRAINSTEM) is sufficiently startling that this is arguably NOT sufficient.

      The terminology of "putative" is helpful, but a more cogent presentation of the results and more careful discussion might succeed in winning over at least some of a skeptical readership.

      Comment: We do not know what led to the elephant brainstem rearrangements we propose. If the trigeminal nuclei had expanded isometrically in elephants from the ancestral pattern, one would have expected a brain with big lateral bumps, not the elephant brain with its big ventromedial bumps. We note, however, that very likely the expansion of the elephant trigeminal nuclei did not occur isometrically. Instead, the neural representation of the elephant nose expanded dramatically and in rodents the nose is represented ventromedially in the brainstem face representation. Thus, we propose a ‘ventromedial outgrowth model’ according to which the elephant ventromedial trigeminal bumps result from a ventromedially directed outgrowth of the ancestral ventromedial nose representation.

      We advanced substantially more evidence to support our partitioning scheme, including the delineation of the olivo-cerebellar tract based on peripherin-reactivity. We also identified problems in previous partitioning schemes, such as the claim that the trigeminal nerve continues into the ~4x smaller olivocerebellar tract (Referee-Figure 1C, D); we think such a flow of fibers (which is also at odds with peripherin-antibody-reactivity and the appearance of nerve and olivocerebellar tract) is highly unlikely if not physically impossible. With all that, we do not think that we overstate our case in our cautiously presented ms.

      Change: We added evidence on the identification of elephant trigeminal nuclei and inferior olive.

      (2) Role of myelin. While the photos of myelin are convincing, it would be nice to have further documentation. Gallyas? Would antibodies to MBP work? What is the myelin distribution in the "standard" trigeminal nuclei (human? macaque or chimpanzee?). What are alternative sources of the bundles? Regardless, I think it would be beneficial to de-emphasize this point about the role of myelin in demarcating compartments.

      I would in fact suggest an alternative (more neutral) title that might highlight instead the isomorphic feature; for example, "An isomorphic representation of Trunk folds in the Elephant Trigeminal Nucleus." The present title stresses myelin, but figure 1 already focuses on CO. Additionally, the folds are actually mentioned almost in passing until later in the manuscript. I recommend a short section on these at the beginning of the Results to serve as a useful framework.

      Here I'm inclined to agree with the Reviewer, that the Authors' contention that the myelin stripes serve PRIMARILY to separate trunk-fold domains is not particularly compelling and arguably a distraction. The point can be made, but perhaps with less emphasis. After all, the fact that myelin has multiple roles is well-established, even if frequently overlooked. In addition, the Authors might make better use of an extensive relevant literature related to myelin as a compartmental marker; for example, results and discussion in D. Haenelt....N. Weiskopf (eLife, 2023), among others. Another example is the heavily myelinated stria of Gennari in primate visual cortex, consisting of intrinsic pyramidal cell axons, but where the role of the myelination has still not been elucidated.

      Comment: (1) Documentation of myelin. We note that we show further identification of myelinated fibers by the fluorescent dye fluomyelin in Figure 4B. We also performed additional myelin stains, such as the gold-myelin stain after the protocol of Schmued (Referee-Figure 2). In the end, nothing worked quite as well to visualize myelin-stripes as the bright-field images shown in Figure 4A, and it is only these images that allowed us to match myelin-stripes to trunk folds. Hence, we focus our presentation on these images.

      (2) Title: We get why the referee envisions an alternative title. This being said, we would like to stick with our current title, because we feel it highlights the major novelty we discovered.

      (3) We agree with many of the other comments of the referee on myelin phenomenology. We missed the Haenelt reference pointed out by the referee and think it is highly relevant to our paper.

      Change: 1. Review image 2. Inclusion of the Haenelt-reference.

      Author response image 2.

      Myelin stripes of the elephant trunk module visualized by Gold-chloride staining according to Schmued. A, Low magnification micrograph of the trunk module of African elephant Indra stained with AuCl according to Schmued. The putative finger is to the left, proximal is to the right. Myelin stripes can easily be recognized. The white box indicates the area shown in B. B, high magnification micrograph of two myelin stripes. Individual gold-stained (black) axons organized in myelin stripes can be recognized.

      Schmued, L. C. (1990). A rapid, sensitive histochemical stain for myelin in frozen brain sections. Journal of Histochemistry & Cytochemistry, 38(5), 717-720.

      Are the "bumps" in any way "analogous" to the "brain warts" seen in entorhinal areas of some human brains (G. W. van Hoesen and A. Solodkin, 1993)?

      Comment: We think this is a similar phenomenon.

      Change: We included the van Hoesen and Solodkin (1993) reference in our discussion.

      At least slightly more background (ie, a separate section or, if necessary, supplement) would be helpful, going into more detail on the several subdivisions of the ION and if these undergo major alterations in the elephant.

      Comment: The strength of the paper is the detailed delineation of the trunk module, based on myelin stripes and isomorphism. We don’t think we have strong evidence on ION subdivisions, because it appears the trigeminal tract cannot be easily traced in elephants. Accordingly, we find it difficult to add information here.

      Change: None.

      Is there evidence from the literature of other conspicuous gross anatomical translocations, in any species, especially in subcortical regions? 

      Comment: The best example that comes to mind is the star-nosed mole brainstem. There is a beautiful paper comparing the star-nosed mole brainstem to the normal mole brainstem (Catania et al 2011). The principal trigeminal nucleus in the star-nosed mole is far more rostral and also more medial than in the mole; still, such rearrangements are minor compared to what we propose in elephants.

      Catania, Kenneth C., Duncan B. Leitch, and Danielle Gauthier. "A star in the brainstem reveals the first step of cortical magnification." PloS one 6.7 (2011): e22406.

      Change: None.

      (3) A major point concerns the isomorphism between the putative trigeminal nuclei and the trunk specialization. I think this can be much better presented, at least with more discussion and other examples. The Authors mention the rodent "barrels," but it seemed strange to me that they do not refer to their own results in pig (C. Ritter et al., 2023) nor the work from Ken Catania, 2002 (star-nosed mole; "fingerprints in the brain") or others that might be appropriate. I concur with the Reviewer that there should be more comparative data.

      Comment: We agree.

      Change: We added a discussion of other isomorphisms, including the star-nosed mole, to our paper.

      (4) Textual organization could be improved. 

      The Abstract all-important Introduction is a longish, semi "run-on" paragraph. At a minimum this should be broken up. The last paragraph of the Introduction puts forth five issues, but these are only loosely followed in the Results section. I think clarity and good organization are of the utmost importance in this manuscript. I recommend that the Authors begin the Results with a section on the trunk folds (currently figure 5, and discussion), continue with the several points related to the identification of the trigeminal nuclei, and continue with a parallel description of ION with more parallel data on the putative trigeminal and IO structures (currently referee Table 1, but incorporate into the text and add higher magnification of nucleus-specific cell types in the IO and trigeminal nuclei). Relevant comparative data should be included in the Discussion.

      Comment: 1. We agree with the referee that our abstract needed to be revised. 2. We also think that our ms was heavily altered by the insertion of the new Figure 2, which complemented Figure 1 from our first submission and is concerned with the identification of the inferior olive. From a standpoint of textual flow such changes were not ideal, but the revisions massively added to the certainty with which we identify the trigeminal nuclei. Thus, although we are not as content as we were with the flow, we think the ms advanced in the revision process and we would like to keep the Figure sequence as is. 3. We already noted above that we included additional comparative evidence.

      Change: 1. We revised our abstract. 2. We added comparative evidence.

      Reviewer #5 (Recommendations For The Authors): 

      The data is invaluable and provides insights into some of the largest mammals on the planet. 

      Comment: We are incredibly thankful for this positive assessment.

    1. Author response:

      The following is the authors’ response to the previous reviews.

      eLife Assessment

      This neuroimaging and electrophysiology study in a small cohort of congenital cataract patients with sight recovery aims to characterize the effects of early visual deprivation on excitatory and inhibitory balance in visual cortex. While contrasting sight-recovery with visually intact controls suggested the existence of persistent alterations in Glx/GABA ratio and aperiodic EEG signals, it provided only incomplete evidence supporting claims about the effects of early deprivation itself. The reported data were considered valuable, given the rare study population. However, the small sample sizes, lack of a specific control cohort and multiple methodological limitations will likely restrict usefulness to scientists working in this particular subfield.

      We thank the reviewing editors for their consideration and updated assessment of our manuscript after its first revision.

      In order to assess the effects of early deprivation, we included an age-matched, normally sighted control group recruited from the same community, measured in the same scanner and laboratory. This study design is analogous to numerous studies in permanently congenitally blind humans, which typically recruited sighted controls, but hardly ever individuals with a different, e.g. late blindness history. In order to improve the specificity of our conclusions, we used a frontal cortex voxel in addition to a visual cortex voxel (MRS). Analogously, we separately analyzed occipital and frontal electrodes (EEG).

      Moreover, we relate our findings in congenital cataract reversal individuals to findings in the literature on permanent congenital blindness. Note, there are, to the best of our knowledge, neither MRS nor resting-state EEG studies in individuals with permanent late blindness.

      Our participants necessarily have nystagmus and low visual acuity due to their congenital deprivation phase, and the existence of nystagmus is a recruitment criterion to diagnose congenital cataracts.

      It might be interesting for future studies to investigate individuals with transient late blindness. However, such a study would be ill-motivated had we not found differences between the most “extreme” of congenital visual deprivation conditions and normally sighted individuals (analogous to why earlier research on permanent blindness investigated permanent congenitally blind humans first, rather than permanently late blind humans, or both in the same study). Any results of such future work would need to reference our study, and no results in these additional groups would invalidate our findings.

      Since all our congenital cataract reversal individuals by definition had visual impairments, we included an eyes closed condition, both in the MRS and EEG assessment. Any group effect during the eyes closed condition cannot be due to visual acuity deficits changing the bottom-up driven visual activation.

      As we detail in response to review 3, our EEG analyses followed the standards in the field.

      Public Reviews:

      Reviewer #1 (Public review):

      Summary

      In this human neuroimaging and electrophysiology study, the authors aimed to characterise effects of a period of visual deprivation in the sensitive period on excitatory and inhibitory balance in the visual cortex. They attempted to do so by comparing neurochemistry conditions ('eyes open', 'eyes closed') and resting state, and visually evoked EEG activity between ten congenital cataract patients with recovered sight (CC), and ten age-matched control participants (SC) with normal sight.

      First, they used magnetic resonance spectroscopy to measure in vivo neurochemistry from two locations, the primary location of interest in the visual cortex, and a control location in the frontal cortex. Such voxels are used to provide a control for the spatial specificity of any effects, because the single-voxel MRS method provides a single sampling location. Using MR-visible proxies of excitatory and inhibitory neurotransmission, Glx and GABA+ respectively, the authors report no group effects in GABA+ or Glx, no difference in the functional conditions 'eyes closed' and 'eyes open'. They found an effect of group in the ratio of Glx/GABA+ and no similar effect in the control voxel location. They then perform multiple exploratory correlations between MRS measures and visual acuity, and report a weak positive correlation between the 'eyes open' condition and visual acuity in CC participants.

      The same participants then took part in an EEG experiment. The authors selected two electrodes placed in the visual cortex for analysis and report a group difference in an EEG index of neural activity, the aperiodic intercept, as well as the aperiodic slope, considered a proxy for cortical inhibition. Control electrodes in the frontal region did not present with the same pattern. They report an exploratory correlation between the aperiodic intercept and Glx in one out of three EEG conditions.

      The authors report the difference in E/I ratio, and interpret the lower E/I ratio as representing an adaptation to visual deprivation, which would have initially caused a higher E/I ratio. Although intriguing, the strength of evidence in support of this view is not strong. Amongst the limitations are the low sample size, a critical control cohort that could provide evidence for higher E/I ratio in CC patients without recovered sight for example, and lower data quality in the control voxel. Nevertheless, the study provides a rare and valuable insight into experience-dependent plasticity in the human brain.

      Strengths of study

      How sensitive period experience shapes the developing brain is an enduring and important question in neuroscience. This question has been particularly difficult to investigate in humans. The authors recruited a small number of sight-recovered participants with bilateral congenital cataracts to investigate the effect of sensitive period deprivation on the balance of excitation and inhibition in the visual brain using measures of brain chemistry and brain electrophysiology. The research is novel, and the paper was interesting and well written.

      Limitations

      Low sample size. Ten for CC and ten for SC, and a further two SC participants were rejected due to lack of frontal control voxel data. The sample size limits the statistical power of the dataset and increases the likelihood of effect inflation.

      In the updated manuscript, the authors have provided justification for their sample size by pointing to prior studies and the inherent difficulties in recruiting individuals with bilateral congenital cataracts. Importantly, this highlights the value the study brings to the field while also acknowledging the need to replicate the effects in a larger cohort.

      Lack of specific control cohort. The control cohort has normal vision. The control cohort is not specific enough to distinguish between people with sight loss due to different causes and patients with congenital cataracts with co-morbidities. Further data from more specific populations, such as patients whose cataracts have not been removed, with developmental cataracts, or congenitally blind participants, would greatly improve the interpretability of the main finding. The lack of a more specific control cohort is a major caveat that limits a conclusive interpretation of the results.

      In the updated version, the authors have indicated that future studies can pursue comparisons between congenital cataract participants and cohorts with later sight loss.

      MRS data quality differences. Data quality in the control voxel appears worse than in the visual cortex voxel. The frontal cortex MRS spectrum shows far broader linewidth than the visual cortex (Supplementary Figures). Compared to the visual voxel, the frontal cortex voxel has less defined Glx and GABA+ peaks; lower GABA+ and Glx concentrations, lower NAA SNR values; lower NAA concentrations. If the data quality is a lot worse in the FC, then small effects may not be detectable.

      In the updated version, the authors have added more information that informs the reader of the MRS quality differences between voxel locations. This increases the transparency of their reporting and enhances the assessment of the results.

      Because of the direction of the difference in E/I, the authors interpret their findings as representing signatures of sight improvement after surgery without further evidence, either within the study or from the literature. However, the literature suggests that plasticity and visual deprivation drive the E/I index up rather than down. Decreasing GABA+ is thought to facilitate experience-dependent remodelling. What evidence is there that cortical inhibition increases in response to a visual cortex that is over-sensitised due to congenital cataracts? Without further experimental or literature support this interpretation remains very speculative.

      The updated manuscript contains key references from non-human work to justify their interpretation.

      Heterogeneity in patient group. Congenital cataract (CC) patients experienced a variety of duration of visual impairment and were of different ages. They presented with co-morbidities (absorbed lens, strabismus, nystagmus). Strabismus has been associated with abnormalities in GABAergic inhibition in the visual cortex. The possible interactions with residual vision and confounds of co-morbidities are not experimentally controlled for in the correlations, and not discussed.

      The updated document has addressed this caveat.

      Multiple exploratory correlations were performed to relate MRS measures to visual acuity (shown in Supplementary Materials), and only specific ones shown in the main document. The authors describe the analysis as exploratory in the 'Methods' section. Furthermore, the correlation between visual acuity and E/I metric is weak, not corrected for multiple comparisons. The results should be presented as preliminary, as no strong conclusions can be made from them. They can provide a hypothesis to test in a future study.

      This has now been done throughout the document and increases the transparency of the reporting.

      P.16 Given the correlation of the aperiodic intercept with age ("Age negatively correlated with the aperiodic intercept across CC and SC individuals, that is, a flattening of the intercept was observed with age"), age needs to be controlled for in the correlation between neurochemistry and the aperiodic intercept. Glx has also been shown to negatively correlate with age.

      This caveat has been addressed in the revised manuscript.

      Multiple exploratory correlations were performed to relate MRS to EEG measures (shown in Supplementary Materials), and only specific ones shown in the main document. Given the multiple measures from the MRS, the correlations with the EEG measures were exploratory, as stated in the text, p.16, and in Fig.4. yet the introduction said that there was a prior hypothesis "We further hypothesized that neurotransmitter changes would relate to changes in the slope and intercept of the EEG aperiodic activity in the same subjects." It would be great if the text could be revised for consistency and the analysis described as exploratory.

      This has been done throughout the document and increases the transparency of the reporting.

      The analysis for the EEG needs to take more advantage of the available data. As far as I understand, only two electrodes were used, yet far more were available as seen in their previous study (Ossandon et al., 2023). The spatial specificity is not established. The authors could use the frontal cortex electrode (FP1, FP2) signals as a control for spatial specificity in the group effects, or even better, all available electrodes and correct for multiple comparisons. Furthermore, they could use the aperiodic intercept vs Glx in SC to evaluate the specificity of the correlation to CC.

      This caveat has been addressed. The authors have added frontal electrodes to their analysis, providing an essential regional control for the visual cortex location.

      Comments on the latest version:

      The authors have made reasonable adjustments to their manuscript that addressed most of my comments by adding further justification for their methodology, essential literature support, pointing out exploratory analyses, limitations and adding key control analyses. Their revised manuscript has overall improved, providing valuable information, though the evidence that supports their claims is still incomplete.

      We thank the reviewer for suggesting ways to improve our manuscript and carefully reassessing our revised manuscript.

      Reviewer #2 (Public review):

      Summary:

      The study examined 10 congenitally blind patients who recovered vision through the surgical removal of bilateral dense cataracts, measuring neural activity and neurochemical profiles from the visual cortex. The declared aim is to test whether restoring visual function after years of complete blindness impacts excitation/inhibition balance in the visual cortex.

      Strengths:

      The findings are undoubtedly useful for the community, as they contribute towards characterising the many ways in which this special population differs from normally sighted individuals. The combination of MRS and EEG measures is a promising strategy to estimate a fundamental physiological parameter - the balance between excitation and inhibition in the visual cortex, which animal studies show to be heavily dependent upon early visual experience. Thus, the reported results pave the way for further studies, which may use a similar approach to evaluate more patients and control groups.

      Weaknesses:

      The main methodological limitation is the lack of an appropriate comparison group or condition to delineate the effect of sight recovery (as opposed to the effect of congenital blindness). A few previous studies suggested that the Excitation/Inhibition ratio in the visual cortex is increased in congenitally blind patients; the present study reports that the E/I ratio decreases instead. The authors claim that this implies a change of E/I ratio following sight recovery. However, supporting this claim would require showing a shift of E/I after vs. before the sight-recovery surgery, or at least it would require comparing patients who did and did not undergo the sight-recovery surgery (as common in the field).

      We thank the reviewer for suggesting ways to improve our manuscript and carefully reassessing our revised manuscript.

      Since we have not been able to acquire longitudinal data with the experimental design of the present study in congenital cataract reversal individuals, we compared the MRS and EEG results of congenital cataract reversal individuals to published work in congenitally permanently blind individuals. We consider this a resource-saving approach. We think that the results of our cross-sectional study now justify the costs and enormous efforts (and time for the patients who often have to travel long distances) associated with longitudinal studies in this rare population.

      There are also more technical limitations related to the correlation analyses, which are partly acknowledged in the manuscript. A bland correlation between GLX/GABA and the visual impairment is reported, but this is specific to the patients group (N=10) and would not hold across groups (the correlation is positive, predicting the lowest GLX/GABA ratio values for the sighted controls - opposite of what is found). There is also a strong correlation between GLX concentrations and the EEG power at the lowest temporal frequencies. Although this relation is intriguing, it only holds for a very specific combination of parameters (of the many tested): only with eyes open, only in the patients group.

      Given the exploratory nature of the correlations, we do not base the majority of our conclusions on this analysis. There are no doubts that the reported correlations need replication; however, replication is only possible after a first report. Thus, we hope to motivate corresponding analyses in further studies.

      It has to be noted that in the present study significance testing for correlations was corrected for multiple comparisons, and that some findings replicate earlier reports (e.g. effects on EEG aperiodic slope, alpha power, and correlations with chronological age).

      Conclusions:

      The main claim of the study is that sight recovery impacts the excitation/inhibition balance in the visual cortex, estimated with MRS or through indirect EEG indices. However, due to the weaknesses outlined above, the study cannot distinguish the effects of sight recovery from those of visual deprivation. Moreover, many aspects of the results are interesting but their validation and interpretation require additional experimental work.

      We interpret the group differences between individuals tested years after congenital visual deprivation and normally sighted individuals as supportive of the E/I ratio being impacted by congenital visual deprivation. In the absence of a sensitive period for the development of an E/I ratio, individuals with a transient phase of congenital blindness might have developed a visual system indistinguishable from normally sighted individuals. As we demonstrate, this is not so. Comparing the results of congenitally blind humans with those of congenitally permanently blind humans (from previous studies) allowed us to identify changes of E/I ratio, which add to those found for congenital blindness.

      We thank the reviewer for the helpful comments and suggestions related to the first submission and first revision of our manuscript. We are keen to translate some of them into future studies.

      Reviewer #3 (Public review):

      This manuscript examines the impact of congenital visual deprivation on the excitatory/inhibitory (E/I) ratio in the visual cortex using Magnetic Resonance Spectroscopy (MRS) and electroencephalography (EEG) in individuals whose sight was restored. Ten individuals with reversed congenital cataracts were compared to age-matched, normally sighted controls, assessing the cortical E/I balance and its relationship to visual acuity. The study reveals that the Glx/GABA ratio in the visual cortex and the intercept of the aperiodic EEG signal are significantly altered in those with a history of early visual deprivation, suggesting persistent neurophysiological changes despite visual restoration.

      First of all, I would like to disclose that I am not an expert in congenital visual deprivation, nor in MRS. My expertise is in EEG (particularly in the decomposition of periodic and aperiodic activity) and statistical methods.

      Although the authors addressed some of the concerns of the previous version, major concerns and flaws remain in terms of methodological and statistical approaches along with the (over)interpretation of the results. Specific concerns include:

      (1 3.1) Response to Variability in Visual Deprivation

      Rather than listing the advantages and disadvantages of visual deprivation, I recommend providing at least a descriptive analysis of how the duration of visual deprivation influenced the measures of interest. This would enhance the depth and relevance of the discussion.

      Although Review 2 and Review 3 (see below) pointed out problems in interpreting multiple correlational analyses in small samples, we addressed this request by reporting such correlations between visual deprivation history and measured EEG/MRS outcomes.

      Calculating the correlation between duration of visual deprivation and behavioral or brain measures is, in fact, a common suggestion. The existence of sensitive periods, which are typically assumed not to follow a linear, gradual decline of neuroplasticity, does not necessarily allow predicting a correlation with duration of blindness. Daphne Maurer has additionally worked on the concept of “sleeper effects” (Maurer et al., 2007), that is, effects of early deprivation on the brain and behavior which are observed only later in life when the function/neural circuits mature.

      In accordance with this reasoning, we did not observe a significant correlation between duration of visual deprivation and any of our dependent variables.

      (2 3.2) Small Sample Size

      The issue of small sample size remains problematic. The justification that previous studies employed similar sample sizes does not adequately address the limitation in the current study. I strongly suggest that the correlation analyses should not feature prominently in the main manuscript or the abstract, especially if the discussion does not substantially rely on these correlations. Please also revisit the recommendations made in the section on statistical concerns.

      In the revised manuscript, we explicitly mention that our sample size is not atypical for the special group investigated, but that a replication of our results in larger samples would foster their impact. We only explicitly mention correlations that survived stringent testing for multiple comparisons in the main manuscript.

      Given the exploratory nature of the correlations, we have not based the majority of our claims on this analysis.

      (3 3.3) Statistical Concerns

      While I appreciate the effort of conducting an independent statistical check, it merely validates whether the reported statistical parameters, degrees of freedom (df), and p-values are consistent. However, this does not address the appropriateness of the chosen statistical methods.

      We did not intend for the statcheck report to justify the methods used for statistics, which we have done in a separate section with normality and homogeneity testing (Supplementary Material S9), and references to it in the descriptions of the statistical analyses (Methods, Page 13, Lines 326-329 and Page 15, Lines 400-402).

      Several points require clarification or improvement:

      (4) Correlation Methods: The manuscript does not specify whether the reported correlation analyses are based on Pearson or Spearman correlation.

      The depicted correlations are Pearson correlations. We will add this information to the Methods.

      (5) Confidence Intervals: Include confidence intervals for correlations to represent the uncertainty associated with these estimates.

      We have added the confidence intervals for all measured correlations to the second revision of our manuscript.
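
      For readers unfamiliar with how such intervals are typically obtained, the following is a minimal sketch of one common approach, the Fisher z-transform interval for a Pearson correlation. It is an illustration only: the response does not state which method the authors used, and the function name `fisher_ci` and the example values (r = 0.5, n = 10) are hypothetical.

```python
import math
from statistics import NormalDist


def fisher_ci(r, n, level=0.95):
    """Approximate confidence interval for a Pearson r via the Fisher z-transform.

    Assumes bivariate normality; with n around 10 the interval is wide.
    """
    z = math.atanh(r)                      # Fisher z-transform of r
    se = 1.0 / math.sqrt(n - 3)            # standard error of z
    crit = NormalDist().inv_cdf(0.5 + level / 2)  # two-sided critical value
    return math.tanh(z - crit * se), math.tanh(z + crit * se)


# Hypothetical example: a moderate correlation in a sample of ten.
lo, hi = fisher_ci(0.5, 10)
print(f"95% CI: [{lo:.3f}, {hi:.3f}]")
```

      With ten observations, even a correlation of 0.5 yields an interval that spans zero, which illustrates why the reviewer asked for the uncertainty to be reported.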

      (6) Permutation Statistics: Given the small sample size, I recommend using permutation statistics, as these are exact tests and more appropriate for small datasets.

      Our study focuses on a rare population, with a sample size limited by the availability of participants. Our findings provide exploratory insights rather than make strong inferential claims. To this end, we have ensured that our analysis adheres to key statistical assumptions (Shapiro-Wilk as well as Levene’s tests, Supplementary Material S9), and reported our findings with effect sizes, appropriate caution and context.
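
      The reviewer's suggestion can be sketched as follows. This is a hypothetical illustration, not an analysis from the manuscript: the function name `permutation_pvalue`, the toy data, and the number of permutations are all assumptions. Note that random permutations give a Monte Carlo approximation of the exact test; with ten participants, full enumeration of all 10! orderings would also be feasible.

```python
import numpy as np


def permutation_pvalue(x, y, n_perm=10_000, seed=0):
    """Two-sided Monte Carlo permutation test for a Pearson correlation."""
    rng = np.random.default_rng(seed)
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    r_obs = np.corrcoef(x, y)[0, 1]
    count = 0
    for _ in range(n_perm):
        # Shuffling y breaks any pairing with x, simulating the null hypothesis.
        r_perm = np.corrcoef(x, rng.permutation(y))[0, 1]
        if abs(r_perm) >= abs(r_obs):
            count += 1
    # Add-one correction so the estimated p-value is never exactly zero.
    return r_obs, (count + 1) / (n_perm + 1)


# Toy data (hypothetical): ten paired observations with a monotone relation.
x_demo = np.arange(10.0)
y_demo = x_demo ** 2
r_demo, p_demo = permutation_pvalue(x_demo, y_demo)
print(f"r = {r_demo:.3f}, permutation p = {p_demo:.4f}")
```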

      (7) Adjusted P-Values: Ensure that reported Bonferroni corrected p-values (e.g., p > 0.999) are clearly labeled as adjusted p-values where applicable.

      In the revised manuscript, we have changed Figure 4 to say ‘adjusted p,’ which we indeed reported.
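
      As a brief illustration of why a Bonferroni-adjusted value can appear as "p > 0.999": the adjustment multiplies each raw p-value by the number of tests and caps the result at 1. The sketch below uses made-up p-values and a hypothetical helper name, and is not taken from the manuscript.

```python
def bonferroni_adjust(p_values):
    """Multiply each p-value by the number of tests, capping the result at 1."""
    m = len(p_values)
    return [min(1.0, p * m) for p in p_values]


# Hypothetical raw p-values from three tests.
raw_p = [0.004, 0.03, 0.40]
adjusted_p = bonferroni_adjust(raw_p)
print(adjusted_p)
```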

      (8) Figure 2C

      Figure 2C still lacks crucial information that the correlation between Glx/GABA ratio and visual acuity was computed solely in the control group (as described in the rebuttal letter). Why was this analysis restricted to the control group? Please provide a rationale.

      Figure 2C depicts the correlation between Glx/GABA+ ratio and visual acuity in the congenital cataract reversal group, not the control group. This is mentioned in the Figure 2 legend, as well as in the main text where the figure is referred to (Page 18, Line 475).

      The correlation analyses between visual acuity and MRS/EEG measures were only performed in the congenital cataract reversal group since the sighted control group comprised individuals with vision in the normal range; thus these analyses would not make sense in that group. Table 1 with the individual visual acuities for all participants, including the normally sighted controls, shows the low variance in the latter group.

      For variables in which no a priori group differences in variance were predicted, we performed the correlation analyses across groups (see Supplementary Material S12, S15).

      We have now highlighted these motivations more clearly in the Methods of the revised manuscript (Page 16, Lines 405-410).

      (9 3.4) Interpretation of Aperiodic Signal

      Relying on previous studies to interpret the aperiodic slope as a proxy for excitation/inhibition (E/I) does not make the interpretation more robust.

      How to interpret aperiodic EEG activity has been the subject of extensive investigation. We cite studies which provide evidence from multiple species (monkeys, humans) and measurements (EEG, MEG, ECoG), including studies which pharmacologically manipulated E/I balance.

      Whether our findings are robust, in fact, requires a replication study. Importantly, we analyzed the intercept of the aperiodic activity fit as well, and discuss results related to the intercept.

      Quote:

      “(3.4) Interpretation of aperiodic signal:

      - Several recent papers demonstrated that the aperiodic signal measured in EEG or ECoG is related to various important aspects such as age, skull thickness, electrode impedance, as well as cognition. Thus, currently, very little is known about the underlying effects which influence the aperiodic intercept and slope. The entire interpretation of the aperiodic slope as a proxy for E/I is based on a computational model and simulation (as described in the Gao et al. paper).

      Apart from the modeling work of Gao et al., we have cited multiple papers that used ECoG, EEG and MEG and showed concomitant changes in aperiodic activity with pharmacological manipulation of the E/I ratio (Colombo et al., 2019; Molina et al., 2020; Muthukumaraswamy & Liley, 2018). Further, several prior studies have interpreted changes in the aperiodic slope as reflective of changes in the E/I ratio, including studies of developmental groups (Favaro et al., 2023; Hill et al., 2022; McSweeney et al., 2023; Schaworonkow & Voytek, 2021) as well as patient groups (Molina et al., 2020; Ostlund et al., 2021).

      - The authors further wrote: We used the slope of the aperiodic (1/f) component of the EEG spectrum as an estimate of E/I ratio (Gao et al., 2017; Medel et al., 2020; Muthukumaraswamy & Liley, 2018). This is a highly speculative interpretation with very little empirical evidence. These papers were conducted with ECoG data (mostly in animals) and mostly under anesthesia. Thus, these studies only allow an indirect interpretation by what the 1/f slope in EEG measurements is actually influenced.

      Note that Muthukumaraswamy et al. (2018) used different types of pharmacological manipulations and analyzed periodic and aperiodic MEG activity in humans, in addition to monkey ECoG (Muthukumaraswamy & Liley, 2018). Further, Medel et al. (now published as Medel et al., 2023) compared EEG activity in addition to ECoG data after propofol administration. The interpretation of our results is in line with a number of recent studies in developing (Hill et al., 2022; Schaworonkow & Voytek, 2021) and special populations using EEG. As mentioned above, several prior studies have used the slope of the 1/f component/aperiodic activity as an indirect measure of the E/I ratio (Favaro et al., 2023; Hill et al., 2022; McSweeney et al., 2023; Molina et al., 2020; Ostlund et al., 2021; Schaworonkow & Voytek, 2021), including studies using scalp-recorded EEG from humans.

      In the introduction of the revised manuscript, we have made more explicit that this metric is indirect (Page 3, Line 91), (additionally see Discussion, Page 24, Lines 644-645, Page 25, Lines 650-657).

      While a full understanding of aperiodic activity remains to be provided, some convergent ideas have emerged. We think that our results contribute to this enterprise, since our study is, to the best of our knowledge, the first which assessed MRS measured neurotransmitter levels and EEG aperiodic activity.”

      (10) Additionally, the authors state:

      "We cannot think of how any of the exploratory correlations between neurophysiological measures and MRS measures could be accounted for by a difference e.g. in skull thickness."

      (11) This could be addressed directly by including skull thickness as a covariate or visualizing it in scatterplots, for instance, by representing skull thickness as the size of the dots.

      We are not aware of any study that would justify such an analysis.

      Our analyses were based on previous findings in the literature.

      Since to the best of our knowledge, no evidence exists that congenital cataracts go together with changes in skull thickness, and that skull thickness might selectively modulate visual cortex Glx/GABA+ but not NAA measures, we decided against following this suggestion.

      Notably, the neurotransmitter concentration reported here is after tissue segmentation of the voxel region. The tissue fraction was shown to not differ between groups in the MRS voxels (Supplementary Material S4). The EEG electrode impedance was lowered to <10 kOhm in every participant (Methods, Page 13, Line 344), and preparation was identical across groups.

      (12 3.5) Problems with EEG Preprocessing and Analysis

      Downsampling: The decision to downsample the data to 60 Hz "to match the stimulation rate" is problematic. This choice conflates subsequent spectral analyses due to aliasing issues, as explained by the Nyquist theorem. While the authors cite prior studies (Schwenk et al., 2020; VanRullen & MacDonald, 2012) to justify this decision, these studies focused on alpha (8-12 Hz), where aliasing is less of a concern compared with analyzing the aperiodic signal. Furthermore, the current study analyzes the frequency range from 1-20 Hz, which is too narrow for interpreting the aperiodic signal as E/I. Typically, this analysis should include higher frequencies, spanning at least 1-30 Hz or even 1-45 Hz (not 20-40 Hz).

      As previously mentioned in the Methods (Page 15, Line 376) and the previous response, the pop_resample function used by EEGLAB applies an anti-aliasing filter at half the resampling frequency, as per the Nyquist theorem (https://eeglab.org/tutorials/05_Preprocess/resampling.html). The upper cutoff of the low-pass filter set by EEGLAB prior to downsampling (30 Hz) is still far above the frequency range of interest in the current study (1-20 Hz), thus allowing us to derive valid results.
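
      The role of the anti-aliasing filter can be illustrated with a small synthetic example. This sketch uses SciPy's `resample_poly`, not EEGLAB's pop_resample, so the filter design differs from the one the authors applied; the original sampling rate of 600 Hz and the 50 Hz test tone are assumptions chosen for illustration. Without filtering, a 50 Hz component sampled at 60 Hz folds back to 10 Hz; with a low-pass filter near the new Nyquist frequency (30 Hz), that alias is strongly attenuated.

```python
import numpy as np
from scipy.signal import resample_poly

fs_in, fs_out = 600, 60          # hypothetical original and target rates (Hz)
factor = fs_in // fs_out         # integer decimation factor (10)

t = np.arange(10 * fs_in) / fs_in        # 10 s of samples
x = np.sin(2 * np.pi * 50 * t)           # 50 Hz tone, above the new Nyquist (30 Hz)

naive = x[::factor]                      # decimation WITHOUT anti-aliasing
filtered = resample_poly(x, 1, factor)   # low-pass FIR filter, then decimation


def amplitude_at(sig, fs, freq):
    """Single-sided FFT amplitude at the bin closest to `freq`."""
    spec = np.abs(np.fft.rfft(sig)) * 2 / len(sig)
    freqs = np.fft.rfftfreq(len(sig), d=1 / fs)
    return spec[np.argmin(np.abs(freqs - freq))]


# 50 Hz sampled at 60 Hz aliases to |50 - 60| = 10 Hz.
amp_naive = amplitude_at(naive, fs_out, 10)
amp_filtered = amplitude_at(filtered, fs_out, 10)
print(f"alias amplitude at 10 Hz: naive={amp_naive:.3f}, filtered={amp_filtered:.4f}")
```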

      Quote:

      “- The authors downsampled the data to 60Hz to "to match the stimulation rate". What is the intention of this? Because the subsequent spectral analyses are conflated by this choice (see Nyquist theorem).

      These data were collected as part of a study designed to evoke alpha activity with visual white-noise, which ranged in luminance with equal power at all frequencies from 1-60 Hz, restricted by the refresh rate of the monitor on which stimuli were presented (Pant et al., 2023). This paradigm and method were developed by VanRullen and colleagues (Schwenk et al., 2020; VanRullen & MacDonald, 2012), wherein the analysis requires the same sampling rate between the presented frequencies and the EEG data. The downsampling function used here automatically applies an anti-aliasing filter (EEGLAB 2019).”

      Moreover, the resting-state data were not resampled to 60 Hz. We have made this clearer in the Methods of the second revision (Page 15, Line 367).

      Our consistent results of group differences across all three EEG conditions, thus, exclude any possibility that they were driven by aliasing artifacts.

      The expected effects of this anti-aliasing filter can be seen in the attached Author response image 1, showing an example participant’s spectrum in the 1-30 Hz range (as opposed to the 1-20 Hz plotted in the manuscript), clearly showing a 30-40 dB drop at 30 Hz. Any aliasing due to, for example, remaining line noise, would additionally be visible in this figure (as well as Figure 3) as a peak.

      Author response image 1.

      Power spectral density of one congenital cataract-reversal (CC) participant in the visual stimulation condition across all channels. The reduced power at 30 Hz shows the effects of the anti-aliasing filter applied by EEGLAB’s pop_resample function.

      As we stated in the manuscript, and in previous reviews, so far there has been no consensus on the exact range of measuring aperiodic activity. We made a principled decision based on the literature (showing a knee in aperiodic fits of this dataset at 20 Hz) (Medel et al., 2023; Ossandón et al., 2023), data quality (possible contamination by line noise at higher frequencies) and the purpose of the visual stimulation experiment (to look at the lower frequency range by stimulating up to 60 Hz, thereby limiting us to quantifying below 30 Hz), that 1-20 Hz would be the fit range in this dataset.

      Quote:

      “(3) What's the underlying idea of analyzing two separate aperiodic slopes (20-40Hz and 1-19Hz). This is very unusual to compute the slope between 20-40 Hz, where the SNR is rather low.

      "Ossandón et al. (2023), however, observed that in addition to the flatter slope of the aperiodic power spectrum in the high frequency range (20-40 Hz), the slope of the low frequency range (1-19 Hz) was steeper in both, congenital cataract-reversal individuals, as well as in permanently congenitally blind humans."

      The present manuscript computed the slope between 1-20 Hz. Ossandón et al. as well as Medel et al. (2023) found a “knee” of the 1/f distribution at 20 Hz and describe further the motivations for computing both slope ranges. For example, Ossandón et al. used a data driven approach and compared single vs. dual fits and found that the latter fitted the data better. Additionally, they found the best fit if a knee at 20 Hz was used. We would like to point out that no standard range exists for the fitting of the 1/f component across the literature and, in fact, very different ranges have been used (Gao et al., 2017; Medel et al., 2023; Muthukumaraswamy & Liley, 2018).”

      (13) Baseline Removal: Subtracting the mean activity across an epoch as a baseline removal step is inappropriate for resting-state EEG data. This preprocessing step undermines the validity of the analysis. The EEG dataset has fundamental flaws, many of which were pointed out in the previous review round but remain unaddressed. In its current form, the manuscript falls short of standards for robust EEG analysis. If I were reviewing for another journal, I would recommend rejection based on these flaws.

      The baseline removal step from each epoch serves to remove the DC component of the recording and detrend the data. This is a standard preprocessing step (included as an option in preprocessing pipelines recommended by the EEGLAB toolbox, FieldTrip toolbox and MNE toolbox), additionally necessary to improve the efficacy of ICA decomposition (Groppe et al., 2009).

      In the previous review round, a clarification of the baseline timing was requested, which we added. Beyond this request, there was no mention of the appropriateness of the baseline removal and/or a request to provide reasons for why it might not undermine the validity of the analysis.

      Quote:

      “- "Subsequently, baseline removal was conducted by subtracting the mean activity across the length of an epoch from every data point." The actual baseline time segment should be specified.

      The time segment was the length of the epoch, that is, 1 second for the resting state conditions and 6.25 seconds for the visual stimulation conditions. This has been explicitly stated in the revised manuscript (Page 13, Line 354).”

      Prior work in the time (not frequency) domain on event-related potential (ERP) analysis has suggested that the baselining step might cause spurious effects (Delorme, 2023) (although see (Tanner et al., 2016)). We did not perform ERP analysis at any stage. One recent study suggests spurious group differences in the 1/f signal might be driven by an inappropriate dB division baselining method (Gyurkovics et al., 2021), which we did not perform.

      Any effect of our baselining procedure on the FFT spectrum would be below the 1 Hz range, which we did not analyze.  
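
      The claim that mean subtraction only affects the spectrum below 1 Hz can be checked on synthetic data. The sketch below is an illustration under stated assumptions, not the authors' pipeline: the sampling rate (1000 Hz), the simulated epoch, and the use of an untapered FFT are all hypothetical choices. For a plain discrete Fourier transform of a 1 second epoch, subtracting the epoch mean changes only the 0 Hz bin; if a taper or Welch-style windowing is applied, some leakage into neighbouring bins can occur.

```python
import numpy as np

rng = np.random.default_rng(0)
fs = 1000                                   # hypothetical sampling rate (Hz)
epoch = rng.standard_normal(fs) + 25.0      # 1 s of noise with a DC offset

spec_raw = np.fft.rfft(epoch)
spec_demeaned = np.fft.rfft(epoch - epoch.mean())

# Bin 0 is the DC (0 Hz) component; bins 1.. are 1 Hz, 2 Hz, ... for a 1 s epoch.
dc_after = abs(spec_demeaned[0])
others_unchanged = np.allclose(spec_raw[1:], spec_demeaned[1:])
print(f"DC after demeaning: {dc_after:.2e}; other bins unchanged: {others_unchanged}")
```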

      Each of the preprocessing steps in the manuscript matches pipelines described and published in extensive prior work. We document how multiple aspects of our EEG results replicate prior findings (Supplementary Material S15, S18, S19), reports of other experimenters, groups and locations, validating that our results are robust.

      We therefore reject the claim of methodological flaws in our EEG analyses in the strongest possible terms.

      Quote:

      “(3.5) Problems with EEG preprocessing and analysis:

      - It seems that the authors did not identify bad channels nor address the line noise issue (even a problem if a low pass filter of below-the-line noise was applied).

      As pointed out in the Methods and Figure 1, we only analyzed data from two occipital channels, O1 and O2, neither of which was rejected for any participant. Channel rejection was performed for the larger dataset, published elsewhere (Ossandón et al., 2023; Pant et al., 2023). As control sites we added the frontal channels Fp1 and Fp2 (see Supplementary Material S14).

      Neither Ossandón et al. (2023) nor Pant et al. (2023) considered frequency ranges above 40 Hz to avoid any possible contamination with line noise. Here, we focused on activity between 0 and 20 Hz, definitely excluding line noise contaminations (Methods, Page 14, Lines 365-367). The low pass filter (FIR, 1-45 Hz) guaranteed that any spill-over effects of line noise would be restricted to frequencies just below the upper cutoff frequency.

      Additionally, a prior version of the analysis used spectrum interpolation to remove line noise; the group differences remained stable (Ossandón et al., 2023). We have reported this analysis in the revised manuscript (Page 14, Lines 364-357).

      Further, both groups were measured in the same lab, making line noise (~ 50 Hz) as an account for the observed group effects in the 1-20 Hz frequency range highly unlikely. Finally, any of the exploratory MRS-EEG correlations would be hard to explain if the EEG parameters were contaminated with line noise.

      - What was the percentage of segments that needed to be rejected due to the 120μV criteria? This should be reported specifically for EO & EC and controls and patients.

      The mean percentage of 1 second segments rejected for each resting state condition and the percentage of 6.25 second segments rejected in each group for the visual stimulation condition have been added to the revised manuscript (Supplementary Material S10), and referred to in the Methods (Page 14, Lines 372-373).

      - The authors downsampled the data to 60 Hz "to match the stimulation rate". What is the intention of this? Because the subsequent spectral analyses are conflated by this choice (see Nyquist theorem).

      These data were collected as part of a study designed to evoke alpha activity with visual white noise, which changed in luminance with equal power at all frequencies from 1-60 Hz, restricted by the refresh rate of the monitor on which stimuli were presented (Pant et al., 2023). This paradigm and method were developed by VanRullen and colleagues (Schwenk et al., 2020; VanRullen & MacDonald, 2012), and the analysis requires the same sampling rate for the presented frequencies and the EEG data. The downsampling function used here automatically applies an anti-aliasing filter (EEGLAB 2019).
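
The reviewer's Nyquist concern can be made concrete with a small check. The sketch below only encodes the sampling theorem (a signal sampled at rate fs can represent frequencies below fs/2); the helper names are hypothetical and nothing here reproduces the EEGLAB pipeline.

```python
def nyquist_hz(sampling_rate_hz):
    """Highest frequency representable at the given sampling rate."""
    return sampling_rate_hz / 2.0


def band_below_nyquist(band_hz, sampling_rate_hz):
    """True if the whole (low, high) band lies strictly below the Nyquist frequency."""
    low, high = band_hz
    return 0.0 <= low < high < nyquist_hz(sampling_rate_hz)


print(nyquist_hz(60.0))                        # 30.0
print(band_below_nyquist((1.0, 20.0), 60.0))   # True
print(band_below_nyquist((20.0, 40.0), 60.0))  # False
```

Under this check, a 1-20 Hz analysis band remains below the 30 Hz Nyquist limit of data downsampled to 60 Hz, whereas a band extending to 40 Hz would not.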

      - "Subsequently, baseline removal was conducted by subtracting the mean activity across the length of an epoch from every data point." The actual baseline time segment should be specified.

      The time segment was the length of the epoch, that is, 1 second for the resting state conditions and 6.25 seconds for the visual stimulation conditions. This has now been explicitly stated in the revised manuscript (Page 14, Lines 379-380).
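
The baseline removal described here amounts to subtracting each epoch's own mean from every sample. A minimal sketch, with a hypothetical function name and toy values rather than real EEG:

```python
def remove_epoch_mean(epoch):
    """Subtract the mean over the whole epoch from every sample of that epoch."""
    if not epoch:
        raise ValueError("epoch must contain at least one sample")
    mean = sum(epoch) / len(epoch)
    return [sample - mean for sample in epoch]


# Toy epoch: after correction the samples are centred on zero.
corrected = remove_epoch_mean([1.0, 2.0, 3.0])
print(corrected)  # [-1.0, 0.0, 1.0]
```

Because the baseline window is the entire epoch, the same function applies unchanged to the 1-second resting-state epochs and the 6.25-second visual stimulation epochs.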

      - "We excluded the alpha range (8-14 Hz) for this fit to avoid biasing the results due to documented differences in alpha activity between CC and SC individuals (Bottari et al., 2016; Ossandón et al., 2023; Pant et al., 2023)." This does not really make sense, as the FOOOF algorithm first fits the 1/f slope, for which the alpha activity is not relevant.

      We did not use the FOOOF algorithm/toolbox in this manuscript. As stated in the Methods, we used a 1/f fit to the 1-20 Hz spectrum in the log-log space, and subtracted this fit from the original spectrum to obtain the corrected spectrum. Given the pronounced difference in alpha power between groups (Bottari et al., 2016; Ossandón et al., 2023; Pant et al., 2023), we were concerned it might drive differences in the exponent values. Our analysis pipeline had been adapted from previous publications of our group and other labs (Ossandón et al., 2023; Voytek et al., 2015; Waschke et al., 2017).

      We have conducted the analysis with and without the exclusion of the alpha range, as well as using the FOOOF toolbox both in the 1-20 Hz and 20-40 Hz ranges (Ossandón et al., 2023). The findings of a steeper slope in the 1-20 Hz range as well as lower alpha power in CC vs SC individuals remained stable. In Ossandón et al., the comparison between the piecewise fits and FOOOF fits led the authors to use the former, as it outperformed the FOOOF algorithm for their data.
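
The fitting procedure described in this response can be sketched as follows. This is an illustration under stated assumptions, not the authors' pipeline: the fit is an ordinary least-squares line in log10-log10 space, the excluded alpha range is treated as inclusive at both ends, the fit is subtracted in log space, and the spectrum is synthetic.

```python
import math


def fit_aperiodic(freqs, power, fit_range=(1.0, 20.0), exclude=(8.0, 14.0)):
    """Least-squares line in log10-log10 space; returns (intercept, slope).

    Frequencies inside `exclude` (assumed inclusive) do not contribute to the fit.
    """
    xs, ys = [], []
    for f, p in zip(freqs, power):
        in_range = fit_range[0] <= f <= fit_range[1]
        in_excluded = exclude[0] <= f <= exclude[1]
        if in_range and not in_excluded:
            xs.append(math.log10(f))
            ys.append(math.log10(p))
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


def corrected_spectrum(freqs, power, intercept, slope):
    """Subtract the fitted line from the log10 spectrum at every frequency."""
    return [
        math.log10(p) - (intercept + slope * math.log10(f))
        for f, p in zip(freqs, power)
    ]


# Synthetic spectrum: power = 100 / f**1.5 with an artificial peak at 10 Hz.
freqs = [float(f) for f in range(1, 21)]
power = [100.0 / f ** 1.5 for f in freqs]
power[9] *= 3.0  # index 9 is 10 Hz

intercept, slope = fit_aperiodic(freqs, power)
residual = corrected_spectrum(freqs, power, intercept, slope)
print(round(slope, 6), round(intercept, 6))
```

In this synthetic case the 10 Hz peak falls inside the excluded range, so the recovered slope matches the generating exponent and the peak survives in the corrected spectrum. Including the alpha range in the fit would let the peak pull on the fitted line, which is the concern the response raises.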

      - The model fits of the 1/f fitting for EO, EC, and both participant groups should be reported.

      In Figure 3 of the manuscript, we depicted the mean spectra and 1/f fits for each group.

      In the revised manuscript, we added the fit quality metrics (average R<sup>2</sup> values > 0.91 for each group and condition) (Methods Page 15, Lines 395-396; Supplementary Material S11) and additionally show individual subjects’ fits (Supplementary Material S11).
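
For reference, a fit quality metric of this kind can be computed as the coefficient of determination between the observed spectrum and the fitted values. The sketch assumes the standard definition R² = 1 − SS_res / SS_tot; whether it was evaluated in log space or over which frequency bins is not specified here, and the function name is hypothetical.

```python
def r_squared(observed, predicted):
    """Coefficient of determination: 1 - residual sum of squares / total sum of squares."""
    if len(observed) != len(predicted):
        raise ValueError("observed and predicted must have the same length")
    mean_obs = sum(observed) / len(observed)
    ss_res = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    ss_tot = sum((o - mean_obs) ** 2 for o in observed)
    return 1.0 - ss_res / ss_tot


print(r_squared([1.0, 2.0, 3.0], [1.0, 2.0, 3.0]))  # 1.0 (perfect fit)
print(r_squared([1.0, 2.0, 3.0], [2.0, 2.0, 2.0]))  # 0.0 (no better than the mean)
```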

      (14) The authors mention:

      "The EEG data sets reported here were part of data published earlier (Ossandón et al., 2023; Pant et al., 2023)." Thus, the statement "The group differences for the EEG assessments corresponded to those of a larger sample of CC individuals (n=38) " is a circular argument and should be avoided."

      The authors addressed this comment and adjusted the statement. However, I do not understand why the full sample published earlier (Ossandón et al., 2023) was not used in the current study.

      The recording of EEG resting state data started in 2013, while MRS testing could only be set up by the second half of 2019. Moreover, not all subjects who qualify for EEG recording qualify for being scanned (e.g. due to MRI safety, claustrophobia).

      References

      Bottari, D., Troje, N. F., Ley, P., Hense, M., Kekunnaya, R., & Röder, B. (2016). Sight restoration after congenital blindness does not reinstate alpha oscillatory activity in humans. Scientific Reports. https://doi.org/10.1038/srep24683

      Colombo, M. A., Napolitani, M., Boly, M., Gosseries, O., Casarotto, S., Rosanova, M., Brichant, J. F., Boveroux, P., Rex, S., Laureys, S., Massimini, M., Chieregato, A., & Sarasso, S. (2019). The spectral exponent of the resting EEG indexes the presence of consciousness during unresponsiveness induced by propofol, xenon, and ketamine. NeuroImage, 189(September 2018), 631–644. https://doi.org/10.1016/j.neuroimage.2019.01.024

      Delorme, A. (2023). EEG is better left alone. Scientific Reports, 13(1), 2372. https://doi.org/10.1038/s41598-023-27528-0

      Favaro, J., Colombo, M. A., Mikulan, E., Sartori, S., Nosadini, M., Pelizza, M. F., Rosanova, M., Sarasso, S., Massimini, M., & Toldo, I. (2023). The maturation of aperiodic EEG activity across development reveals a progressive differentiation of wakefulness from sleep. NeuroImage, 277. https://doi.org/10.1016/J.NEUROIMAGE.2023.120264

      Gao, R., Peterson, E. J., & Voytek, B. (2017). Inferring synaptic excitation/inhibition balance from field potentials. NeuroImage, 158(March), 70–78. https://doi.org/10.1016/j.neuroimage.2017.06.078

      Groppe, D. M., Makeig, S., & Kutas, M. (2009). Identifying reliable independent components via split-half comparisons. NeuroImage, 45(4), 1199–1211. https://doi.org/10.1016/j.neuroimage.2008.12.038

      Gyurkovics, M., Clements, G. M., Low, K. A., Fabiani, M., & Gratton, G. (2021). The impact of 1/f activity and baseline correction on the results and interpretation of time-frequency analyses of EEG/MEG data: A cautionary tale. NeuroImage, 237. https://doi.org/10.1016/j.neuroimage.2021.118192

      Hill, A. T., Clark, G. M., Bigelow, F. J., Lum, J. A. G., & Enticott, P. G. (2022). Periodic and aperiodic neural activity displays age-dependent changes across early-to-middle childhood. Developmental Cognitive Neuroscience, 54, 101076. https://doi.org/10.1016/J.DCN.2022.101076

      Maurer, D., Mondloch, C. J., & Lewis, T. L. (2007). Sleeper effects. In Developmental Science. https://doi.org/10.1111/j.1467-7687.2007.00562.x

      McSweeney, M., Morales, S., Valadez, E. A., Buzzell, G. A., Yoder, L., Fifer, W. P., Pini, N., Shuffrey, L. C., Elliott, A. J., Isler, J. R., & Fox, N. A. (2023). Age-related trends in aperiodic EEG activity and alpha oscillations during early- to middle-childhood. NeuroImage, 269, 119925. https://doi.org/10.1016/j.neuroimage.2023.119925

      Medel, V., Irani, M., Crossley, N., Ossandón, T., & Boncompte, G. (2023). Complexity and 1/f slope jointly reflect brain states. Scientific Reports, 13(1), 21700. https://doi.org/10.1038/s41598-023-47316-0

      Molina, J. L., Voytek, B., Thomas, M. L., Joshi, Y. B., Bhakta, S. G., Talledo, J. A., Swerdlow, N. R., & Light, G. A. (2020). Memantine Effects on Electroencephalographic Measures of Putative Excitatory/Inhibitory Balance in Schizophrenia. Biological Psychiatry: Cognitive Neuroscience and Neuroimaging, 5(6), 562–568. https://doi.org/10.1016/j.bpsc.2020.02.004

      Muthukumaraswamy, S. D., & Liley, D. T. (2018). 1/F electrophysiological spectra in resting and drug-induced states can be explained by the dynamics of multiple oscillatory relaxation processes. NeuroImage, 179(November 2017), 582–595. https://doi.org/10.1016/j.neuroimage.2018.06.068

      Ossandón, J. P., Stange, L., Gudi-Mindermann, H., Rimmele, J. M., Sourav, S., Bottari, D., Kekunnaya, R., & Röder, B. (2023). The development of oscillatory and aperiodic resting state activity is linked to a sensitive period in humans. NeuroImage, 275, 120171. https://doi.org/10.1016/J.NEUROIMAGE.2023.120171

      Ostlund, B. D., Alperin, B. R., Drew, T., & Karalunas, S. L. (2021). Behavioral and cognitive correlates of the aperiodic (1/f-like) exponent of the EEG power spectrum in adolescents with and without ADHD. Developmental Cognitive Neuroscience, 48, 100931. https://doi.org/10.1016/j.dcn.2021.100931

      Pant, R., Ossandón, J., Stange, L., Shareef, I., Kekunnaya, R., & Röder, B. (2023). Stimulus-evoked and resting-state alpha oscillations show a linked dependence on patterned visual experience for development. NeuroImage: Clinical, 103375. https://doi.org/10.1016/J.NICL.2023.103375

      Schaworonkow, N., & Voytek, B. (2021). Longitudinal changes in aperiodic and periodic activity in electrophysiological recordings in the first seven months of life. Developmental Cognitive Neuroscience, 47. https://doi.org/10.1016/j.dcn.2020.100895

      Schwenk, J. C. B., VanRullen, R., & Bremmer, F. (2020). Dynamics of Visual Perceptual Echoes Following Short-Term Visual Deprivation. Cerebral Cortex Communications, 1(1). https://doi.org/10.1093/TEXCOM/TGAA012

      Tanner, D., Norton, J. J. S., Morgan-Short, K., & Luck, S. J. (2016). On high-pass filter artifacts (they’re real) and baseline correction (it’s a good idea) in ERP/ERMF analysis. Journal of Neuroscience Methods, 266, 166–170. https://doi.org/10.1016/j.jneumeth.2016.01.002

      VanRullen, R., & MacDonald, J. S. P. (2012). Perceptual echoes at 10 Hz in the human brain. Current Biology. https://doi.org/10.1016/j.cub.2012.03.050

      Voytek, B., Kramer, M. A., Case, J., Lepage, K. Q., Tempesta, Z. R., Knight, R. T., & Gazzaley, A. (2015). Age-related changes in 1/f neural electrophysiological noise. Journal of Neuroscience, 35(38). https://doi.org/10.1523/JNEUROSCI.2332-14.2015

      Waschke, L., Wöstmann, M., & Obleser, J. (2017). States and traits of neural irregularity in the age-varying human brain. Scientific Reports, 7(1), 1–12. https://doi.org/10.1038/s41598-017-17766-4

    1. Author response:

      The following is the authors’ response to the previous reviews

      Public Reviews:

      Reviewer #1 (Public review):

      Summary:

      For many years, there has been extensive electrophysiological research investigating the relationship between local field potential patterns and individual cell spike patterns in the hippocampus. In this study, using state-of-the-art imaging techniques, the authors examined spike synchrony of hippocampal cells during locomotion and immobility states. In contrast to conventional understanding of the hippocampus, the authors demonstrated that hippocampal place cells exhibit prominent synchronous spikes locked to theta oscillations.

      Strengths:

      The voltage imaging used in this study is a highly novel method that allows recording not only suprathreshold-level spikes but also subthreshold-level activity. With its high frame rate, it offers time resolution comparable to electrophysiological recordings.

      We thank the reviewer for a thorough review of our manuscript and for recognizing the strength of our study.

      Reviewer #2 (Public review):

      Summary:

      This study employed voltage imaging in the CA1 region of the mouse hippocampus during the exploration of a novel environment. The authors report that synchronous activity, involving almost half of the imaged neurons, occurred during periods of immobility. These events did not correlate with SWRs but instead occurred during theta oscillations and were phase-locked to the trough of theta. Moreover, pairs of neurons with high synchronization tended to display non-overlapping place fields, leading the authors to suggest these events may play a role in binding a distributed representation of the context.

      Strengths:

      Technically this is an impressive study, using an emerging approach that allows single-cell resolution voltage imaging in animals that, while head-fixed, can move through a real environment. The paper is written clearly and suggests novel observations about population-level activity in CA1.

      We thank the reviewer for a thorough review of our manuscript and for recognizing the strength of our study.

      Weaknesses:

      The evidence provided is weak, with the authors making surprising population-level claims based on a very sparse data set (5 data sets, each with less than 20 neurons simultaneously recorded) acquired with exciting, but less tested technology. Further, while the authors link these observations to the novelty of the context, both in the title and text, they do not include data from subsequent visits to support this. Detailed comments are below:

      (1) My first question for the authors, which is not addressed in the discussion, is why these events have not been observed in the countless extracellular recording experiments conducted in rodent CA1 during exploration of novel environments. Those data sets often have 10x the neurons simultaneously recorded compared to the present data, thus the highly synchronous firing should be very hard to miss. Ideally, the authors could confirm their claims via the analysis of publicly available electrophysiology data sets. Further, the claim of high extra-SWR synchrony is complicated by the observation that their recorded neurons fail to spike during the limited number of SWRs recorded during behavior, again not agreeing with much of the previous electrophysiological recordings.

      (2) The authors posit that these events are linked to the novelty of the context, both in the text, as well as in the title and abstract. However, they do not include any imaging data from subsequent days to demonstrate the failure to see this synchrony in a familiar environment. If these data are available, it would strengthen the proposed link to novelty if they were included.

      (3) In the discussion the authors begin by speculating the theta present during these synchronous events may be slower type II or attentional theta. This can be supported by demonstrating a frequency shift in the theta recording during these events/immobility versus the theta recording during movement.

      (4) The authors mention in the discussion that they image deep layer PCs in CA1, however this is not mentioned in the text or methods. They should include data, such as imaging of a slice of a brain post-recording with immunohistochemistry for a layer-specific gene to support this.

      Comments on revisions:

      I have no further major requests and thank the authors for the additional data and analyses.

      We thank the reviewer for recognizing our efforts in revising the manuscript.

      Reviewer #3 (Public review):

      Summary:

      In the present manuscript, the authors use a few minutes of voltage imaging of CA1 pyramidal cells in head-fixed mice running on a track while local field potentials (LFPs) are recorded. The authors suggest that synchronous ensembles of neurons are differentially associated with different types of LFP patterns, theta and ripples. The experiments are flawed in that the LFP is not "local" but rather collected from the other side of the brain.

      Strengths:

      The authors use a cutting-edge technique.

      We thank the reviewer for a thoughtful review of our manuscript and for pointing out the technical strength of our study.

      Weaknesses:

      The two main messages of the manuscript indicated in the title are not supported by the data. The title gives two messages that relate to CA1 pyramidal neurons in behaving head-fixed mice: (1) synchronous ensembles are associated with theta (2) synchronous ensembles are not associated with ripples. The main problem with the work is that the theta and ripple signals were recorded using electrophysiology from the opposite hemisphere to the one in which the spiking was monitored. However, both rhythms exhibit profound differences as a function of location.

      Theta phase changes with the precise location along the proximo-distal and dorso-ventral axes, and importantly, even reverses with depth. Because the LFP was recorded using a single-contact tungsten electrode, there is no way to know whether the electrode was exactly in the CA1 pyramidal cell layer, or in the CA1 oriens, CA1 radiatum, or perhaps even CA3 - which exhibits ripples and theta which are weakly correlated and in anti-phase with the CA1 rhythms, respectively. Thus, there is no way to know whether the theta phase used in the analysis is the phase of the local CA1 theta.

      Although the occurrence of CA1 ripples is often correlated across parts of the hippocampus, ripples are inherently a locally-generated rhythm. Independent ripples occur within a fraction of a millimeter within the same hemisphere. Ripples are also very sensitive to the precise depth - 100 micrometers up or down, and only a positive deflection/sharp wave is evident. Thus, even if the LFP was recorded from the center of the CA1 pyramidal layer in the contralateral hemisphere, it would not suffice for the claim made in the title.

      We thank the reviewer for pointing out the issue regarding the claim made in the title. We have revised the manuscript to clarify that the theta and ripple oscillations referenced in the title refer to specific frequency bands of intracellular and contralaterally recorded field potentials rather than field potentials recorded at the same site as the neuronal activity.

      Abstract (line 19):

      “… Notably, these synchronous ensembles were not associated with contralateral ripple oscillations but were instead phase-locked to theta waves recorded in the contralateral CA1 region. Moreover, the subthreshold membrane potentials of neurons exhibited coherent intracellular theta oscillations with a depolarizing peak at the moment of synchrony.”

      Introduction (line 68):

      “… Surprisingly, these synchronous ensembles occurred outside of contralateral ripples and were phase-locked to intracellular theta oscillations as well as extracellular theta oscillations recorded from the contralateral CA1 region.”

      To address concerns about electrode placement, we have now included post-hoc histological verification of electrode locations, confirming that they were positioned in the contralateral CA1 pyramidal layer (Author response image 1).

      Author response image 1.

      Post-hoc histological section showing the location of a DiI-coated electrode in the contralateral CA1 pyramidal layer. Scale bar: 300 μm.

      While we appreciate that theta and ripple oscillations exhibit regional variations in phase and amplitude, previous studies have demonstrated a strong co-occurrence and synchrony of these oscillations between both hippocampi (1-3). Given that our primary objective was to examine how neuronal ensembles relate to large-scale hippocampal oscillation states rather than local microcircuit-level fluctuations, we recorded theta and ripple oscillations from the contralateral CA1 region.

      However, we acknowledge that contralateral recordings do not capture all ipsilateral-specific dynamics. Theta phases vary with depth and precise location, and local ripple events may be independently generated across small spatial scales. To reflect this, we have now explicitly acknowledged these considerations in the discussion. 

      Discussion (line 527):

      While contralateral LFP recordings reliably capture large-scale hippocampal theta and ripple oscillations, they may not fully account for ipsilateral-specific dynamics, such as variations in theta phase alignment or locally generated ripple events. Although contralateral recordings serve as a well-established proxy for large-scale hippocampal oscillatory states, incorporating simultaneous ipsilateral field potential recordings in future studies could refine our understanding of local-global network interactions. Despite these considerations, our findings provide robust evidence for the existence of synchronous neuronal ensembles and their role in coordinating newly formed place cells. These results advance our understanding of how synchronous neuronal ensembles contribute to spatial memory acquisition and hippocampal network coordination.

      Recommendations for the authors:

      Reviewer #1 (Recommendations for the authors):

      The authors have provided sufficient experimental and analytical data addressing my comments, particularly regarding consistency with past electrophysiological data and the exclusion of potential imaging artifacts.

      We thank the reviewer for recognizing our efforts in revising the manuscript.

      Minor comment: In Figure 2C and Figure 5-figure supplement 1, 'paired Student's t-test' is not entirely appropriate. More precisely, either 'paired t-test' or 'Student's t-test' would better indicate the correct statistical method. Please verify whether these data comparisons are within-group or between-group.

      Thank you for the comment. We have revised the manuscript as suggested.

      Reviewer #2 (Recommendations for the authors):

      I have no further major requests and thank the authors for the additional data and analyses.

      We thank the reviewer for recognizing our efforts in revising the manuscript.

      Minor point: line 169, typo; correct "grant" to "grand".

      Thank you for pointing it out. The typo has been corrected.

      (1) Buzsaki, G. et al. Hippocampal network patterns of activity in the mouse. Neuroscience 116, 201-211 (2003). https://doi.org/10.1016/s0306-4522(02)00669-3

      (2) Szabo, G. G. et al. Ripple-selective GABAergic projection cells in the hippocampus. Neuron 110, 1959-1977 e1959 (2022). https://doi.org/10.1016/j.neuron.2022.04.002

      (3) Huang, Y. C. et al. Dynamic assemblies of parvalbumin interneurons in brain oscillations. Neuron 112, 2600-2613 e2605 (2024). https://doi.org/10.1016/j.neuron.2024.05.015

    1. Author response:

      The following is the authors’ response to the previous reviews

      Response to the reviewer #2 (Public review):

      We greatly appreciate the reviewer’s high evaluation of our paper and helpful comments and suggestions.

      Regarding the in vivo Treg homing assay, we did not exclude doublets and dead cells from the analysis of Kaede-expressing Tregs that migrated to the aorta, which may affect the results. We described this issue as a limitation of this study in the revised manuscript. Nonetheless, we believe our findings are reliable because we repeated this experiment three times and obtained similar results.

      There is no evidence to support the clinical relevance of our findings. Future clinical research on this topic is highly desired.

      Response to the reviewer #3 (Public review):

      We greatly appreciate the reviewer’s high evaluation of our paper and helpful comments and suggestions.

      Despite the controversial role of Th17 cells in atherosclerosis, we understand the possible involvement of Th17 cells and the Th1 cell/Th17 cell balance in lymphoid tissues and aortic lesions in accelerated inflammation and atherosclerosis in Ccr4<sup>-/-</sup>Apoe<sup>-/-</sup> mice. Although we could not evaluate the changes in these immune responses in detail, future studies may elucidate interesting mechanisms mediated by Th17 cell responses.

      As the reviewer suggested, we understand that it is necessary to provide in vivo evidence for the Treg suppressive effects on DC activation. Based on the results of in vitro experiments, we added a discussion of the in vivo evidence to the revised manuscript.

      We understand the methodological limitations of the flow cytometric analysis of immune cells in the aorta and of the in vivo Treg homing assay. We described this issue as a limitation of this study in the revised manuscript. Regarding the in vivo Treg homing assay, we statistically re-analyzed the combined data from multiple experiments and observed a tendency toward reduction in the proportion of CCR4-deficient Kaede-expressing Tregs in the aorta of recipient Apoe<sup>-/-</sup> mice, though there was no statistically significant difference in migratory capacity between CCR4-intact and CCR4-deficient Kaede-expressing Tregs. Accordingly, we toned down our claim that CCR4 expression on Tregs plays a critical role in mediating Treg migration to the atherosclerotic aorta under hypercholesterolemia.

      The reviewer asked us to evaluate aortic inflammation in Ccr4<sup>-/-</sup>Apoe<sup>-/-</sup> mice injected with CCR4-intact or CCR4-deficient Tregs. However, we think that this experiment would provide only marginal information because Treg transfer experiments in Apoe<sup>-/-</sup> mice have already shown the protective role of CCR4 in Tregs against aortic inflammation and early atherosclerosis.

      Recommendations for the authors:

      Reviewer #2 (Recommendations for the authors):

      (1) #1 and #2: CD103 and CD86 expression should be discussed in the text and not only in the response to the reviewer.

      In accordance with the reviewer’s suggestion, we added a discussion on the downregulated CD103 expression in peripheral LN Tregs and upregulated CD86 expression on DCs in Ccr4<sup>-/-</sup>Apoe<sup>-/-</sup> mice in the discussion section in the revised manuscript.

      (2) #5: The authors' response is not satisfactory. No gate percentage is shown. As it currently is, the difference in the number of cells shown in the figure could be due to differences in events recorded. Furthermore, the gating strategy is not thorough. Considering the very low frequency of Kaede+ cells detected, it is crucial to properly exclude doublets and dead cells.

      The authors reported a dramatic difference in Kaede+ Treg cells in the aorta across experiments. This could be addressed by normalization followed by appropriate statistical analysis (one-sample t-test).

      The data shown is not strong enough to conclude that there is a reduced migration to the aorta.

      We understand the importance of the reviewer’s suggestion. We now show the percentage of Kaede+ Tregs in the aorta of Apoe<sup>-/-</sup> mice receiving transfer of Kaede-expressing CCR4-intact or CCR4-deficient Tregs in Figure 5I.

      As the reviewer pointed out, we understand that it would be important to properly exclude doublets and dead cells in the in vivo Treg homing assay. However, it is difficult for us to resolve this issue because we would need to perform the same experiments again, which would require a great number of additional mice and a substantial amount of time. We deeply regret that these important experimental procedures were not performed. We described this issue as a limitation of this study.

      In accordance with the reviewer’s suggestion, we re-analyzed the combined data from multiple experiments using a one-sample t-test. We observed a tendency toward reduction in the proportion of CCR4-deficient Kaede-expressing Tregs in the aorta of recipient Apoe<sup>-/-</sup> mice, though there was no statistically significant difference in migratory capacity between CCR4-intact and CCR4-deficient Kaede-expressing Tregs. By modifying the corresponding descriptions in the manuscript, we toned down our claim that CCR4 expression on Tregs plays a critical role in mediating Treg migration to the atherosclerotic aorta under hypercholesterolemia.
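
The normalization-plus-one-sample-t-test approach suggested by the reviewer can be sketched as follows. The normalization scheme is an assumption made for illustration (within each experiment, the CCR4-deficient value divided by the CCR4-intact value, tested against a reference of 1), the function name is hypothetical, and the ratios are placeholders, not data from the study.

```python
import math
import statistics


def one_sample_t(values, popmean):
    """Return (t statistic, degrees of freedom) for a one-sample t-test."""
    n = len(values)
    if n < 2:
        raise ValueError("at least two observations are required")
    mean = statistics.fmean(values)
    sd = statistics.stdev(values)  # sample standard deviation (n - 1 denominator)
    t_stat = (mean - popmean) / (sd / math.sqrt(n))
    return t_stat, n - 1


# Placeholder ratios, one per independent experiment; NOT data from the study.
ratios = [0.5, 0.75, 1.0]
t_stat, dof = one_sample_t(ratios, popmean=1.0)
print(round(t_stat, 3), dof)  # -1.732 2
```

The p-value then follows from the t distribution with the returned degrees of freedom. With only a few independent experiments the test has little power, which is consistent with reporting a tendency rather than a significant difference.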

      (3) #8: Several results are still reported as "data not shown".

      In accordance with the reviewer’s suggestion, we showed the data on the responses of Tregs and effector memory T cells in 8-week-old wild-type or Ccr4<sup>-/-</sup> mice and Ccr4 mRNA expression in Tregs and non-Tregs from Apoe<sup>-/-</sup> or Ccr4<sup>-/-</sup>Apoe<sup>-/-</sup> mice in Supplementary Figures 4 and 7.

      Reviewer #3 (Recommendations for the authors):

      (1) Issue 1. For future studies, I recommend not omitting viability controls during cell staining. Removal of dead cells and doublets should always be included during the gating strategy to avoid undesirable artefacts, especially when analysing less-represented cell populations. According to your previous report (ref #40), I agree that isotype controls were unnecessary using the same staining protocol. FMO controls should always be included in flow cytometry analysis (not mentioned in the methodology description and ref#40).

      As the reviewer suggested, we understand that it would be important to properly exclude dead cells and doublets and to prepare FMO controls in flow cytometric analysis. We deeply regret that these important experimental procedures were not performed. We described this issue as a limitation of this study.

      (2) Issue 3. Although Th17's role in atherosclerosis remains controversial, the data obtained in this work could provide valuable insights if discussed appropriately. As noted in my public review, I found it noteworthy that RORγt+ cells represented around 13% of effector T CD45+CD3+CD4+ lymphocytes in the aorta of Apoe<sup>-/-</sup> mice while Th1 cells represented less than 5% (Fig 4H and F, respectively). I recognise that differences in cell staining sensitivity and robustness for different transcription factors may influence these percentages. However, analysing how CCR4 deficiency influences the Th1/Th17 balance would yield interesting data, similar to what was done for the Th1/Treg ratio.

      Considering the higher proportion of Th17 cells than Th1 or Th2 cells in the atherosclerotic aorta, we understand the importance of the reviewer’s suggestion. However, we could not evaluate the effect of CCR4 deficiency on the Th1/Th17 balance in the aorta because we did not perform flow cytometric analysis of aortic Th1 and Th17 cells in the same mice. We could, however, examine the Th1/Th17 balance in peripheral lymphoid tissues by flow cytometry. We found a significant increase in the Th1/Th17 ratio in the peripheral LNs of Ccr4<sup>-/-</sup>Apoe<sup>-/-</sup> mice, while there were no changes in this ratio in the spleen or para-aortic LNs of these mice, which limits the contribution of the Th1/Th17 balance to exacerbated atherosclerosis. We show these data below.

      Author response image 1.

      (3) Issue 4. I appreciate the authors for sharing data on the flow cytometry analysis of Tregs in para-aortic LNs of Apoe<sup>-/-</sup> and Ccr4<sup>-/-</sup> Apoe<sup>-/-</sup> mice, which would have been included as a Supplementary figure. These results reinforce the notion that Treg dysfunction in CCR4-deficient mice may not be due to the downregulation of regulatory cell surface receptors.

      We showed the data on the expression of CTLA-4, CD103, and PD1 in Tregs in the para-aortic LNs of Apoe<sup>-/-</sup> and Ccr4<sup>-/-</sup>Apoe<sup>-/-</sup> mice in Supplementary Figure 8.

      (4) Issue 5. I agree that CD4+ T cell responses are substantially regulated by DCs. While CD80 and CD86 on DC primarily serve as costimulatory signals for T-cell activation, cytokines secreted by DCs are primordial signals for determining the differentiation phenotype of effector Th cells. Since the analysis of DC phenotype in lymphoid tissues of Apoe<sup>-/-</sup> and Ccr4<sup>-/-</sup> Apoe<sup>-/-</sup> mice could not be addressed in this study, it is not possible to differentiate which processes may be mainly affected by CCR4-deficiency during CD4+ T cell activation. In this scenario, and considering in vitro studies, the results suggest a possible role of CCR4 in controlling the extent of activation of CD4+T cells rather than shifting the CD4+T cell differentiation profile in peripheral lymphoid tissues, where a predominant Th1 profile was already established in Apoe<sup>-/-</sup> mice. Therefore, I advise caution when concluding about shifts in CD4+ T cell responses.

      We thank the reviewer for providing thoughtful comments. As the reviewer pointed out, we understand that we should carefully interpret the mechanisms for the shift of CD4+ T cell responses by CCR4 deficiency.

      (5) Regarding migration studies in the revised manuscript. I fully understand that Treg transference assays are challenging. The results do not suggest that CCR4 was critical for Treg migration to lymphoid tissues in the conditions assayed. Concerning migration to the aorta, I found the results inconclusive since the authors mention that: i) there was a dramatic difference in the absolute numbers of Kaede-expressing Tregs that migrated to the aorta impairing statistical analysis; ii) the number of Kaede-expressing Tregs that migrated to the aorta was extremely low; iii) dead cells and doublets were not removed in the flow cytometry analysis. In this context, I do not agree with the following statements and recommend revising them:

      - "CCR4 deficiency in Tregs impaired their migration to the atherosclerotic aorta" (lines 36-7),

      - "…we found a significant reduction in the proportion of CCR4 deficient Kaede-expressing Tregs in the aorta of recipient Apoe<sup>-/-</sup> mice" (lines 356-7),

      - "CCR4 expression on Tregs regulates the development of early atherosclerosis by....... mediating Treg migration to the atherosclerotic aorta" (lines 409-411),

      - "…we found that CCR4 expression on Tregs is critical for regulating atherosclerosis by mediating their migration to the atherosclerotic aorta" (lines 437-438),

      - "CCR4 protects against early atherosclerosis by mediating Treg migration to the aorta...." (lines 464-465),

      - "We showed that CCR4 expression on Tregs is critical for ...... mediating Treg migration to the atherosclerotic aorta" (lines 503-505).

      We understand the importance of the reviewer’s suggestion. We described this issue as a limitation of this study. In accordance with the reviewer’s suggestion, we modified the above descriptions and toned down our claim that CCR4 expression on Tregs plays a critical role in mediating Treg migration to the atherosclerotic aorta under hypercholesterolemia.

      (6) Line 206: Mention the increased expression of CD86 by DCs

      We mentioned this result in the revised manuscript and added a discussion of the upregulated CD86 expression on DCs in Ccr4<sup>-/-</sup>Apoe<sup>-/-</sup> mice to the Discussion section.

      (7) Lines 304-305. According to Fig 4F-H, a selective accumulation of Th1 cells seems to have occurred only in the aorta, coinciding with a higher Th1/Treg ratio. No selective accumulation of Th1 cells was observed in para-aortic lymph nodes. These results could be clarified.

      We modified the above description in the revised manuscript.

    1. Author response:

      The following is the authors’ response to the previous reviews.

      Public Reviews

      Reviewer #1 (Public Review):

      Comment: The fact that some Arid1a transcripts escape the Cre system in the Arid1a KO mouse model might complicate the interpretation of the data. The phenotype of the Arid1a knockout is probably masked by the fact that many of the sequencing techniques used here are done on a heterogeneous population of knockout and wild type spermatocytes. In relation to this, I think that the use of the term "pachytene arrest" might be overstated, since this is not the phenotype truly observed. Knockout mice produce sperm, and probably litters, although a full description of the subfertility phenotype is lacking, along with identification of the stage at which cell death is happening by detection of apoptosis.

      Response: As the reviewer indicates, we did not observe a complete arrest at pachynema. In fact, the histology shows the presence of spermatids and sperm in seminiferous tubules and epididymides (Fig. S3). However, our data argue that the wild-type haploid gametes produced were derived from spermatocyte precursors that have likely escaped Cre-mediated activity (Fig. S4). Furthermore, diplotene and metaphase-I spermatocytes lacking ARID1A protein by IF were undetectable in the Arid1acKO testes (Fig. S4B). Therefore, although we do not demonstrate a strict pachytene arrest, it is reasonable to conclude that ARID1A is necessary to progress beyond pachynema. We have revised the manuscript to reflect this point (Abstract lines 17,18; Results lines 153,154).

      Comment: It is clear from this work that ARID1a is part of the protein network that contributes to silencing of the sex chromosomes. However, it is challenging to understand the timing of the role of ARID1a in the context of the well-known DDR pathways that have been described for MSCI.

      Response: With respect to the comment on the lack of clarity as to the stage of meiosis at which we observe cell death, our data suggest that it is reasonable to conclude that mutant spermatocytes (ARID1A-) undergo cell death at pachynema given their inability to execute MSCI, which is a well-established phenotype.

      Comment: Staining of chromosome spreads with Arid1a antibody showed localization at the sex chromosomes by diplonema; however, analysis of gene expression in Arid1a KO was performed on pachytene spermatocytes. Therefore, it is not very clear how the chromatin remodeling activity of Arid1a in diplonema affects gene expression at a previous stage. CUT&RUN showed that ARID1a is present at the sex chromatin in earlier stages, leading one to hypothesize that immunofluorescence with the ARID1a antibody might not reflect the real localization of ARID1a.

      Response: It is unclear what the reviewer means about not understanding how ARID1A activity at diplonema affects gene expression at earlier stages. Our interpretations were not based solely on the observation of ARID1A associations with the XY body at diplonema. In fact, mRNA expression and CUT&RUN analyses were performed on pachytene-enriched populations. ARID1A's association with the XY body is not exclusive to diplonema. Based on both CUT&RUN and IF data, ARID1A associates with XY chromatin as early as pachynema. Only at late diplonema did we observe ARID1A hyperaccumulation on the XY body by IF.

      Reviewer #2 (Public Review):

      Comment: The inefficient deletion of ARID1A in this mouse model does not allow any detailed analysis in a quantitative manner.

      Response: As explained in our response to these comments in the first revision, we respectfully disagree with this reviewer’s conclusions. We have been quantitative by co-staining for ARID1A, ensuring that we can score mutant pachytene spermatocytes from escapers. Additionally, we provide data to show the efficiency of ARID1A loss in the purified pachytene populations sampled in our genomic assays.

      Reviewer #3 (Public Review):

      Comment: The data demonstrate that the mutant cells fail to progress past pachytene, although it is unclear whether this specifically reflects pachytene arrest, as accumulation in other stages of Prophase also is suggested by the data in Table 1. The western blot showing ARID1A expression in WT vs. cKO spermatocytes (Fig. S2) is supportive of the cKO model but raises some questions. The blot shows many bands that are at lower intensity in the cKO, at MWs from 100-250kDa. The text and accompanying figure legend have limited information. Are the various bands with reduced expression different isoforms of ARID1A, or something else? What is the loading control 'NCL'? How was quantification done given the variation in signal across a large range of MWs?

      Response: The loading control is Nucleolin. With respect to the other bands in the range of 100-250 kDa, it is difficult to say whether they represent ARID1A isoforms. The Uniprot entry for Mouse ARID1A only indicates a large mol. wt sequence of ~242 kDa; therefore, the band corresponding to that size was quantified. There is no evidence to suggest that lower molecular weight isoforms may be translated. Although speculative, it is possible that the lower molecular weight bands represent proteolytic/proteasomal degradation products or products of antibody non-specificity. These points are addressed in the revised manuscript (Legend to Fig S2, lines 926-931). Blots were scanned on a LI-COR Odyssey CLx imager and viewed and quantified using Image Studio Version 5.2.5 (Methods, lines 640-642).

      Comment: An additional weakness relates to how the authors describe the relationship between ARID1A and DNA damage response (DDR) signaling. The authors don't see defects in a few DDR markers in ARID1A CKO cells (including a low-resolution assessment of ATR), suggesting that ARID1A may not be required for meiotic DDR signaling. However, as previously noted the data do not rule out the possibility that ARID1A is downstream of DDR signaling and the authors even indicate that "it is reasonable to hypothesize that DDR signaling might recruit BAF-A to the sex chromosomes (lines 509-510)." It therefore is difficult to understand why the authors continue to state that "...the mechanisms underlying ARID1A-mediated repression of the sex-linked transcription are mutually exclusive to DDR pathways regulating sex body formation" (p. 8) and that "BAF-A-mediated transcriptional repression of the sex chromosomes occurs independently of DDR signaling" (p. 16). The data provided do not justify these conclusions, as a role for DDR signaling upstream of ARID1A would mean that these mechanisms are not mutually exclusive or independent of one another.

      Response: The reviewer’s argument is reasonable, and we have made the recommended changes (Results, lines 212-215; Discussion, lines 499-500).

      Comment: A final comment relates to the impacts of ARID1A loss on DMC1 focus formation and the interesting observation of reduced sex chromosome association by DMC1. The authors additionally assess the related recombinase RAD51 and suggest that it is unaffected by ARID1A loss. However, only a single image of RAD51 staining in the cKO is provided (Fig. S11) and there are no associated quantitative data provided. The data are suggestive but it would be appropriate to add a qualifier to the conclusion regarding RAD51 in the discussion which states that "...loss of ARID1a decreases DMC1 foci on the XY chromosomes without affecting RAD51" given that the provided RAD51 data are not rigorous. In the long-term it also would be interesting to quantitatively examine DMC1 and RAD51 focus formation on autosomes as well.

      Response: We agree with the reviewer’s comment and have made the recommended changes (Discussion, lines 518-519).

      Response to non-public recommendations

      Reviewer 2:

      Comment: Meiotic arrest is usually judged based on testicular phenotypes. If mutant testes do not have any haploid spermatids, we can conclude that meiotic arrest is a phenotype. In this case, mutant testes have haploid spermatids and are fertile. The authors cannot conclude meiotic arrest. The mutant cells appear to undergo cell death in the pachytene stage, but the authors cannot say "meiotic arrest."

      Response: We disagree with this comment. By IF, we see that ~70% of the spermatocytes have deleted ARID1A. Furthermore, we never observed diplotene spermatocytes that lacked ARID1A. The conclusion that the absence of ARID1A results in a pachynema arrest and that the escapers produce the haploid spermatids is firm.

      Comment: Fig. S2 and S3 have wrong figure legends.

      Response: The figure legends for Fig. S2 and S3 are correct.

      Comment: The authors do not appear to evaluate independent mice for scoring (the result is about 74% deletion above, Table S1). Sup S2: how many independent mice did the authors examine?

      Response: These were Sta-Put purified fractions obtained from 14-15 WT and mutant mice. It is difficult to isolate pachytene spermatocytes by Sta-Put at the required purity in sufficient yields using one mouse at a time. We used three technical replicates to quantify the band intensity, and the error bars represent the standard error of the mean (S.E.M.) of the band intensity.
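
      As an illustration of the error bars described above, the S.E.M. of n technical replicates is the sample standard deviation divided by the square root of n. The sketch below is not the authors' quantification pipeline, and the band intensities are hypothetical placeholders, not values from the study.

```python
import math
import statistics


def standard_error_of_mean(values):
    """Return the S.E.M.: sample standard deviation divided by sqrt(n)."""
    if len(values) < 2:
        raise ValueError("at least two replicates are needed to estimate the S.E.M.")
    return statistics.stdev(values) / math.sqrt(len(values))


if __name__ == "__main__":
    # HYPOTHETICAL normalized band intensities from three technical replicates.
    replicate_intensities = [0.28, 0.31, 0.25]
    mean_intensity = statistics.mean(replicate_intensities)
    sem = standard_error_of_mean(replicate_intensities)
    print(f"mean = {mean_intensity:.3f}, S.E.M. = {sem:.3f} (n = {len(replicate_intensities)})")
```

      With only three technical replicates, the S.E.M. reflects measurement variability of the assay rather than variability between animals.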

      Comment: "Comparison of cKO and wild-type littermate yielded nearly identical results (Avg total conc WT = 32.65 M/ml; Avg total conc cKO = 32.06 M/ml)". This sounds like a negative result (i.e., no difference between WT and cKO).

      Response: This is correct. There is no difference between Arid1aWT and Arid1aCKO sperm production. This is because wild-type haploid gametes produced were derived from spermatocyte precursors that have escaped Cre-mediated activity (Fig. S4). These data merely serve to highlight an inherent caveat of our conditional knockout model and are not intended to support the main conclusion that ARID1A is necessary for pachytene progression.

      Comment: The authors now admit ~ 70 % efficiency in deletion, and the authors did not show the purity of these samples. If the purity of pachytene spermatocytes is ~ 80%, the real proportion of mutant cells can be ~ 56%. It is very difficult to interpret the data.

      Response: The original submission did refer to inefficient Cre-induced recombination. The reviewer asked for the % efficiency, which was provided in the revised version. Also, please refer to Fig. S2, where Western blot analysis demonstrates a significant loss of ARID1A protein levels in CKO relative to WT pachytene spermatocyte populations that were used for CUT&RUN data generation.
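
      The estimate in the comment above multiplies the deletion efficiency by the purity of the isolated population. A minimal sketch of that arithmetic follows, assuming the two fractions are independent; that assumption is made here for illustration only and ignores any enrichment or loss of mutant cells during purification.

```python
def effective_mutant_fraction(deletion_efficiency: float, purity: float) -> float:
    """Fraction of all cells in a sample that are mutant cells of the target type.

    ASSUMPTION: deletion efficiency and purity are independent, so they multiply.
    """
    for name, value in (("deletion_efficiency", deletion_efficiency), ("purity", purity)):
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} must be between 0 and 1, got {value}")
    return deletion_efficiency * purity


if __name__ == "__main__":
    # Values quoted in the comment: ~70% deletion efficiency, ~80% purity.
    fraction = effective_mutant_fraction(0.70, 0.80)
    print(f"Effective mutant fraction: {fraction:.0%}")
```

      The 80% purity is the reviewer's hypothetical figure, so the resulting proportion is an illustration of the concern rather than a measured value.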

      Comment: The authors should not use the other study to justify their own data. The H3.3 ChIP-seq data in the NAR paper detected clear peaks on autosomes. However, in this study, as shown in Fig. S7A, the authors detected only 4 peaks on autosomes based on MACS2 peak calling. This must be a failed experiment. Also, S7A appears to have labeling errors.

      Response: We believe the reviewer is referring to Supplementary Figure 8A. It is not clear which labeling errors the reviewer is referring to. In the wild type, the identified peaks were overwhelmingly sex-linked intergenic sites. This is consistent with the fact that H3.3 is hyper-accumulated on the sex chromosomes at pachynema.

      The authors of the NAR paper did not perform a peak-calling analysis using MACS2 or any other peak-calling algorithm. They merely compared the coverage of H3.3 relative to input. Therefore, it is not clear on what basis the reviewer says that the NAR paper identified autosomal peaks. Their H3.3 signal appears widely distributed over a 6 kb window centered at the TSS of autosomal genes, which, compared to input, appears enriched. Our data clearly demonstrate a less noisy and narrower window of H3.3 enrichment at autosomal TSSs in WT pachytene spermatocytes, albeit at levels lower than that seen in CKO pachytene spermatocytes (Fig S8B and see data copied below for each individual replicate). Moreover, the lack of peaks does not mean that there was an absence of H3.3 at these autosomal TSSs (Supp. Fig. S8B). Therefore, we disagree with the reviewer’s comment that the H3.3 CUT&RUN was a failed experiment.

      Author response image 1.

      H3.3 Occupancy at genes mis-regulated in the absence of ARID1A

      Comment: If the author wishes to study the function of ARID2 in spermatogenesis, they may need to try other cre-lines to have more robust phenotypes, and all analyses must be redone using a mouse model with efficient deletion of ARID2.

      Response: As noted, we chose Stra8-Cre to conditionally knock out Arid1a because ARID1A is haploinsufficient during embryonic development. The lack of Cre expression in the maternal germline allows for transmission of the floxed allele, allowing for the experiments to progress.

      Comment: The inefficient deletion of ARID1A in this mouse model does not allow any detailed analysis in a quantitative manner.

      Response: In many experiments, we have been quantitative when possible by co-staining for ARID1A, ensuring that we can score mutant pachytene spermatocytes from escapers. Additionally, we provide data to show the efficiency of ARID1A loss in the purified pachytene populations sampled in our genomic assays.

      Reviewer 3:

      Comment: The Methods section refers to antibodies as being in Supplementary Table 3, but the table is labeled as Supplementary Table 2.

      Response: This has been corrected.

    1. Author response:

      The following is the authors’ response to the previous reviews.

      Public Reviews:

      Reviewer #1 (Public Review):

      Although this manuscript contains a potentially interesting piece of work that delineates a mechanism by which IQCH is associated with spermatogenesis, this reviewer feels that a number of issues require clarification and re-evaluation for a better understanding of the role of IQCH in spermatogenesis. Given the gaps in logic and supporting data, the causal relationships among IQCH, CaM, and HNRPAB are still not clear. The most serious point in this manuscript could be that the authors try to generalize their interpretations with an overly simplified model based on limited data. The way the data and the logic are presented needs to be largely revised, and several interpretations should be supported by direct evidence.

      Response: We thank the reviewer for the comment. IQCH is a calmodulin-binding protein, and the binding of IQCH to CaM was confirmed by LC-MS/MS analysis and a co-IP assay using sperm lysate. We therefore speculated that the interaction between IQCH and CaM might be a prerequisite for IQCH function. To test this speculation, we took HNRPAB as an example. When we knocked down IQCH in cultured cells, the expression of HNRPAB decreased. Similarly, when we knocked down CaM in cultured cells, the expression of HNRPAB also decreased. However, these results cannot exclude the possibility that IQCH or CaM regulates HNRPAB expression alone. To investigate this, we overexpressed IQCH in CaM-knockdown cells; HNRPAB expression was not rescued, suggesting that IQCH cannot regulate HNRPAB expression when CaM is reduced. Consistently, we overexpressed CaM in IQCH-knockdown cells; HNRPAB expression was again not rescued, suggesting that CaM cannot regulate HNRPAB expression when IQCH is reduced. Thus, neither IQCH nor CaM can regulate HNRPAB expression alone. Moreover, we deleted the IQ motif of IQCH, which is required for binding to CaM. The co-IP results showed that deleting the IQ motif disrupted the interaction between IQCH and CaM and that the expression of HNRPAB decreased. We therefore suggest that the interaction between IQCH and CaM might be required for IQCH to regulate HNRPAB. In future studies, we will further investigate the relationships among IQCH, CaM, and HNRPAB.

      Reviewer #3 (Public Review):

      (1) More background details are needed regarding the proteins involved, in particular IQ proteins and calmodulin. The authors state that IQ proteins are not well-represented in the literature, but do not state how many IQ proteins are encoded in the genome. They also do not provide specifics regarding which calmodulins are involved, since there are at least 5 family members in mice and humans. This information could help provide more granular details about the mechanism to the reader and help place the findings in context.

      Response: We thank the reviewer for the suggestion. We have provided additional background information on IQ-containing protein family members in humans and mice, as well as on other IQ-containing proteins implicated in male fertility, in the Introduction section. Furthermore, we have supplemented the Introduction with background information on the association between CaM and male infertility.

      (2) The mouse fertility tests could be improved with more depth and rigor. There was no data regarding copulatory plug rate; data was unclear regarding how many WT females were used for the male breeding tests and how many litters were generated; the general methodology used for the breeding tests in the Methods section was not very explicitly or clearly described; the sample size of n=3 for the male breeding tests is rather small for that type of assay; and, given that IQCH appears to be expressed in testicular interstitial cells (Fig. S10) and somewhat in other organs (Fig. S2), another important parameter of male fertility that should be addressed is reproductive hormone levels (e.g., LH, FSH, and testosterone). While normal epididymal size in Fig. S3 suggests that hormone (testosterone) levels are normal, epididymal size and/or weight were not rigorously quantified.

      Response: We thank the reviewer for the comment. We have provided the data on copulatory plug rate and the average number of litters for the breeding tests in the revised Figure 3—figure supplement 2. The methodology used for the breeding tests is now described in more detail and more explicitly in the revised Methods section. Moreover, we have increased the sample size for male breeding tests to n=6. We measured the serum levels of FSH, LH, and testosterone in the WT (9.3±1.9 ng/ml, 0.93±0.15 ng/ml, and 0.2±0.03 ng/ml) and Iqch KO mice (12±2 ng/ml, 1.17±0.2 ng/ml, and 0.2±0.04 ng/ml). There was no significant difference in the serum levels of reproductive hormones between WT and Iqch KO mice; therefore, we did not include the data in the study. Furthermore, we have added quantitative data on epididymal size in the revised Figure 3—figure supplement 2.

      (3) The Western blots in Figure 6 should be rigorously quantified from multiple independent experiments so that there is stronger evidence supporting claims based on those assays.

      Response: We appreciate the reviewer's comment. As suggested, we have added quantified data in Figure 6—figure supplement 2 from the results of Western blotting in Figure 6.

      (4) Some of the mouse testis images could be improved. For example, the PNA and PLCz images in Figure S7 are difficult to interpret in that the tubules do not appear to be stage-matched, and since the authors claimed that testicular histology is unaffected in knockout testes, it should be feasible to stage-match control and knockout samples. Also, the anti-IQCH and CaM immunofluorescence in Figure S10 would benefit from some cell-type-specific co-stains to more rigorously define their expression patterns, and they should also be stage-matched.

      Response: We thank the reviewer for the suggestions. We have included immunofluorescence images of PLCz, PNA, IQCH, and CaM staining across the stages of spermatogenic development.

      Recommendations for the authors:

      Reviewer #1 (Recommendations For The Authors):

      (1) There are multiple grammatical errors and statements drawn beyond the results. The entire manuscript would benefit from professional editing.

      Response: We are sorry for the grammatical errors. We have enlisted professional editing services to refine our manuscript.

      (2) Line 40, "Firstly" is not appropriate here.

      Response: We thank the reviewer for the comment. The word "Firstly" has been removed from the revised manuscript.

      (3) Line 44, "processes".

      Response: We thank the reviewer for the suggestion. We have changed “process” to “processes” on line 45.

      (4) "spermatocytogenesis (mitosis)" is incorrect.

      Response: We thank the reviewer for the comment. We have changed “spermatocytogenesis (mitosis)” to “mitosis” on line 47.

      (5) Ca and Ca2+ are both used in line 67 - 77. Be consistent.

      Response: We appreciate the reviewer's detailed checks. We have ensured consistency by changing every instance of "Ca" to "Ca2+" in the revised manuscript.

      (6) Line 238 to 240, "To elucidate the molecular mechanism by which IQCH regulates male fertility, we performed liquid chromatography tandem mass spectrometry (LC-MS/MS) analysis using mouse sperm lysates and detected 288 interactors of IQCH (Data S1)." It is not clear how LC-MS/MS using mouse sperm lysates could detect "288 interactors of IQCH". A co-IP experiment for IQCH using sperm lysates prior to LC-MS/MS is needed to detect "interactors of IQCH". However, in the Methods section, consistent with the main text, proteomic quantification was conducted for protein extract from sperm. The figure legend for Fig. 5 did not explain this, either. Thus, it is not possible to evaluate Figure 5.

      Response: We sincerely apologize for the oversight. Following the reviewer’s suggestions, we have added the details of the LC-MS/MS experiment to the Methods section of the revised manuscript. Additionally, we conducted a co-IP experiment for IQCH using sperm lysates prior to LC-MS/MS; we did not include the corresponding figure in the manuscript. The results are as follows:

      Author response image 1.

      The results of a co-IP experiment for IQCH using sperm lysates from WT mice.

      (7) Line 246, "... key proteins that might be activated by IQCH". What does "activated" here refer to? Should it be "upregulated"?

      Response: We apologize for our inexact statement. "Upregulated" better conveys the intended meaning. In accordance with the reviewer’s suggestion, we have changed "activated" to "upregulated".

      (8) Line 252 to 254, "the cross-analysis revealed that 76 proteins were shared between the IQCH-bound proteins and the IQCH-activated proteins (Fig. 5E), implicating this subset of genes as direct targets." This is a confusing statement. Is the author trying to say, IQCH-bound proteins have upregulated expression, suggesting that IQCH enhances their expression?

      Response: We appreciate the reviewer's comment regarding the clarity of the statement in lines 252 to 254 of the manuscript. We have changed this sentence to “Importantly, cross-analysis revealed that 76 proteins were shared between the IQCH-bound proteins and the downregulated proteins in Iqch KO mice (Figure 5E), suggesting that IQCH might regulate their expression by the interaction.”

      (9) Line 260 to 261, "SYNCRIP, HNRNPK, FUS, EWSR1, ANXA7, SLC25A4, and HNRPAB ... the loss of which showed the greatest influence on the phenotype of the Iqch KO mice." There is no evidence suggesting that the loss of SYNCRIP, HNRNPK, FUS, EWSR1, ANXA7, SLC25A4, and HNRPAB leads to Iqch KO phenotype.

      Response: We apologize for our inaccurate statement. According to the literature, Fus KO, Ewsr1 KO, and Hnrnpk KO male mice are infertile, showing spermatogenic arrest with an absence of spermatozoa (Kuroda et al. 2000; Tian et al. 2021; Xu et al. 2022). Syncrip is involved in the meiotic process in Drosophila by interacting with Doublefault (Sechi et al. 2019). HNRPAB might be associated with mouse spermatogenesis by binding to Protamine 2 and contributing to its translational regulation. ANXA7 is a calcium-dependent phospholipid-binding protein that is a negative regulator of mitochondrial apoptosis (Du et al. 2015). Loss of SLC25A4 results in mitochondrial energy metabolism defects in mice (Graham et al. 1997). Moreover, RNA immunoprecipitation on formaldehyde cross-linked sperm followed by qPCR detected interactions between HNRPAB and Catsper1, Catsper2, Catsper3, Ccdc40, Ccdc39, Ccdc65, Dnah8, Lrrc6, and Dnhd1, which are essential for sperm development (Fukuda et al. 2013). Our Iqch KO mice showed abnormal sperm count, motility, morphology, and mitochondria, so we inferred that IQCH might play a role in spermatogenesis by regulating the expression of SYNCRIP, HNRNPK, FUS, EWSR1, ANXA7, SLC25A4, and HNRPAB to some extent. We have changed the statement to “We focused on SYNCRIP, HNRNPK, FUS, EWSR1, ANXA7, SLC25A4, and HNRPAB, which play important roles in spermatogenesis.”

      (10) Fig. 6C and 6D use different styles of error bars.

      Response: We are sorry for our oversight. In accordance with the reviewer's recommendations, we have modified the representation of error bars in the revised Fig. 6C.

      (11) Line 296 to 297, "As expected, CaM interacted with IQCH, as indicated by LC-MS/MS analysis". It is not clear how LC-MS/MS detects protein interaction.

      Response: Following the reviewer’s suggestions, we have added the details of the LC-MS/MS experiment to the Methods section of the revised manuscript. The proteins found to interact with IQCH in sperm lysates by LC-MS/MS analysis are provided as Figure 5—source data 1.

      (12) It is still not clear how the interaction between IQCH, CaM, and HNRPAB is required for the expression of each other.

      Response: We thank the reviewer for the comment. IQCH is a calmodulin-binding protein, and the binding of IQCH to CaM was confirmed by LC-MS/MS analysis and a co-IP assay using sperm lysate. We therefore speculated that the interaction between IQCH and CaM might be a prerequisite for IQCH function. To test this speculation, we took HNRPAB as an example. When we knocked down IQCH in cultured cells, the expression of HNRPAB decreased. Similarly, when we knocked down CaM in cultured cells, the expression of HNRPAB also decreased. However, these results cannot exclude the possibility that IQCH or CaM regulates HNRPAB expression alone. To investigate this, we overexpressed IQCH in CaM-knockdown cells; HNRPAB expression was not rescued, suggesting that IQCH cannot regulate HNRPAB expression when CaM is reduced. Consistently, we overexpressed CaM in IQCH-knockdown cells; HNRPAB expression was again not rescued, suggesting that CaM cannot regulate HNRPAB expression when IQCH is reduced. Thus, neither IQCH nor CaM can regulate HNRPAB expression alone. Moreover, we deleted the IQ motif of IQCH, which is required for binding to CaM. The co-IP results showed that deleting the IQ motif disrupted the interaction between IQCH and CaM and that the expression of HNRPAB decreased. We therefore suggest that the interaction between IQCH and CaM might be required for IQCH to regulate HNRPAB. In future studies, we will further investigate the relationships among IQCH, CaM, and HNRPAB.

      Reviewer #3 (Recommendations For The Authors):

      The authors have addressed my minor concerns. However, they neglected to address any of my more significant concerns in the public review. I assume that they simply overlooked these critiques, despite the fact that eLife explicitly states that "...as a general rule, concerns about a claim not being justified by the data should be explained in the public review." Therefore, the authors should have looked more carefully at the public reviews. As a result, my major concerns about the manuscript remain.

      Response: We apologize for overlooking the public review process. We have improved our study based on the feedback received during the public review.

    1. Author response:

      The following is the authors’ response to the previous reviews.

      Reviewer #1 (Recommendations For The Authors):

      The additional data included in this revision nicely strengthens the major claim.

      I apologize that my comment about K+ concentration in the prior review was unclear. The cryoEM structure of KCNQ1 with S4 in the resting state was obtained with lowered K+ relative to the active state. Throughout the results and discussion it seems implied that the change in voltage sensor state is somehow causative of the change in selectivity filter state while the paper that identified the structures attributes the change in selectivity filter state not to voltage sensors, but to the change in [K+] between the 2 structures. Unless there is a flaw in my understanding of the conditions in which the selectivity filter structures used in modeling were generated, it seems misleading to ignore the change in [K+] when referring to the activated vs resting or up vs down structures. My understanding is that the closed conformation adopted in the resting/low [K+] is similar to that observed in low [K+] previously and is more commonly associated with [K+]-dependent inactivation, not resulting from voltage sensor deactivation as implied here. The original article presenting the low [K+] structure also suggests this. When discussing conformational changes in the selectivity filter, I strongly suggest referring to these structures as activated/high [K+] vs resting/low [K+] or something similar, as the [K+] concentration is a salient variable.

      There seems to be some major confusion here, and we will try to explain our thinking. Note that in the Mandala and MacKinnon paper, there is no significant difference in the amino acid positions in the selectivity filter between low and high K+ when S4 is in the activated position (see Mandala and MacKinnon, PNAS, Suppl. Fig. S5 C and D). There are only fewer K+ ions in the selectivity filter in low K+. So, the structure with the distorted selectivity filter is not due to low K+ by itself. Note that there is no real difference between macroscopic currents recorded in low and high K+ solutions (except what is expected from changes in driving force) for KCNQ1/KCNE1 channels (Larsen et al., Biophys J 2011), suggesting that low K+ does not promote the non-conductive state (Figure 1). We now include a section in the Discussion about high/low K+ in the structures and the absence of effects of K+ on the function of KCNQ1/KCNE1 channels.

      Author response image 1.

      Macroscopic KCNQ1/KCNE1 currents recorded in different K+ conditions. Note that there is no difference between currents recorded in low K+ (2 mM) and high K+ (96 mM) conditions (n=3 oocytes). Currents were normalized with respect to high K+.

      Note also that, in the previous version of the manuscript, we did not propose that the position of S4 is what determines the state of the selectivity filter. We only reported that the cryoEM structure with S4 resting shows a distorted selectivity filter. It seems that our text led the reviewer to think that we proposed that S4 determines the state of the selectivity filter, which we did not. We previously did not want to speculate too much about this, but we have now included a section in the Discussion to make our view clear in light of the reviewers' confusion.

      It is clear from our data that the majority of sweeps are empty (which we assume is with S4 up), suggesting that the selectivity filter can be (and is in the majority of sweeps) in the non-conducting state even with S4 up. We think that the selectivity filter switches between a non-conductive and a conductive conformation both with S4 down and with S4 up. The cryoEM structure in low K+ and S4 down just happened to catch the non-conductive state of the selectivity filter. We have now added a section in the Discussion to clarify all this and explain how we think it works.

      However, S4 in the active conformation seems to stabilize the conductive conformation of the selectivity filter, because during long pulses the channel seems to stay open once opened (See Suppl Fig S2). So, one possibility is that the selectivity filter goes more readily into the non-conductive state when S4 is down (and maybe, or not, low K+ plays a role) and then when S4 moves up the selectivity filter sometimes recovers into the conductive state and stays there. We now have included a section in the Discussion to present our view. Since this whole discussion was initiated and pushed by the reviewer, we hope that the reviewers will not demand more data to support these ideas. We think that this addition makes sense since other readers might have the same questions and ideas as the reviewer, and we would like to prevent any confusion about this topic.

      Figure 1

      It remains unclear in the manuscript itself what "control" refers to. Are control patches the same patches that later receive LG?

      Yes, the control means the same patch before LG. We now indicate that in legends and text throughout.

      Supplementary Figure S1

      Unclear if any changes occur after addition of LG in left panel and if the LG data on right is paired in any way to data on left.

      Yes, in all cases the left and right panel in all figures are from the same patch. We now indicate that in legends and text throughout.

      The letter p is used both to represent open probability from the all-point amplitude histogram and as a p-value statistical probability indicator, sometimes lower case, sometimes upper case. This was confusing.

      We now exclusively use lower case p for statistical probability and Po for open probability.

      "This indicates that mutations of residues in the more intracellular region of the selectivity filter do not affect the Gmax increases and that the interactions that stabilize the channel involve only residues located near the external region part of the selectivity filter. "

      Seems too strongly worded, it remains possible that mutations of other residues in the more intracellular region of the selectivity filter could affect the Gmax increases.

      We have changed the text to: "Mutations of residues in the more intracellular region of the selectivity filter do not affect the Gmax increases, as if the interactions that stabilize the channel involve residues located near the external region part of the selectivity filter. "

      Supplementary Figure S7

      Please report Boltzmann fit parameters. What are "normalized" uA?

      We removed the uA, which was mistakenly inserted. The lines in the graphs are just lines connecting the dots and not Boltzmann fits, since we don’t have saturating curves in all panels to make unique fits.

      "We have previously shown that the effects of PUFAs on IKs channels involve the binding of PUFAs to two independent sites." Was binding to the sites actually shown? Suggest changing to: "We have previously proposed models in which the effects of PUFAs..."

      We have now changed this as the Reviewer suggested: "We have previously proposed models in which the effects of PUFAs on IKs channels involve the binding of PUFAs to two independent sites."

      Statistics used not always clear. Methods refer to multiple statistical tests but it is not clear which is used when.

      We use two different tests and it is now explained in figure legends when either was used.

      n values confusing. Sometimes # of sweeps used as n. Sometimes # patches used as n. In one instance "The average current during the single channel sweeps was increased by 2.3 ± 0.33 times (n = 4 patches, p =0.0006)" ...this seems a low p value for this n=4 sample?

      We have now more clearly indicated what n stands for in each case. There was an extra 0 in the p value, so now it is p = 0.006. Thanks for catching that error.

      Reviewer #2 (Recommendations For The Authors):

      I still have some comments for the revised manuscript.

      (1) (From the previous minor point #6) Since D317E and T309S did not show statistical significance in Figure 5A, the sentences such as "This data shows that Y315 and D317 are necessary for the ability of Lin-Glycine to increase Gmax" or "the effect of Lin-Glycine on Gmax of the KCNQ1/KCNE1 mutant was noticeably reduced compared to the WT channel showing the this residue contributes to the Gmax effect (Figure 5A)." may need to be toned down. Alternatively, I suggest the authors refer to Supplementary Figure S7 to confirm that Y315 and D317 are critical for increasing Gmax.

      We have redone the analysis and statistical evaluation in Fig 5. We now use the more appropriate value of the fitted Gmax (which uses the whole dose response curve instead of only the 20 mM value) in the statistical evaluation, and now Y315F and D317E are statistically different from wt.

      (2) Supplementary Fig. S1. All control diary plots include the green arrows to indicate the timing of lin-glycine (LG) application. It is a bit confusing why they are included. Is it to show that LG application did not have an immediate effect? Are the LG-free plots not available?

      We are not sure what the Reviewer is asking. In the previous review round the Reviewers asked specifically for this. The arrow shows when LG was applied and the plot on the right shows the effect of LG from the same patch.

      (3) The legend to Supplementary Figure S4, "The side chain of residues ... are highlighted as sticks and colored based on the atomic displacement values, from white to blue to red on a scale of 0 to 9 Å." They look mostly blue (or light blue). Which one is colored white? It might be better to use a different color code. It would also be nice to link the color code to the colors of Supplementary Figure S5, which currently uses a single color.

      We have removed “from white to blue to red on a scale of 0 to 9 Å” and instead now include a color scale directly in Fig S4 to show how much each atom moved based on the color.

      We feel it is not necessary to include color in Fig S5 since the scale of how much each atom moves is shown on the y axis.

      (4) Add unit (pA) to the y-axis of Supplementary Figure S2.

      pA has been added.

      Reviewer #3 (Recommendations For The Authors):

      Some issues on how data support conclusions are identified. Further justifications are suggested.

      186: "The decrease in first latency is most likely due to an effect of Lin-Glycine on Site I in the VSD and related to the shift in voltage dependence caused by Lin-Glycine." The results in Fig S1B do not seem to support this statement since the mutation Y315F in the pore helix seemed to have eliminated the effect of Lin-Glycine in reducing first latency. The authors may want to show that a mutation eliminating Site I would eliminate the effect of Lin-Glycine on first latency. On the other hand, it would also be interesting to examine whether another pore mutation, such as P320L (Fig 5), also reduces the effect of Lin-Glycine on first latency.

      These experiments are very hard and laborious, and we feel these are outside the scope of this paper which focuses on Site II and the mechanism of increasing Gmax. Further studies of the voltage shift and latency will have to be for a future study.

      The mutation D317E did not affect the effect of Lin-Glycine on Gmax significantly (Fig 5A, and Fig S7F comparing with Fig S7A), but the authors conclude that D317 is important for Lin-Glycine association. This conclusion needs a better justification.

      We have redone the analysis and statistical evaluation in Fig 5. We now use the more appropriate value of the fitted Gmax (which uses the whole dose response curve instead of only the 20 mM value) in the statistical evaluation, and now D317E is statistically different from wt.

    1. Author response:

      The following is the authors’ response to the previous reviews.

      As you can see from the assessment (which is unchanged from before) and the reviews included below, the reviewers felt that the revisions did not yet address all of the major concerns. There was agreement that the strength of evidence would be upgraded to "solid" by addressing, at minimum, the following: 

      (1) Which of the results are significant for individual monkeys; and 

      (2) How trials from different target contrasts were analyzed 

      In this revision, we have addressed the two primary editorial recommendations:

      (1) We apologize if this information was not clear in the previous version. We have updated Table 1 to highlight clearly the significant results for individual monkeys. Six of our key results – pupil diameter (Fig 2B), microsaccades (Fig 2D), decoding performance for narrow-spiking units (Fig 3A), decoding performance for broad-spiking units (Fig 3B), target-evoked firing rate for all units (Fig 3E) and target-evoked firing rate for broad-spiking units (Fig 3F) – are significant for individual animals and therefore give us high confidence in our results. Please also note that we present all results for individual animals in the Supplementary figures accompanying each main figure.

      (2) We have updated the manuscript and methods to explain how trials of each contrast were included in each analysis, and how contrast normalization was performed for the analysis in Figure 3. In addition, we discuss this point in the Discussion section, which we quote below:

      “Non-target stimulus contrasts were slightly different between hits and misses (mean: 33.1% in hits, 34.0% in misses, permutation test, 𝑝 = 0.02), but the contrast of the target was higher in hits compared to misses (mean: 38.7% in hits, 27.7% in misses, permutation test, 𝑝 = 1.6e-31). To control for potential effects of stimulus contrast, firing rates were first normalized by contrast before performing the analyses reported in Figure 3. For all other results, we considered only non-target stimuli, which had very minor differences in contrast (<1%) across hits and misses. In fact, this minor difference was in the opposite direction of our results with mean contrast being slightly higher for misses. While we cannot completely rule out any other effects of stimulus contrast, the normalization in Figure 3 and minor differences for non-target stimuli should minimize them.”
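      For readers unfamiliar with the permutation tests mentioned in the quoted passage, the following is a minimal sketch of a two-sided permutation test on a difference of means. It is not the analysis code used in the study: the function name `permutation_test`, the number of permutations, the seed and the contrast values in the usage note are all illustrative assumptions.

```python
import random


def permutation_test(a, b, n_perm=10000, seed=0):
    """Two-sided permutation test on the difference of group means.

    Hypothetical helper for illustration only; not the authors' code.
    Returns an estimated p-value with the usual +1 correction.
    """
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    n_a = len(a)
    pooled = list(a) + list(b)
    n_b = len(pooled) - n_a

    # Observed absolute difference between the two group means.
    observed = abs(sum(a) / n_a - sum(b) / n_b)

    # Shuffle the group labels and count how often the shuffled
    # difference is at least as large as the observed one.
    count = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)
        diff = abs(sum(pooled[:n_a]) / n_a - sum(pooled[n_a:]) / n_b)
        if diff >= observed:
            count += 1

    return (count + 1) / (n_perm + 1)
```

      As a usage example with made-up contrast values (percent), `permutation_test([38, 39, 40, 41, 37, 39, 40, 38], [27, 28, 26, 29, 27, 28, 26, 28])` returns a small p-value because the two groups do not overlap, whereas two identical groups give a p-value of 1.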

      Reviewer #1 (Public Review): 

      Summary: 

      In this study, Nandy and colleagues examine neural, physiological and behavioral correlates of perceptual variability in monkeys performing a visual change detection task. They used a laminar probe to record from area V4 while two macaque monkeys detected a small change in stimulus orientation that occurred at a random time in one of two locations, focusing their analysis on stimulus conditions where the animal was equally likely to detect (hit) or not-detect (miss) a briefly presented orientation change (target). They discovered two behavioral and physiological measures that are significantly different between hit and miss trials - pupil size tends to be slightly larger on hits vs. misses, and monkeys are more likely to miss the target on trials in which they made a microsaccade shortly before target onset. They also examined multiple measures of neural activity across the cortical layers and found some measures that are significantly different between hits and misses. 

      Strengths: 

      Overall the study is well executed and the analyses are appropriate (though several issues still need to be addressed as discussed in Specific Comments). 

      Thank you.

      Weaknesses: 

      My main concern with this study is that, with the exception of the pre-target microsaccades, the correlates of perceptual variability (differences between hits and misses) appear to be weak, potentially unreliable and disconnected. The GLM analysis of predictive power of trial outcome based on the behavioral and neural measures is only discussed at the end of the paper. This analysis shows that some of the measures have no significant predictive power, while others cannot be examined using the GLM analysis because these measures cannot be estimated in single trials. Given these weak and disconnected effects, my overall sense is that the current results provide limited advance to our understanding of the neural basis of perceptual variability. 

      Please see our response above to item #1 of the editorial recommendation. Six of our key results are individually significant in both animals, giving us high confidence in the reliability and strength of our results.

      Regarding the reviewer’s comment about the GLM, we note (also stated in the manuscript) that among the measures that we could estimate reliably on a single trial basis, two of these – pre-target microsaccades and input-layer firing rates – were reliable signatures of stimulus perception at threshold. This analysis does not imply that the other measures – Fano Factor, PPC, inter-laminar population correlations, SSC (which are all standard tools in modern systems neuroscience, and which cannot be estimated on a single-trial basis) – are irrelevant. Our intent in including the GLM analyses was to complement the results reported from these across-trial measures (Figs 4-7) with the predictive power of single-trial measures.

      While no study is entirely complete in itself, we have attempted to synthesize our results into a conceptual model as depicted in Fig 8.

      Reviewer #2 (Public Review): 

      Strengths: 

      The experiments were well-designed and executed with meticulous control. The analyses of both behavioural and electrophysiological data align with the standards in the field. 

      Thank you.

      Weaknesses: 

      Many of the findings appear to be subtle differences and incremental compared to previous literature, including the authors' own work. While incremental findings are not necessarily a problem, the manuscript lacks clear statements about the extent to which the dataset, analysis, and findings overlap with the authors' prior research. For example, one of the main findings, which suggests that V4 neurons exhibit larger visual responses in hit trials (as shown in Fig. 3), appears to have been previously reported in their 2017 paper. 

      We respectfully disagree with the assessment that the findings reported here are incremental over the results reported in our prior study (Nandy et al., 2017). In the previous study, we compared the laminar profile of neural modulation due to the deployment of attention, i.e. the main comparison points were the attend-in and the attend-away conditions while controlling for visual stimulation. In this study, we go one step further and home in on the attend-in condition and investigate the differences in the laminar profile of neural activity (and two additional physiological measures: pupil and microsaccades) when the animal either correctly reports or fails to report a stimulus with equal probability. We thus control for both the visual stimulation and the cued attention state of the animal. While there are parallels to our previous results (as the reviewer correctly noted), the results reported here cannot be trivially predicted from our previous results. Please also note that we discuss our new results in the context of prior results, from both our group and others, in the manuscript (lines 310-332).

      Furthermore, the manuscript does not explore potentially interesting aspects of the dataset. For instance, the authors could have investigated instances where monkeys made 'false' reports, such as executing saccades towards visual stimuli when no orientation change occurred, which allows for a broader analysis that considers the perceptual component of neural activity over pure sensory responses. Overall, the manuscript lacks broad interest in its current form.

      We appreciate the reviewer’s feedback on analyzing false alarm trials. Our focus for this study was to investigate the behavioral and neural correlates accompanying a correct or incorrect perception of a target stimulus presented at perceptual threshold. False alarm trials, by definition, do not include a target presentation. Moreover, false alarm rates rapidly decline with duration into a trial, with high rates during the first non-target presentation and rates close to zero by the time of the eighth presentation (see Author response image 1). Investigating false alarms will thus involve a completely different form of analysis than we have undertaken here. We therefore feel that while analyzing false alarm trials will be an interesting avenue to pursue in the future, it is outside the scope of the present study.

      Author response image 1.

    1. Author response:

      The following is the authors’ response to the previous reviews.

      eLife assessment

      This useful study tests the hypothesis that Mycobacterium tuberculosis infection increases glycolysis in monocytes, which alters their capacity to migrate to lymph nodes as monocyte-derived dendritic cells. The authors conclude that infected monocytes are metabolically pre-conditioned to differentiate, with reduced expression of Hif1a and a glycolytically exhaustive phenotype, resulting in low migratory and immunologic potential. However, the evidence is incomplete as the use of live and dead mycobacteria still limits the ability to draw firm conclusions. The study will be of interest to microbiologists and infectious disease scientists.

      In response to the general eLife assessment, we would like to emphasize that the study did not deal with “infected monocytes” per se but rather with monocytes purified from patients with active TB. We show that monocytes purified from these TB patients (versus healthy controls) differentiate into DCs with different migratory capacities. In addition, to address the reviewer's comments, this new version of our manuscript adds a characterization of the migration capacity of DCs infected with Mtb to the assays with viable bacteria already shown in the previous revised version.

      All in all, we believe that our study has significantly improved thanks to the feedback provided by the editor and reviewer panel during the different revision processes. We sincerely hope that this version of our manuscript is deemed fit for publication in this prestigious journal.

      Public Reviews:

      Reviewer #3 (Public Review):

      In the revised manuscript by Maio et al, the authors examined the bioenergetic mechanisms involved in the delayed migration of DCs during Mtb infection. The authors performed a series of in vitro infection experiments including bioenergetic experiments using the Agilent Seahorse XF, and glucose uptake and lactate production experiments. Also, data from SCENITH is included in the revised manuscript as well as some clinical data. This is a well written manuscript and addresses an important question in the TB field. A remaining weakness is the use of dead (irradiated) Mtb in several of the new experiments and claims where iMtb data were used to support live Mtb data. Another notable weakness lies in the authors' insistence on asserting that lactate is the ultimate product of glycolysis, rather than acknowledging a large body of historical data in support of pyruvate's role in the process. This raises a perplexing issue highlighted by the authors: if Mtb indeed upregulates glycolysis, one would expect that inhibiting glycolysis would effectively control TB. However, the reality contradicts this expectation. Lastly, the examination of the bioenergetics of cells isolated from TB patients undergoing drug therapy, rather than studying them at their baseline state, is a weakness.

      We thank the reviewer for this insightful assessment and feedback of our study. With regards to the data obtained with iMtb to support that with live Mtb, we have clarified the use of either iMtb or Mtb for each figure legend in the new version of the manuscript. Furthermore, we included the confirmation of the involvement of TLR2 ligation in the up-regulation of HIF-1α triggered by viable Mtb (new Fig S2E). We also conducted migration assays using (live) Mtb-infected dendritic cells (DCs) treated with either oxamate or PX-478 to validate that the HIF1a/glycolysis axis is indeed essential for DC migration (new Fig 5D).

      We respectfully acknowledge the reviewer's statement regarding the potential relationship between glycolysis and the control of TB. However, we find it necessary to elaborate on our stance, as our data offer a nuanced perspective. Our research indicates that DCs exhibit upregulated glycolysis following stimulation or infection by Mtb. This metabolic shift is crucial for facilitating cell migration to the draining lymph nodes, an essential step in mounting an effective immune response. Yet, it remains uncertain whether this glycolytic induction reaches a threshold conducive to generating a protective immune response, a matter that our findings do not definitively address. This aspect is carefully discussed in the manuscript, lines 380-385.

      Moreover, analyses of samples from chronic TB patients suggest that the outcome of inhibiting glycolysis may vary depending on factors such as the infection stage, the targeted cell type (e.g., monocytes, DCs), and the affected compartment (systemic versus local). This variability aligns with the concept of "too much, too little" exemplified by the dual roles of IFNγ (PMID: 28646367) and TNFα (PMID: 19275693) in TB, emphasizing the need to maintain an inflammatory equilibrium. In the context of the HIF1α/glycolysis axis, it appears to be a matter of timing: a case of "too early" activation of glycolysis in precursors, which could upset the delicate balance necessary for an effective immune response. We have added these comments in the discussion (pages 19-20, lines 468-485).

      In summary, while acknowledging the reviewer's perspective, we believe that a comprehensive understanding of the interplay between Mtb infection and glycolysis in myeloid cells requires further consideration of various contextual conditions, urging caution against oversimplified interpretations.

      With regard to the patients' information, as pointed out by the reviewer, according to the inclusion criteria for patient samples in the protocol approved by the Institutional Ethics Committee, we recruit patients who have received less than 15 days of treatment (for sensitive TB, the total treatment duration is at least 6 months). We do not have access to patient samples before they begin the treatment, as starting therapy is the most urgent matter in this case. Following the reviewer's suggestion, we investigated whether the glycolytic activity of monocytes correlated with the initiation of antibiotic treatment within this 15-day period. Our observations did not show any significant impact during the initial 15 days of treatment (see expanded reply below). However, after 2 months of treatment, we found that the glycolytic profile of CD16+ monocytes returned to baseline levels as per our analysis. This suggests that despite the normalization of glycolytic activity with antibiotic therapy, heightened basal glycolysis remains noticeable during the initial two weeks of treatment (time limit to meet the inclusion criteria in our study cohort).

      Recommendations for the authors:

      Reviewer #3 (Recommendations For The Authors):

      (1) In the revised manuscript, the authors addressed concerns related to using irradiated Mtb, a positive development. However, the study predominantly employs 1:1 or 2:1 MOI, representing a low infection model, with no observed statistical distinction between the two MOIs (Fig-1). To enhance the study, inclusion of a higher MOI (e.g., 5:1 or 10:1) would have been more informative. This becomes crucial as prior research on human macrophages indicates that Mtb infection typically hampers glycolysis, a finding inconsistent with the present study.

      As the reviewer notes, important work has documented MOI-dependent inhibition of glycolysis in M. tuberculosis-infected macrophages (PMID 30444490). For instance, in that study, hMDMs infected at an MOI of 1 showed increased extracellular acidification and glycolytic parameters, as opposed to macrophages infected at higher MOI, or the same MOI but measured in THP1 cells. In light of these findings, we attempted to extend our study with Mo-DCs to higher MOIs, but too much cell death was induced, limiting our ability to obtain reliable metabolic measurements and functional assays from these cultures. Consistent with this, other authors reported that more than 40% of Mo-DCs die within 24 hours of infection with H37Rv at an MOI of 10 (PMID 22024399, Fig 2B). We acknowledge that more comprehensive and focused in vivo studies would be needed to assess the overall impact of infection. We foresee that in the context of natural infection, DCs with different levels of infection will coexist, some with low bacillary load that may be able to trigger glycolysis and migrate, others highly infected and more likely to die. In this case, we are unable to provide a full explanation for the delay in the onset of the adaptive response, an aspect that requires further investigation. From our perspective, the important contribution of our work is more focused on understanding the later stage of infection, when chronic infection is established, where precursors already seem to have a limited capacity to generate DCs with good migratory performance regardless of being confronted with a low bacillary load.

      To better clarify the scope and limitations of the work, we added these comments to the discussion (see discussion, lines 405-408).

      (2) The study emphasizes that Mtb infection enhances glycolysis in Mo-DCs (Fig-1 and Fig-2). Despite the authors advocating lactate as the end product (citing three reviews/opinions), the historical literature supported by detailed experimentation convincingly favors pyruvate. While the authors' attempt to support an alternate glycolytic paradigm is understandable, it is simply not necessary. This is further supported by the authors' claim that oxamate is an inhibitor of glycolysis (abstract and main text). Oxamate is a pyruvate analogue that directly inhibits the conversion of pyruvate into lactate by lactate dehydrogenase. Simply put, if oxamate was an inhibitor of glycolysis then the cells would have died.

      Taking into account the reviewer's suggestions, we changed the text accordingly, referring to oxamate as an LDH inhibitor, including in the abstract.

      (3) In Fig-2, clarify the term "bystander DCs." Explain why these MtbRFP- DCs exhibit distinct behavior compared to uninfected DCs, especially considering their similarity to Mtb-infected ones.

      To clarify these results, as correctly suggested by the reviewer, we incorporated a sentence in the results section, stating that bystander DCs are cells that are not in direct association with Mtb (Mtb-RFP-DCs), but are rather nearby and exposed to the same environment (page 7, lines 145-148). In other words, bystander cells are those exposed to the same secretome and soluble factors as infected cells. Our data indicate that bystander DCs upregulate their state of glycolysis just like infected DCs do, which suggests the presence of soluble mediators induced during infection that are capable of triggering glycolysis even in uninfected cells.

      These results are in line with the observation that bacteria lacking infectious capacity (such as the irradiated Mtb) also trigger glycolysis in DCs (Fig 1), likely via TLR2 receptors that are potentially activated by the release of mycobacterial antigens or bacterial debris present in the microenvironment (Fig 3). We incorporated this interpretation in the discussion of the manuscript (lines 403-408).

      (4) Notably, the authors conducted SCENITH on both iMtb and viable Mtb (Fig-2). However, OCR, PER, and Mito- & Glyco- ATP were solely measured in MO-DCs stimulated by iMtb. Given the distinct glycolytic responses between iMtb and viable Mtb, it is crucial to assess these parameters in Mo-DCs treated with viable Mtb. Moreover, it is unclear as to how the relative ATP in Fig-2F was calculated as both Mito-ATP and Glyco-ATP is significantly high in iMtb-treated Mo-DCs (Fig-2E). Also, figure 2 contains panels with no labeling, which is confusing.

      We appreciate the reviewer's suggestion that additional determinations would enrich the bioenergetic profile of DCs during infection. However, due to biosafety considerations and economic limitations, we are currently unable to measure OCR, PER, and Mito- & Glyco- ATP, as these assessments require live cell cultures within BSL3 containment if live Mtb is to be employed. Regrettably, our BSL3 facility is not equipped with a Seahorse instrument, and few facilities in the world have made this type of BSL3 investment. For this key reason, we employed SCENITH for our BSL3-based experiments.

      Concerning how ATP was calculated, we show below the raw data for Mito-ATP and Glyco-ATP results and calculations of their relative contributions.

      Author response table 1.

      (5) In Figures 3, 4, & 5, the consistent use of only iMtb was observed. Previous concerns about this approach were raised in the review, with the authors asserting that the use of viable Mtb was beyond the manuscript's scope. However, this claim is inaccurate. Both the authors' findings and literature elsewhere emphasize notable differences not only in host-cell metabolism but also in immune responses when treated with viable Mtb compared to dead or iMtb. Therefore, it is recommended to incorporate viable Mtb in experiments where only iMtb was utilized. Also, in the abstract (3rd sentence), do the authors refer to live or irradiated Mtb? It is imperative to clearly indicate this distinction, as the subsequent conclusions are based only on one of these two scenarios, not both. The contradictory mitochondrial mass results (figure 1; live and dead Mtb showed opposite mitochondrial mass results) clearly illustrate the profound difference live (versus dead) Mtb cells can have on an experiment.

      We thank the reviewer for stating this concern. For Figure 3, the involvement of TLR2 ligation on lactate release was also confirmed with live Mtb (shown in Figure S2D). In this current version, we also confirmed the involvement of TLR2 ligation in the up-regulation of HIF-1α triggered by live Mtb (new Fig S2E). As for Figure 4, we agree that performing assays with live Mtb will add complementary information. Indeed, we hope to investigate in the future the impact of the glycolysis/HIF1a axes on the adaptive immune response. We believe that employing live bacteria and considering their active immune evasion strategies will be crucial. However, at present, this is not the focus of the current manuscript and is beyond its scope.

      We also agree with the reviewer that confirmation of the migratory behavior of DCs following Mtb infection is a crucial aspect of the study. To comply with this pertinent request, we performed new migration assays using Mtb-infected DCs treated with oxamate or PX-478 to validate the role of the HIF1a/glycolysis axis; the results demonstrate that this axis is essential for DC migration, particularly in the context of Mtb-infected cells (new Fig 5D). Having observed the same inhibitory effect of HIF1a and LDH inhibition on cell migration in both Mtb-infected and iMtb-stimulated DCs, we consider that the sentence alluded to by the reviewer in the abstract is now applicable to both contexts (page 2, lines 34-36). We hope the reviewer agrees.

      (6) The discussion and the graphical abstract elucidating the distinctions in glycolysis between CD16+ monocytes of HS and TB patients and iMtb-treated Mo-DCs are currently confusing and require clarification. According to the abstract, monocytes from TB patients exhibit heightened glycolysis, resulting in diminished HIF-a activity and migratory capacity of MO-DCs. This prompts a question: if exacerbated glycolysis in monocytes is associated with adverse outcomes, wouldn't it be logical to consider suppressing glycolysis? If so, how can inhibiting glycolysis, a favored metabolic pathway for pro-inflammatory responses, be beneficial for TB therapy?

      We understand the reviewer’s concern about this apparent paradox. As previously mentioned in response to the public review provided by the reviewer, inhibiting glycolysis may yield varying outcomes depending on the stage of infection, as well as the cellular target (e.g., monocytes, DCs) or compartment (systemic versus local). It is imperative to delve deeper into the potential role of the HIF1α/glycolysis axis at the systemic level within the context of chronic inflammation, contrasting with its role in a local setting during the acute phase of infection.

      A comprehensive understanding of the interplay between Mtb infection and glycolysis in myeloid cells requires further consideration of various contextual conditions, urging caution against oversimplified interpretations. For instance, one of the objectives of host-directed therapies (HDTs) is to mitigate host-response inflammatory toxicity, which can impede treatment efficacy (doi: 10.3389/fimmu.2021.645485). In this regard, traditional anti-inflammatory drugs such as non-steroidal anti-inflammatory drugs (NSAIDs) and corticosteroids have been explored as adjunct therapies due to their immunomodulatory properties. Additionally, compounds like vitamin D, phenylbutyrate (PBA), metformin, and thalidomide, among others, have been investigated in the context of TB infections (doi:10.3389/fimmu.2017.00772), highlighting the diverse range of strategies aimed at enhancing TB treatment. These efforts extend beyond bolstering antimicrobial activity to encompass minimizing inflammation and mitigating tissue damage.

      (7) I am not convinced that BubbleMap made any significant contribution to the manuscript perhaps because it is poorly described in the figure legends/main text (I am unable to determine what data set is significant or not).

      We agree with the reviewer’s comment. To clarify the valuable information gleaned from these analyses, we have added interpretive guidelines on bubble color, bubble size and statistical significance in the legend of Figure 7. We hope these changes may reflect the significant contribution of the BubbleMap analysis approach to this study, which demonstrates a significant enrichment of interferon response gene expression in the monocyte compartment from patients with active TB compared to their control counterparts. Notably, this enrichment does not extend to genes associated with the OXPHOS hallmark.

      (8) The use of cells/monocytes from TB patients is a concern in addition to the incomplete demographic table. In the case of the latter, absolute numbers including percentages should be included. Importantly, it appears that cells from TB patients were used, that received anti-TB drug therapy (regimen not stated) up to two weeks post diagnosis and not at baseline. This is important as recent studies have shown that anti-TB drugs modulates the bioenergetics of host cells. Lastly, what were the precise TB symptoms the authors referred to in figure 7C?

      We have updated the demographic table and included the absolute numbers. We concur with the reviewer's viewpoint, particularly in light of recent findings illustrating the impact of anti-TB drug treatment on cell metabolism (doi: 10.1128/AAC.00932-21/). Again, this study underscores the complexity of such effects, which exhibit considerable variability influenced by factors such as cell type, drug concentration, and combination therapy.

      Despite this variability, our analysis of monocytes from TB patients who received different antibiotic combinations within short time frames (less than 15 days) reveals a marked increase in glycolysis in CD16+ monocytes compared to healthy counterparts. We did not observe a correlation between monocyte glycolytic capacity and the start time of antibiotic treatment within this 15-day window (see below, Author response image 1). These findings suggest that the antibiotic regimen does not have a significant impact on monocyte glycolytic capacity during the first 15 days. However, we did observe an effect of antibiotic treatment when comparing patients before and 2 months after treatment. Enrichment analysis of various monocyte subsets before and after 2 months of treatment (GEO accession number: GSE185372) showed that CD14dim CD16+ and CD14+ CD16+ populations had higher glycolytic activity before treatment, which then decreased post-treatment (Author response image 2).

      Author response image 1.

      Correlation analysis between the baseline glycolytic capacity and the time since treatment onset for each monocyte subset (CD14+CD16-, CD14+CD16+ and CD14dimCD16+, N = 11). Linear regression lines are shown. Spearman’s rank test. The data are represented as scatter plots with each circle representing a single individual.

      Author response image 2.

      Gene enrichment analysis for glycolytic genes on the pairwise comparisons of each monocyte subset (CD14+CD16-, CD14+CD16+ and CD14dimCD16+) from patients with active TB pre-treatment vs patients with active TB (TB) undergoing treatment for 2 months. Comparisons with a p-value of less than 0.05 and an FDR value of less than 0.25 are considered significantly different.

      Overall, our results indicate that while drug treatment does affect cell bioenergetics, this effect is not prominent within the first 15 days of treatment. CD16+ monocytes maintain high basal glycolytic activity that normalizes after treatment, contrasting with the CD16- population (even under the same circulating antibiotic doses). This highlights the intricate interplay between anti-TB drugs and cellular metabolism, underscoring the need for further research to understand the underlying mechanisms and therapeutic implications.

      Finally, the term "symptoms evolution" refers to the time period during which a patient experiences cough and phlegm for more than 2-3 weeks, with or without sputum that may (or may not) be bloody, accompanied by symptoms of constitutional illness (e.g., loss of appetite, weight loss, night sweats, general malaise). As requested, this definition has been included in the Methods section (pages 28-29, lines 705-709).

      Minor:

      (1) Incorporate the abbreviation for tuberculosis "(TB)" in the first line of the abstract and similarly introduce the abbreviation for Mycobacterium tuberculosis when it is first mentioned in the abstract.

      Thank you, we have amended it accordingly.

      (2) As the majority of experiments are in vitro, the authors should specify the number of times each experiment was conducted for every figure.

      We have included this information in each figure legend (see N for each panel). Since the majority of our approaches are conducted in vitro using primary cell cultures (specifically, human monocyte-derived DCs), we utilized samples from four to ten independent donors, not replicates, in order to account for the variability seen between donors.

      (3) Rename Fig-2. Ensure consistent labeling for the metabolic dependency of uninfected, Mtb-infected, and the Bystander panel, aligning with the format used in panels A & B. Similarly, replace '-' with 'uninfected'.

      We have modified the figure following most of the reviewer’s suggestions. However, we decided to keep the nomenclature “-” to denote a control condition, which can be unstimulated (panels A-B, fig 2) or uninfected cells (panels C-D, fig 2) depending on the experimental design.

      (4) Discussion: It is unclear what the authors mean by 'some sort of exhausted glycolytic capacity'.

      We have slightly modified the phrase.

    1. Author response:

      The following is the authors’ response to the current reviews.

      eLife assessment

      This useful manuscript challenges the utility of current paradigms for estimating brain-age with magnetic resonance imaging measures, but presents inadequate evidence to support the suggestion that an alternative approach focused on predicting cognition is more useful. The paper would benefit from a clearer explication of the methods and a more critical evaluation of the conceptual basis of the different models. This work will be of interest to researchers working on brain-age and related models.

      Thank you so much for providing high-quality reviews on our manuscript. We revised the manuscript to address all of the reviewers’ comments and provided full responses to each of the comments below. Importantly, in this revision, we clarified that we did not intend to use Brain Cognition as an alternative approach as mentioned by the editor. This is because, by design, the variation in fluid cognition explained by Brain Cognition should be higher or equal to that explained by Brain Age. Here we made this point more explicit and further stated that the relationship between Brain Cognition and fluid cognition indicates the upper limit of Brain Age’s capability in capturing fluid cognition. By examining what was captured by Brain Cognition, over and above Brain Age and chronological age via the unique effects of Brain Cognition, we were able to quantify the amount of co-variation between brain MRI and fluid cognition that was missed by Brain Age. And such quantification is the third aim of this study.

      Reviewer #1 (Public Review):

      In this paper, the authors evaluate the utility of brain age derived metrics for predicting cognitive decline by performing a 'commonality' analysis in a downstream regression that enables the different contribution of different predictors to be assessed. The main conclusion is that brain age derived metrics do not explain much additional variation in cognition over and above what is already explained by age. The authors propose to use a regression model trained to predict cognition ('brain cognition') as an alternative suited to applications of cognitive decline. While this is less accurate overall than brain age, it explains more unique variance in the downstream regression.

      Importantly, in this revision, we clarified that we did not intend to use Brain Cognition as an alternative approach. This is because, by design, the variation in fluid cognition explained by Brain Cognition should be higher or equal to that explained by Brain Age. Here we made this point more explicit and further stated that the relationship between Brain Cognition and fluid cognition indicates the upper limit of Brain Age’s capability in capturing fluid cognition. By examining what was captured by Brain Cognition, over and above Brain Age and chronological age via the unique effects of Brain Cognition, we were able to quantify the amount of co-variation between brain MRI and fluid cognition that was missed by Brain Age.

      REVISED VERSION: while the authors have partially addressed my concerns, I do not feel they have addressed them all. I do not feel they have addressed the weight instability and concerns about the stacked regression models satisfactorily.

      Please see our responses to #3 below

      I also must say that I agree with Reviewer 3 about the limitations of the brain age and brain cognition methods conceptually. In particular that the regression model used to predict fluid cognition will by construction explain more variance in cognition than a brain age model that is trained to predict age. This suffers from the same problem the authors raise with brain age and would indeed disappear if the authors had a separate measure of cognition against which to validate and were then to regress this out as they do for age correction. I am aware that these conceptual problems are more widespread than this paper alone (in fact throughout the brain age literature), so I do not believe the authors should be penalised for that. However, I do think they can make these concerns more explicit and further tone down the comments they make about the utility of brain cognition. I have indicated the main considerations about these points in the recommendations section below.

      Thank you so much for raising this point. We now have the following statement in the introduction and discussion to address this concern (see below).

      Briefly, we made it explicit that, by design, the variation in fluid cognition explained by Brain Cognition should be higher or equal to that explained by Brain Age. That is, the relationship between Brain Cognition and fluid cognition indicates the upper limit of Brain Age’s capability in capturing fluid cognition. More importantly, by examining what was captured by Brain Cognition, over and above Brain Age and chronological age via the unique effects of Brain Cognition, we were able to quantify the amount of co-variation between brain MRI and fluid cognition that was missed by Brain Age. And this is the third goal of this present study.

      From Introduction:

      “Third and finally, certain variation in fluid cognition is related to brain MRI, but to what extent does Brain Age not capture this variation? To estimate the variation in fluid cognition that is related to the brain MRI, we could build prediction models that directly predict fluid cognition (i.e., as opposed to chronological age) from brain MRI data. Previous studies found reasonable predictive performances of these cognition-prediction models, built from certain MRI modalities (Dubois et al., 2018; Pat, Wang, Anney, et al., 2022; Rasero et al., 2021; Sripada et al., 2020; Tetereva et al., 2022; for review, see Vieira et al., 2022). Analogous to Brain Age, we called the predicted values from these cognition-prediction models, Brain Cognition. The strength of an out-of-sample relationship between Brain Cognition and fluid cognition reflects variation in fluid cognition that is related to the brain MRI and, therefore, indicates the upper limit of Brain Age’s capability in capturing fluid cognition. That is, by design, the variation in fluid cognition explained by Brain Cognition should be higher or equal to that explained by Brain Age. Consequently, if we included Brain Cognition, Brain Age and chronological age in the same model to explain fluid cognition, we would be able to examine the unique effects of Brain Cognition that explain fluid cognition beyond Brain Age and chronological age. These unique effects of Brain Cognition, in turn, would indicate the amount of co-variation between brain MRI and fluid cognition that is missed by Brain Age.”

      From Discussion:

      “Third, by introducing Brain Cognition, we showed the extent to which Brain Age indices were not able to capture the variation in fluid cognition that is related to brain MRI. More specifically, using Brain Cognition allowed us to gauge the variation in fluid cognition that is related to the brain MRI, and thereby, to estimate the upper limit of what Brain Age can do. Moreover, by examining what was captured by Brain Cognition, over and above Brain Age and chronological age via the unique effects of Brain Cognition, we were able to quantify the amount of co-variation between brain MRI and fluid cognition that was missed by Brain Age.

      From our results, Brain Cognition, especially from certain cognition-prediction models such as the stacked models, has relatively good predictive performance, consistent with previous studies (Dubois et al., 2018; Pat, Wang, Anney, et al., 2022; Rasero et al., 2021; Sripada et al., 2020; Tetereva et al., 2022; for review, see Vieira et al., 2022). We then examined Brain Cognition using commonality analyses (Nimon et al., 2008) in multiple regression models having a Brain Age index, chronological age and Brain Cognition as regressors to explain fluid cognition. Similar to Brain Age indices, Brain Cognition exhibited large common effects with chronological age. But more importantly, unlike Brain Age indices, Brain Cognition showed large unique effects, up to around 11%. As explained above, the unique effects of Brain Cognition indicated the amount of co-variation between brain MRI and fluid cognition that was missed by a Brain Age index and chronological age. This missing amount was relatively high, considering that Brain Age and chronological age together explained around 32% of the total variation in fluid cognition. Accordingly, if a Brain Age index was used as a biomarker along with chronological age, we would have missed an opportunity to improve the performance of the model by around one-third of the variation explained.”
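      The "around one-third" figure in the passage above follows from the two percentages it reports. A sketch of the arithmetic, using the standard commonality-analysis definition of a unique effect as the gain in explained variance when a regressor is added last; the symbols are ours (F = fluid cognition, A = chronological age, BA = a Brain Age index, BC = Brain Cognition) and the values are the approximate upper-end figures quoted above:

```latex
\begin{align*}
U_{BC} &= R^2_{F \cdot A,\,BA,\,BC} - R^2_{F \cdot A,\,BA} \approx 0.11, \\
\frac{U_{BC}}{R^2_{F \cdot A,\,BA}} &\approx \frac{0.11}{0.32} \approx 0.34 .
\end{align*}
```

      In words, adding Brain Cognition to a model that already contains chronological age and a Brain Age index raises the explained variation by an amount equal to roughly one-third of what those two regressors explain on their own.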

      This is a reasonably good paper and the use of a commonality analysis is a nice contribution to understanding variance partitioning across different covariates. I have some comments that I believe the authors ought to address, which mostly relate to clarity and interpretation.

      Reviewer #1 Public Review #1

      First, from a conceptual point of view, the authors focus exclusively on cognition as a downstream outcome. I would suggest the authors nuance their discussion to provide broader considerations of the utility of their method and on the limits of interpretation of brain age models more generally.

      Thank you for your comments on this issue.

      We now discussed the broader consideration in detail:

      (1) the consistency between our findings on fluid cognition and other recent works on brain disorders,

      (2) the difference between studies investigating the utility of Brain Age in explaining cognitive functioning, including ours and others (e.g., Butler et al., 2021; Cole, 2020, 2020; Jirsaraie, Kaufmann, et al., 2023) and those explaining neurological/psychological disorders (e.g., Bashyam et al., 2020; Rokicki et al., 2021)

      and

      (3) suggested solutions we and others made to optimise the utility of Brain Age for both cognitive functioning and brain disorders.

      From Discussion:

      “This discrepancy between the predictive performance of age-prediction models and the utility of Brain Age indices as a biomarker is consistent with recent findings (for review, see Jirsaraie, Gorelik, et al., 2023), both in the context of cognitive functioning (Jirsaraie, Kaufmann, et al., 2023) and neurological/psychological disorders (Bashyam et al., 2020; Rokicki et al., 2021). For instance, combining different MRI modalities into the prediction models, similar to our stacked models, often leads to the highest performance of age-prediction models, but does not likely explain the highest variance across different phenotypes, including cognitive functioning and beyond (Jirsaraie, Gorelik, et al., 2023).”

      “There is a notable difference between studies investigating the utility of Brain Age in explaining cognitive functioning, including ours and others (e.g., Butler et al., 2021; Cole, 2020, 2020; Jirsaraie, Kaufmann, et al., 2023) and those explaining neurological/psychological disorders (e.g., Bashyam et al., 2020; Rokicki et al., 2021). We consider the former as a normative type of study and the latter as a case-control type of study (Insel et al., 2010; Marquand et al., 2016). Those case-control Brain Age studies focusing on neurological/psychological disorders often build age-prediction models from MRI data of largely healthy participants (e.g., controls in a case-control design or large samples in a population-based design), apply the built age-prediction models to participants without vs. with neurological/psychological disorders and compare Brain Age indices between the two groups. On the one hand, this means that case-control studies treat Brain Age as a method to detect anomalies in the neurological/psychological group (Hahn et al., 2021). On the other hand, this also means that case-control studies have to ignore under-fitted models when applying prediction models built from largely healthy participants to participants with neurological/psychological disorders (i.e., Brain Age may predict chronological age well for the controls, but not for those with a disorder). On the contrary, our study and other normative studies focusing on cognitive functioning often build age-prediction models from MRI data of largely healthy participants and apply the built age-prediction models to participants who are also largely healthy. Accordingly, the age-prediction models for explaining cognitive functioning in normative studies, while not allowing us to detect group-level anomalies, do not suffer from being under-fitted. This unfortunately might limit the generalisability of our study to just the normative type of study. Future work is still needed to test the utility of brain age in the case-control case.”

      “Next, researchers should not select age-prediction models based solely on age-prediction performance. Instead, researchers could select age-prediction models that explained phenotypes of interest the best. Here we selected age-prediction models based on a set of features (i.e., modalities) of brain MRI. This strategy was found effective not only for fluid cognition as we demonstrated here, but also for neurological and psychological disorders as shown elsewhere (Jirsaraie, Gorelik, et al., 2023; Rokicki et al., 2021). Rokicki and colleagues (2021), for instance, found that, while integrating across MRI modalities led to age-prediction models with the highest age-prediction performance, using only T1 structural MRI gave age-prediction models that were better at classifying Alzheimer’s disease. Similarly, using only cerebral blood flow gave age-prediction models that were better at classifying mild/subjective cognitive impairment, schizophrenia and bipolar disorder.

      As opposed to selecting age-prediction models based on a set of features, researchers could also select age-prediction models based on modelling methods. For instance, Jirsaraie and colleagues (2023) compared gradient tree boosting (GTB) and deep-learning brain network (DBN) algorithms in building age-prediction models. They found GTB to have higher age-prediction performance but DBN to have better utility in explaining cognitive functioning. In this case, an algorithm with better utility (e.g., DBN) should be used for explaining a phenotype of interest. Similarly, Bashyam and colleagues (2020) built different DBN-based age-prediction models, varying in age-prediction performance. The DBN models with a higher number of epochs corresponded to higher age-prediction performance. However, DBN-based age-prediction models with a moderate (as opposed to higher or lower) number of epochs were better at classifying Alzheimer’s disease, mild cognitive impairment and schizophrenia. In this case, a model from the same algorithm with better utility (e.g., those DBN with a moderate epoch number) should be used for explaining a phenotype of interest. Accordingly, this calls for a change in research practice, as recently pointed out by Jirsaraie and colleagues (2023, p7), “Despite mounting evidence, there is a persisting assumption across several studies that the most accurate brain age models will have the most potential for detecting differences in a given phenotype of interest”. Future neuroimaging research should aim to build age-prediction models that are not necessarily good at predicting age, but at capturing phenotypes of interest.”

      Reviewer #1 Public Review #2

      Second, from a methods perspective, there is not a sufficient explanation of the methodological procedures in the current manuscript to fully understand how the stacked regression models were constructed. I would request that the authors provide more information to enable the reader to better understand the stacked regression models used to ensure that these models are not overfit.

      Thank you for allowing us an opportunity to clarify our stacked model. We made additional clarification to make this clearer (see below). We wanted to confirm that we did not use test sets to build a stacked model in both lower and higher levels of the Elastic Net models. Test sets were there just for testing the performance of the models.

      From Methods: “We used nested cross-validation (CV) to build these prediction models (see Figure 7). We first split the data into five outer folds, leaving each outer fold with around 100 participants. This number of participants in each fold is to ensure the stability of the test performance across folds. In each outer-fold CV loop, one of the outer folds was treated as an outer-fold test set, and the rest was treated as an outer-fold training set. Ultimately, looping through the nested CV resulted in a) prediction models from each of the 18 sets of features as well as b) prediction models that drew information across different combinations of the 18 separate sets, known as “stacked models.” We specified eight stacked models: “All” (i.e., including all 18 sets of features), “All excluding Task FC”, “All excluding Task Contrast”, “Non-Task” (i.e., including only Rest FC and sMRI), “Resting and Task FC”, “Task Contrast and FC”, “Task Contrast” and “Task FC”. Accordingly, there were 26 prediction models in total for both Brain Age and Brain Cognition.

      To create these 26 prediction models, we applied three steps for each outer-fold loop. The first step aimed at tuning prediction models for each of 18 sets of features. This step only involved the outer-fold training set and did not involve the outer-fold test set. Here, we divided the outer-fold training set into five inner folds and applied inner-fold CV to tune hyperparameters with grid search. Specifically, in each inner-fold CV, one of the inner folds was treated as an inner-fold validation set, and the rest was treated as an inner-fold training set. Within each inner-fold CV loop, we used the inner-fold training set to estimate parameters of the prediction model with a particular set of hyperparameters and applied the estimated model to the inner-fold validation set. After looping through the inner-fold CV, we, then, chose the prediction models that led to the highest performance, reflected by coefficient of determination (R2), on average across the inner-fold validation sets. This led to 18 tuned models, one for each of the 18 sets of features, for each outer fold.

      The second step aimed at tuning stacked models. Same as the first step, the second step only involved the outer-fold training set and did not involve the outer-fold test set. Here, using the same outer-fold training set as the first step, we applied tuned models, created from the first step, one from each of the 18 sets of features, resulting in 18 predicted values for each participant. We, then, re-divided this outer-fold training set into new five inner folds. In each inner fold, we treated different combinations of the 18 predicted values from separate sets of features as features to predict the targets in separate “stacked” models. Same as the first step, in each inner-fold CV loop, we treated one out of five inner folds as an inner-fold validation set, and the rest as an inner-fold training set. Also as in the first step, we used the inner-fold training set to estimate parameters of the prediction model with a particular set of hyperparameters from our grid. We tuned the hyperparameters of stacked models using grid search by selecting the models with the highest R2 on average across the inner-fold validation sets. This led to eight tuned stacked models.

      The third step aimed at testing the predictive performance of the 18 tuned prediction models from each of the set of features, built from the first step, and eight tuned stacked models, built from the second step. Unlike the first two steps, here we applied the already tuned models to the outer-fold test set. We started by applying the 18 tuned prediction models from each of the sets of features to each observation in the outer-fold test set, resulting in 18 predicted values. We then applied the tuned stacked models to these predicted values from separate sets of features, resulting in eight predicted values.

      To demonstrate the predictive performance, we assessed the similarity between the observed values and the predicted values of each model across outer-fold test sets, using Pearson’s r, coefficient of determination (R2) and mean absolute error (MAE). Note that for R2, we used the sum of squares definition (i.e., R2 = 1 – (sum of squares residuals/total sum of squares)) per a previous recommendation (Poldrack et al., 2020). We considered the predicted values from the outer-fold test sets of models predicting age or fluid cognition, as Brain Age and Brain Cognition, respectively.”
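      The three performance metrics named above can be written out directly. The following is a minimal sketch, not the code used in the study; the function names and the toy values are hypothetical. The toy example is chosen so that the predictions are perfectly correlated with the observations but offset by a constant, which shows why the sum-of-squares definition of R2 is not simply the square of Pearson's r.

```python
from math import sqrt


def pearson_r(obs, pred):
    """Pearson correlation between observed and predicted values."""
    mo, mp = sum(obs) / len(obs), sum(pred) / len(pred)
    cov = sum((o - mo) * (p - mp) for o, p in zip(obs, pred))
    return cov / sqrt(sum((o - mo) ** 2 for o in obs) * sum((p - mp) ** 2 for p in pred))


def r2_sum_of_squares(obs, pred):
    """R2 = 1 - (sum of squares residuals / total sum of squares)."""
    mo = sum(obs) / len(obs)
    ss_res = sum((o - p) ** 2 for o, p in zip(obs, pred))
    ss_tot = sum((o - mo) ** 2 for o in obs)
    return 1 - ss_res / ss_tot


def mae(obs, pred):
    """Mean absolute error."""
    return sum(abs(o - p) for o, p in zip(obs, pred)) / len(obs)


# Toy values (hypothetical): every prediction is too high by exactly 2.
observed = [1.0, 2.0, 3.0, 4.0, 5.0]
predicted = [3.0, 4.0, 5.0, 6.0, 7.0]

print(pearson_r(observed, predicted))          # 1.0
print(r2_sum_of_squares(observed, predicted))  # -1.0
print(mae(observed, predicted))                # 2.0
```

      A constant bias leaves Pearson's r at 1 but drives the sum-of-squares R2 below zero, so the two metrics answer different questions about the same predictions.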

      Note some previous research, including ours (Tetereva et al., 2022), splits the observations in the outer-fold training set into layer 1 and layer 2 and applies the first and second steps to layers 1 and 2, respectively. Here we decided against this approach and used the same outer-fold training set for both first and second steps in order to avoid potential bias toward the stacked models. This is because, when the data are split into two layers, predictive models built for each separate set of features only use the data from layer 1, while the stacked models use the data from both layers 1 and 2. In practice with large enough data, these two approaches might not differ much, as we demonstrated previously (Tetereva et al., 2022).
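      The three steps described above, with both tuning steps sharing the same outer-fold training set, can be sketched as follows. This is an illustration under stated assumptions rather than our analysis code: it uses three synthetic feature sets and one stacked model instead of 18 and eight, a small hand-written coordinate-descent Elastic Net with a fixed L1 ratio as a stand-in for the actual solver, a three-value grid for the penalty strength, and a refit on the whole outer-fold training set once the best hyperparameter is chosen. All names and data are hypothetical.

```python
import random


def mean(values):
    values = list(values)
    return sum(values) / len(values)


def make_folds(n, k, seed):
    """Shuffle indices 0..n-1 and deal them into k near-equal folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]


def r2(obs, pred):
    """Sum-of-squares R2: 1 - SS_residual / SS_total."""
    m = mean(obs)
    ss_res = sum((o - p) ** 2 for o, p in zip(obs, pred))
    return 1 - ss_res / sum((o - m) ** 2 for o in obs)


def fit(X, y, alpha, l1_ratio=0.5, sweeps=30):
    """Elastic Net via coordinate descent; returns (weights, intercept)."""
    n, p = len(X), len(X[0])
    w, b = [0.0] * p, mean(y)
    for _ in range(sweeps):
        for j in range(p):
            # Covariance of feature j with the residual that leaves feature j out.
            rho = mean(
                X[i][j] * (y[i] - b - sum(w[k] * X[i][k] for k in range(p) if k != j))
                for i in range(n)
            )
            # Soft-thresholding gives the L1 part; the denominator adds the L2 part.
            shrunk = max(abs(rho) - alpha * l1_ratio, 0.0) * (1 if rho > 0 else -1)
            w[j] = shrunk / (mean(row[j] ** 2 for row in X) + alpha * (1 - l1_ratio))
        b = mean(y[i] - sum(wk * xk for wk, xk in zip(w, X[i])) for i in range(n))
    return w, b


def predict(model, X):
    w, b = model
    return [b + sum(wk * xk for wk, xk in zip(w, row)) for row in X]


def rows(data, idx):
    return [data[i] for i in idx]


def tune(X, y, grid, seed):
    """Inner 5-fold CV: pick the alpha with the highest mean validation R2, then refit."""
    folds = make_folds(len(y), 5, seed)

    def cv_score(alpha):
        scores = []
        for val in folds:
            held = set(val)
            tr = [i for i in range(len(y)) if i not in held]
            model = fit(rows(X, tr), rows(y, tr), alpha)
            scores.append(r2(rows(y, val), predict(model, rows(X, val))))
        return mean(scores)

    return fit(X, y, max(grid, key=cv_score))


def base_predictions(models, feature_sets, idx):
    """One column of predicted values per feature set, for the given participants."""
    cols = [predict(m, rows(X, idx)) for m, X in zip(models, feature_sets)]
    return [list(r) for r in zip(*cols)]


def nested_cv(feature_sets, y, grid=(0.01, 0.1, 1.0), n_outer=5):
    out = [None] * len(y)
    for test in make_folds(len(y), n_outer, seed=0):
        held = set(test)
        train = [i for i in range(len(y)) if i not in held]
        y_tr = rows(y, train)
        # Step 1: tune one model per feature set on the outer-fold training set.
        base = [tune(rows(X, train), y_tr, grid, seed=1) for X in feature_sets]
        # Step 2: same training set, new inner folds; base predictions become features.
        stacked = tune(base_predictions(base, feature_sets, train), y_tr, grid, seed=2)
        # Step 3: only now is the outer-fold test set touched.
        for i, p in zip(test, predict(stacked, base_predictions(base, feature_sets, test))):
            out[i] = p
    return out


# Synthetic data (hypothetical): three noisy two-feature views of the target.
rng = random.Random(42)
y = [rng.gauss(0, 1) for _ in range(100)]
feature_sets = [[[t + rng.gauss(0, s), rng.gauss(0, 1)] for t in y] for s in (0.5, 1.0, 2.0)]
stacked_pred = nested_cv(feature_sets, y)
print("out-of-sample R2 of the stacked model:", round(r2(y, stacked_pred), 2))
```

      The point of the structure is that the outer-fold test indices appear only in the last step, so neither the separate models nor the stacked model sees test data during tuning.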

      Reviewer #1 Public Review #3

      Please also provide an indication of the different regression strengths that were estimated across the different models and cross-validation splits. Also, how stable were the weights across splits?

      The focus of this article is on the predictions. Still, it is informative for readers to understand how stable the feature importance (i.e., Elastic Net coefficients) is. To demonstrate the stability of feature importance, we now examined the rank stability of feature importance using Spearman’s ρ (see Figure 4). Specifically, we correlated the feature importance between two prediction models of the same features, used in two different outer-fold test sets. Given that there were five outer-fold test sets, we computed 10 Spearman’s ρ for each prediction model of the same features. We found Spearman’s ρ to vary dramatically in both age-prediction (range=.31-.94) and fluid cognition-prediction (range=.16-.84) models. This means that some prediction models were much more stable in their feature importance than others. This is probably due to various factors such as a) the collinearity of features in the model, b) the number of features (e.g., 71,631 features in functional connectivity, which were further reduced to 75 PCAs, as compared to 19 features in subcortical volume based on the ASEG atlas), c) the penalisation of coefficients either with ‘Ridge’ or ‘Lasso’ methods, which resulted in reduction as a group of features or selection of a feature among correlated features, respectively, and d) the predictive performance of the models. Understanding the stability of feature importance is beyond the scope of the current article. As mentioned by Reviewer 1, “The predictions can be stable when the coefficients are not,” and we chose to focus on the prediction in the current article.
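      The rank-stability computation described above (five outer folds, hence 10 pairwise Spearman's ρ per model) can be sketched as follows. This is an illustration, not our analysis code: the coefficient values are hypothetical, the helper names are ours, and the coefficients are ranked as signed values, which is an assumption about how feature importance is compared.

```python
from itertools import combinations
from math import sqrt


def ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    out = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        for pos in range(start, end + 1):
            out[order[pos]] = (start + end) / 2 + 1
        start = end + 1
    return out


def pearson(a, b):
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    denom = sqrt(sum((x - ma) ** 2 for x in a) * sum((y - mb) ** 2 for y in b))
    # Undefined when one vector is constant (e.g., all coefficients shrunk to zero).
    return cov / denom if denom else float("nan")


def spearman(a, b):
    """Spearman's rho = Pearson correlation of the ranks."""
    return pearson(ranks(a), ranks(b))


def rank_stability(coefs_by_fold):
    """Spearman's rho for every pair of outer folds."""
    return [spearman(a, b) for a, b in combinations(coefs_by_fold, 2)]


# Hypothetical coefficients: one row per outer fold, one column per feature.
coefs_by_fold = [
    [0.40, -0.10, 0.05, 0.00],
    [0.35, -0.12, 0.00, 0.02],
    [0.42, -0.08, 0.06, 0.01],
    [0.10, -0.30, 0.20, 0.00],
    [0.38, -0.11, 0.04, 0.03],
]
rhos = rank_stability(coefs_by_fold)
print(len(rhos), "pairwise values; range:", round(min(rhos), 2), "to", round(max(rhos), 2))
```

      Reporting the range of the pairwise values, as done above for the age-prediction and fluid cognition-prediction models, summarises how far the coefficient rankings move between outer folds.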

      Reviewer #1 Public Review #4

      Please provide more details about the task designs, MRI processing procedures that were employed on this sample in addition to the regression methods and bias correction methods used. For example, there are several different parameterisations of the elastic net, please provide equations to describe the method used here so that readers can easily determine how the regularisation parameters should be interpreted.

      Thank you for the opportunity to provide more methodological details.

      First, for the task design, we included the following statements:

      From Methods:

      “HCP-A collected fMRI data from three tasks: Face Name (Sperling et al., 2001), Conditioned Approach Response Inhibition Task (CARIT) (Somerville et al., 2018) and VISual MOTOR (VISMOTOR) (Ances et al., 2009).

      First, the Face Name task (Sperling et al., 2001) taps into episodic memory. The task had three blocks. In the encoding block [Encoding], participants were asked to memorise the names of faces shown. These faces were then shown again in the recall block [Recall] when the participants were asked if they could remember the names of the previously shown faces. There was also the distractor block [Distractor] occurring between the encoding and recall blocks. Here participants were distracted by a Go/NoGo task. We computed six contrasts for this Face Name task: [Encode], [Recall], [Distractor], [Encode vs. Distractor], [Recall vs. Distractor] and [Encode vs. Recall].

      Second, the CARIT task (Somerville et al., 2018) was adapted from the classic Go/NoGo task and taps into inhibitory control. Participants were asked to press a button to all [Go] but not to two [NoGo] shapes. We computed three contrasts for the CARIT task: [NoGo], [Go] and [NoGo vs. Go].

      Third, the VISMOTOR task (Ances et al., 2009) was designed to test simple activation of the motor and visual cortices. Participants saw a checkerboard with a red square either on the left or right. They needed to press a corresponding key to indicate the location of the red square. We computed just one contrast for the VISMOTOR task: [Vismotor], which indicates the presence of the checkerboard vs. baseline.”

      Second, for MRI processing procedures, we included the following statements.

      From Methods: “HCP-A provides details of parameters for brain MRI elsewhere (Bookheimer et al., 2019; Harms et al., 2018). Here we used MRI data that were pre-processed by the HCP-A with recommended methods, including the MSMALL alignment (Glasser et al., 2016; Robinson et al., 2018) and ICA-FIX (Glasser et al., 2016) for functional MRI. We used multiple brain MRI modalities, covering task functional MRI (task fMRI), resting-state functional MRI (rsfMRI) and structural MRI (sMRI), and organised them into 19 sets of features.”

      “ Sets of Features 1-10: Task fMRI contrast (Task Contrast) Task contrasts reflect fMRI activation relevant to events in each task. Bookheimer and colleagues (2019) provided detailed information about the fMRI in HCP-A. Here we focused on the pre-processed task fMRI Connectivity Informatics Technology Initiative (CIFTI) files with a suffix, “_PA_Atlas_MSMAll_hp0_clean.dtseries.nii.” These CIFTI files encompassed both the cortical mesh surface and subcortical volume (Glasser et al., 2013). Collected using the posterior-to-anterior (PA) phase, these files were aligned using MSMALL (Glasser et al., 2016; Robinson et al., 2018), linear detrended (see https://groups.google.com/a/humanconnectome.org/g/hcp-users/c/ZLJc092h980/m/GiihzQAUAwAJ) and cleaned from potential artifacts using ICA-FIX (Glasser et al., 2016).

      To extract Task Contrasts, we regressed the fMRI time series on the convolved task events using a double-gamma canonical hemodynamic response function via FMRIB Software Library (FSL)’s FMRI Expert Analysis Tool (FEAT) (Woolrich et al., 2001). We kept FSL’s default high-pass cutoff at 200 s (i.e., .005 Hz). We then parcellated the contrast ‘cope’ files, using the Glasser atlas (Gordon et al., 2016) for cortical surface regions and FreeSurfer’s automatic segmentation (aseg) (Fischl et al., 2002) for subcortical regions. This resulted in 379 regions, whose number was, in turn, the number of features for each Task Contrast set of features.”

      “Sets of Features 11-13: Task fMRI functional connectivity (Task FC) Task FC reflects functional connectivity (FC) among the brain regions during each task, which is considered an important source of individual differences (Elliott et al., 2019; Fair et al., 2007; Gratton et al., 2018). We used the same CIFTI file “_PA_Atlas_MSMAll_hp0_clean.dtseries.nii.” as the task contrasts. Unlike Task Contrasts, here we treated the double-gamma, convolved task events as regressors of no interest and focused on the residuals of the regression from each task (Fair et al., 2007). We computed these regressors in FSL, and regressed them out in nilearn (Abraham et al., 2014). Following previous work on task FC (Elliott et al., 2019), we applied a highpass at .008 Hz. For parcellation, we used the same atlases as Task Contrast (Fischl et al., 2002; Glasser et al., 2016). We computed Pearson’s correlations of each pair of 379 regions, resulting in a table of 71,631 non-overlapping FC indices for each task. We then applied r-to-z transformation and principal component analysis (PCA) of 75 components (Rasero et al., 2021; Sripada et al., 2019, 2020). Note that, to avoid data leakage, we conducted the PCA on each training set and applied its definition to the corresponding test set. Accordingly, there were three sets of 75 features for Task FC, one for each task.

      Set of Features 14: Resting-state functional MRI functional connectivity (Rest FC) Similar to Task FC, Rest FC reflects functional connectivity (FC) among the brain regions, except that Rest FC was measured during the resting (as opposed to task-performing) period. HCP-A collected Rest FC from four 6.42-min (488 frames) runs across two days, leading to about 26 min of data (Harms et al., 2018). On each day, the study scanned two runs of Rest FC, starting with anterior-to-posterior (AP) and then with posterior-to-anterior (PA) phase encoding polarity. We used the “rfMRI_REST_Atlas_MSMAll_hp0_clean.dscalar.nii” file that was pre-processed and concatenated across the four runs. We applied the same computations (i.e., highpass filter, parcellation, Pearson’s correlations, r-to-z transformation and PCA) as for the Task FC.

      Sets of Features 15-18: Structural MRI (sMRI)

      sMRI reflects individual differences in brain anatomy. The HCP-A used an established pre-processing pipeline for sMRI (Glasser et al., 2013). We focused on four sets of features: cortical thickness, cortical surface area, subcortical volume and total brain volume. For cortical thickness and cortical surface area, we used Destrieux’s atlas (Destrieux et al., 2010; Fischl, 2012) from FreeSurfer’s “aparc.stats” file, resulting in 148 regions for each set of features. For subcortical volume, we used the aseg atlas (Fischl et al., 2002) from FreeSurfer’s “aseg.stats” file, resulting in 19 regions. For total brain volume, we had five FreeSurfer-based features: “FS_IntraCranial_Vol” or estimated intra-cranial volume, “FS_TotCort_GM_Vol” or total cortical grey matter volume, “FS_Tot_WM_Vol” or total cortical white matter volume, “FS_SubCort_GM_Vol” or total subcortical grey matter volume and “FS_BrainSegVol_eTIV_Ratio” or ratio of brain segmentation volume to estimated total intracranial volume.”
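
The figure of 71,631 FC indices quoted above for Task FC and Rest FC follows directly from the 379 regions: it is the number of unordered region pairs. The sketch below checks that arithmetic and shows the r-to-z step on a toy matrix. It is only an illustration under our own assumptions: the function names and the 3-region correlation matrix are hypothetical and are not taken from the authors' pipeline.

```python
import math


def n_unique_pairs(n_regions):
    """Number of unordered region pairs (upper triangle without the diagonal)."""
    return n_regions * (n_regions - 1) // 2


def fisher_r_to_z(r):
    """Fisher's r-to-z transformation, z = atanh(r)."""
    return math.atanh(r)


def vectorise_upper_triangle(corr):
    """Flatten a square correlation matrix into z-transformed, non-overlapping pairs."""
    n = len(corr)
    return [fisher_r_to_z(corr[i][j]) for i in range(n) for j in range(i + 1, n)]


print(n_unique_pairs(379))  # 379 * 378 / 2 = 71631

# Hypothetical 3-region correlation matrix (illustration only).
corr = [
    [1.0, 0.5, -0.2],
    [0.5, 1.0, 0.0],
    [-0.2, 0.0, 1.0],
]
fc_vector = vectorise_upper_triangle(corr)
print(len(fc_vector))  # 3 regions give 3 unique pairs
```

The diagonal is excluded because a region's correlation with itself is 1, for which the r-to-z transformation is undefined.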

      Third, for regression methods and bias correction methods used, we included the following statements:

      From Methods:

      “For the machine learning algorithm, we used Elastic Net (Zou & Hastie, 2005). Elastic Net is a general form of penalised regressions (including Lasso and Ridge regression), allowing us to simultaneously draw information across different brain indices to predict one target variable. Penalised regressions are commonly used for building age-prediction models (Jirsaraie, Gorelik, et al., 2023). Previously we showed that the performance of Elastic Net in predicting cognitive abilities is on par, if not better than, many non-linear and more-complicated algorithms (Pat, Wang, Bartonicek, et al., 2022; Tetereva et al., 2022). Moreover, Elastic Net coefficients are readily explainable, allowing us the ability to explain how our age-prediction and cognition-prediction models made the prediction from each brain feature (Molnar, 2019; Pat, Wang, Bartonicek, et al., 2022) (see below).

      Elastic Net simultaneously minimises the weighted sum of the features’ coefficients. The degree of penalty to the sum of the feature’s coefficients is determined by a shrinkage hyperparameter ‘α’: the greater the α, the more the coefficients shrink, and the more regularised the model becomes. Elastic Net also includes another hyperparameter, ‘l1 ratio’, which determines the degree to which the sum of either the squared (known as ‘Ridge’; l1 ratio=0) or absolute (known as ‘Lasso’; l1 ratio=1) coefficients is penalised (Zou & Hastie, 2005). The objective function of Elastic Net as implemented by sklearn (Pedregosa et al., 2011) is defined as:

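      $$\hat{\beta} = \underset{\beta}{\arg\min} \left( \frac{1}{2 \times n_{\mathrm{samples}}} \lVert y - X\beta \rVert_2^2 + \alpha \times \mathrm{l1\_ratio} \times \lVert \beta \rVert_1 + 0.5 \times \alpha \times (1 - \mathrm{l1\_ratio}) \times \lVert \beta \rVert_2^2 \right)$$

      where X is the features, y is the target, β is the coefficient and n_samples is the number of observations. In our grid search, we tuned two Elastic Net hyperparameters: α using 70 numbers in log space, ranging from .1 to 100, and l1 ratio using 25 numbers in linear space, ranging from 0 to 1.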

      To understand how Elastic Net made a prediction based on different brain features, we examined the coefficients of the tuned model. Elastic Net coefficients can be considered as feature importance, such that more positive Elastic Net coefficients lead to more positive predicted values and, similarly, more negative Elastic Net coefficients lead to more negative predicted values (Molnar, 2019; Pat, Wang, Bartonicek, et al., 2022). While the magnitude of Elastic Net coefficients is regularised (thus making it difficult for us to interpret the magnitude itself directly), we could still indicate that a brain feature with a higher magnitude weights relatively stronger in making a prediction. Another benefit of Elastic Net as a penalised regression is that the coefficients are less susceptible to collinearity among features as they have already been regularised (Dormann et al., 2013; Pat, Wang, Bartonicek, et al., 2022).

      Given that we used five-fold nested cross validation, different outer folds may have different degrees of ‘α’ and ‘l1 ratio’, making the final coefficients differ across folds. For instance, for certain sets of features, penalisation may not play a big part (i.e., higher or lower ‘α’ leads to similar predictive performance), resulting in different ‘α’ for different folds. To remedy this in the visualisation of Elastic Net feature importance, we refitted the Elastic Net model to the full dataset without splitting it into five folds and visualised the coefficients on brain images using the Brainspace (Vos De Wael et al., 2020) and Nilearn (Abraham et al., 2014) packages. Note that, unlike other sets of features, Task FC and Rest FC were modelled after data reduction via PCA. Thus, for Task FC and Rest FC, we first multiplied the absolute PCA scores (extracted from the ‘components_’ attribute of ‘sklearn.decomposition.PCA’) with Elastic Net coefficients and then summed the multiplied values across the 75 components, leaving 71,631 ROI-pair indices.”
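
To make the role of ‘α’ and ‘l1 ratio’ in the quoted Methods more concrete, the sketch below evaluates the Elastic Net loss in the parameterisation documented for scikit-learn's `ElasticNet` and builds a grid of the size described (70 values of α in log space, 25 values of l1 ratio in linear space). It is a plain-Python illustration, not the authors' code: the helper names, the toy data, the omission of an intercept and the exact spacing of the grid (endpoints included) are assumptions.

```python
import math


def elastic_net_objective(X, y, beta, alpha, l1_ratio):
    """Elastic Net loss, scikit-learn parameterisation, without an intercept:
    1/(2n) * ||y - X beta||^2 + alpha * l1_ratio * ||beta||_1
    + 0.5 * alpha * (1 - l1_ratio) * ||beta||_2^2
    """
    n = len(y)
    residuals = [
        yi - sum(xij * bj for xij, bj in zip(row, beta)) for row, yi in zip(X, y)
    ]
    squared_error = sum(r * r for r in residuals) / (2 * n)
    l1_penalty = alpha * l1_ratio * sum(abs(b) for b in beta)
    l2_penalty = 0.5 * alpha * (1 - l1_ratio) * sum(b * b for b in beta)
    return squared_error + l1_penalty + l2_penalty


def logspace(start, stop, num):
    """num values evenly spaced on a log10 scale from start to stop, inclusive."""
    lo, hi = math.log10(start), math.log10(stop)
    return [10 ** (lo + (hi - lo) * k / (num - 1)) for k in range(num)]


def linspace(start, stop, num):
    """num values evenly spaced from start to stop, inclusive."""
    return [start + (stop - start) * k / (num - 1) for k in range(num)]


# Assumed grid: endpoints included, as numpy's logspace/linspace would give.
alphas = logspace(0.1, 100, 70)
l1_ratios = linspace(0, 1, 25)
grid = [(a, r) for a in alphas for r in l1_ratios]
print(len(grid))  # 70 * 25 = 1750 candidate settings

# Toy data (illustration only): two observations, two features.
X = [[1.0, 0.0], [0.0, 1.0]]
y = [1.0, 2.0]
beta = [1.0, 1.0]
loss = elastic_net_objective(X, y, beta, alpha=1.0, l1_ratio=0.5)
print(loss)  # 0.25 (fit) + 1.0 (L1 penalty) + 0.5 (L2 penalty) = 1.75
```

With l1 ratio set to 0 only the squared (‘Ridge’) penalty remains, and with l1 ratio set to 1 only the absolute (‘Lasso’) penalty remains, which matches the description in the quoted text.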

      References

      Abraham, A., Pedregosa, F., Eickenberg, M., Gervais, P., Mueller, A., Kossaifi, J., Gramfort, A., Thirion, B., & Varoquaux, G. (2014). Machine learning for neuroimaging with scikit-learn. Frontiers in Neuroinformatics, 8, 14. https://doi.org/10.3389/fninf.2014.00014

      Ances, B. M., Liang, C. L., Leontiev, O., Perthen, J. E., Fleisher, A. S., Lansing, A. E., & Buxton, R. B. (2009). Effects of aging on cerebral blood flow, oxygen metabolism, and blood oxygenation level dependent responses to visual stimulation. Human Brain Mapping, 30(4), 1120–1132. https://doi.org/10.1002/hbm.20574

      Bashyam, V. M., Erus, G., Doshi, J., Habes, M., Nasrallah, I. M., Truelove-Hill, M., Srinivasan, D., Mamourian, L., Pomponio, R., Fan, Y., Launer, L. J., Masters, C. L., Maruff, P., Zhuo, C., Völzke, H., Johnson, S. C., Fripp, J., Koutsouleris, N., Satterthwaite, T. D., … on behalf of the ISTAGING Consortium, the P. A. disease C., ADNI, and CARDIA studies. (2020). MRI signatures of brain age and disease over the lifespan based on a deep brain network and 14 468 individuals worldwide. Brain, 143(7), 2312–2324. https://doi.org/10.1093/brain/awaa160

      Bookheimer, S. Y., Salat, D. H., Terpstra, M., Ances, B. M., Barch, D. M., Buckner, R. L., Burgess, G. C., Curtiss, S. W., Diaz-Santos, M., Elam, J. S., Fischl, B., Greve, D. N., Hagy, H. A., Harms, M. P., Hatch, O. M., Hedden, T., Hodge, C., Japardi, K. C., Kuhn, T. P., … Yacoub, E. (2019). The Lifespan Human Connectome Project in Aging: An overview. NeuroImage, 185, 335–348. https://doi.org/10.1016/j.neuroimage.2018.10.009

      Butler, E. R., Chen, A., Ramadan, R., Le, T. T., Ruparel, K., Moore, T. M., Satterthwaite, T. D., Zhang, F., Shou, H., Gur, R. C., Nichols, T. E., & Shinohara, R. T. (2021). Pitfalls in brain age analyses. Human Brain Mapping, 42(13), 4092–4101. https://doi.org/10.1002/hbm.25533

      Cole, J. H. (2020). Multimodality neuroimaging brain-age in UK biobank: Relationship to biomedical, lifestyle, and cognitive factors. Neurobiology of Aging, 92, 34–42. https://doi.org/10.1016/j.neurobiolaging.2020.03.014

      Destrieux, C., Fischl, B., Dale, A., & Halgren, E. (2010). Automatic parcellation of human cortical gyri and sulci using standard anatomical nomenclature. NeuroImage, 53(1), 1–15. https://doi.org/10.1016/j.neuroimage.2010.06.010

      Dormann, C. F., Elith, J., Bacher, S., Buchmann, C., Carl, G., Carré, G., Marquéz, J. R. G., Gruber, B., Lafourcade, B., Leitão, P. J., Münkemüller, T., McClean, C., Osborne, P. E., Reineking, B., Schröder, B., Skidmore, A. K., Zurell, D., & Lautenbach, S. (2013). Collinearity: A review of methods to deal with it and a simulation study evaluating their performance. Ecography, 36(1), 27–46. https://doi.org/10.1111/j.1600-0587.2012.07348.x

      Dubois, J., Galdi, P., Paul, L. K., & Adolphs, R. (2018). A distributed brain network predicts general intelligence from resting-state human neuroimaging data. Philosophical Transactions of the Royal Society B: Biological Sciences, 373(1756), 20170284. https://doi.org/10.1098/rstb.2017.0284

      Elliott, M. L., Knodt, A. R., Cooke, M., Kim, M. J., Melzer, T. R., Keenan, R., Ireland, D., Ramrakha, S., Poulton, R., Caspi, A., Moffitt, T. E., & Hariri, A. R. (2019). General functional connectivity: Shared features of resting-state and task fMRI drive reliable and heritable individual differences in functional brain networks. NeuroImage, 189, 516–532. https://doi.org/10.1016/j.neuroimage.2019.01.068

      Fair, D. A., Schlaggar, B. L., Cohen, A. L., Miezin, F. M., Dosenbach, N. U. F., Wenger, K. K., Fox, M. D., Snyder, A. Z., Raichle, M. E., & Petersen, S. E. (2007). A method for using blocked and event-related fMRI data to study “resting state” functional connectivity. NeuroImage, 35(1), 396–405. https://doi.org/10.1016/j.neuroimage.2006.11.051

      Fischl, B. (2012). FreeSurfer. NeuroImage, 62(2), 774–781. https://doi.org/10.1016/j.neuroimage.2012.01.021

      Fischl, B., Salat, D. H., Busa, E., Albert, M., Dieterich, M., Haselgrove, C., van der Kouwe, A., Killiany, R., Kennedy, D., Klaveness, S., Montillo, A., Makris, N., Rosen, B., & Dale, A. M. (2002). Whole Brain Segmentation. Neuron, 33(3), 341–355. https://doi.org/10.1016/S0896-6273(02)00569-X

      Glasser, M. F., Smith, S. M., Marcus, D. S., Andersson, J. L. R., Auerbach, E. J., Behrens, T. E. J., Coalson, T. S., Harms, M. P., Jenkinson, M., Moeller, S., Robinson, E. C., Sotiropoulos, S. N., Xu, J., Yacoub, E., Ugurbil, K., & Van Essen, D. C. (2016). The Human Connectome Project’s neuroimaging approach. Nature Neuroscience, 19(9), 1175–1187. https://doi.org/10.1038/nn.4361

      Glasser, M. F., Sotiropoulos, S. N., Wilson, J. A., Coalson, T. S., Fischl, B., Andersson, J. L., Xu, J., Jbabdi, S., Webster, M., Polimeni, J. R., Van Essen, D. C., & Jenkinson, M. (2013). The minimal preprocessing pipelines for the Human Connectome Project. NeuroImage, 80, 105–124. https://doi.org/10.1016/j.neuroimage.2013.04.127

      Gordon, E. M., Laumann, T. O., Adeyemo, B., Huckins, J. F., Kelley, W. M., & Petersen, S. E. (2016). Generation and Evaluation of a Cortical Area Parcellation from Resting-State Correlations. Cerebral Cortex, 26(1), 288–303. https://doi.org/10.1093/cercor/bhu239

      Gratton, C., Laumann, T. O., Nielsen, A. N., Greene, D. J., Gordon, E. M., Gilmore, A. W., Nelson, S. M., Coalson, R. S., Snyder, A. Z., Schlaggar, B. L., Dosenbach, N. U. F., & Petersen, S. E. (2018). Functional Brain Networks Are Dominated by Stable Group and Individual Factors, Not Cognitive or Daily Variation. Neuron, 98(2), 439-452.e5. https://doi.org/10.1016/j.neuron.2018.03.035

      Hahn, T., Fisch, L., Ernsting, J., Winter, N. R., Leenings, R., Sarink, K., Emden, D., Kircher, T., Berger, K., & Dannlowski, U. (2021). From ‘loose fitting’ to high-performance, uncertainty-aware brain-age modelling. Brain, 144(3), e31–e31. https://doi.org/10.1093/brain/awaa454

      Harms, M. P., Somerville, L. H., Ances, B. M., Andersson, J., Barch, D. M., Bastiani, M., Bookheimer, S. Y., Brown, T. B., Buckner, R. L., Burgess, G. C., Coalson, T. S., Chappell, M. A., Dapretto, M., Douaud, G., Fischl, B., Glasser, M. F., Greve, D. N., Hodge, C., Jamison, K. W., … Yacoub, E. (2018). Extending the Human Connectome Project across ages: Imaging protocols for the Lifespan Development and Aging projects. NeuroImage, 183, 972–984. https://doi.org/10.1016/j.neuroimage.2018.09.060

      Insel, T., Cuthbert, B., Garvey, M., Heinssen, R., Pine, D. S., Quinn, K., Sanislow, C., & Wang, P. (2010). Research Domain Criteria (RDoC): Toward a New Classification Framework for Research on Mental Disorders. American Journal of Psychiatry, 167(7), 748–751. https://doi.org/10.1176/appi.ajp.2010.09091379

      Jirsaraie, R. J., Gorelik, A. J., Gatavins, M. M., Engemann, D. A., Bogdan, R., Barch, D. M., & Sotiras, A. (2023). A systematic review of multimodal brain age studies: Uncovering a divergence between model accuracy and utility. Patterns, 4(4), 100712. https://doi.org/10.1016/j.patter.2023.100712

      Jirsaraie, R. J., Kaufmann, T., Bashyam, V., Erus, G., Luby, J. L., Westlye, L. T., Davatzikos, C., Barch, D. M., & Sotiras, A. (2023). Benchmarking the generalizability of brain age models: Challenges posed by scanner variance and prediction bias. Human Brain Mapping, 44(3), 1118–1128. https://doi.org/10.1002/hbm.26144

      Marquand, A. F., Rezek, I., Buitelaar, J., & Beckmann, C. F. (2016). Understanding Heterogeneity in Clinical Cohorts Using Normative Models: Beyond Case-Control Studies. Biological Psychiatry, 80(7), 552–561. https://doi.org/10.1016/j.biopsych.2015.12.023

      Molnar, C. (2019). Interpretable Machine Learning. A Guide for Making Black Box Models Explainable. https://christophm.github.io/interpretable-ml-book/

      Nimon, K., Lewis, M., Kane, R., & Haynes, R. M. (2008). An R package to compute commonality coefficients in the multiple regression case: An introduction to the package and a practical example. Behavior Research Methods, 40(2), 457–466. https://doi.org/10.3758/BRM.40.2.457

      Pat, N., Wang, Y., Anney, R., Riglin, L., Thapar, A., & Stringaris, A. (2022). Longitudinally stable, brain‐based predictive models mediate the relationships between childhood cognition and socio‐demographic, psychological and genetic factors. Human Brain Mapping, hbm.26027. https://doi.org/10.1002/hbm.26027

      Pat, N., Wang, Y., Bartonicek, A., Candia, J., & Stringaris, A. (2022). Explainable machine learning approach to predict and explain the relationship between task-based fMRI and individual differences in cognition. Cerebral Cortex, bhac235. https://doi.org/10.1093/cercor/bhac235

      Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., Blondel, M., Prettenhofer, P., Weiss, R., Dubourg, V., Vanderplas, J., Passos, A., Cournapeau, D., Brucher, M., Perrot, M., & Duchesnay, É. (2011). Scikit-learn: Machine Learning in Python. Journal of Machine Learning Research, 12(85), 2825–2830.

      Poldrack, R. A., Huckins, G., & Varoquaux, G. (2020). Establishment of Best Practices for Evidence for Prediction: A Review. JAMA Psychiatry, 77(5), 534–540. https://doi.org/10.1001/jamapsychiatry.2019.3671

      Rasero, J., Sentis, A. I., Yeh, F.-C., & Verstynen, T. (2021). Integrating across neuroimaging modalities boosts prediction accuracy of cognitive ability. PLOS Computational Biology, 17(3), e1008347. https://doi.org/10.1371/journal.pcbi.1008347

      Robinson, E. C., Garcia, K., Glasser, M. F., Chen, Z., Coalson, T. S., Makropoulos, A., Bozek, J., Wright, R., Schuh, A., Webster, M., Hutter, J., Price, A., Cordero Grande, L., Hughes, E., Tusor, N., Bayly, P. V., Van Essen, D. C., Smith, S. M., Edwards, A. D., … Rueckert, D. (2018). Multimodal surface matching with higher-order smoothness constraints. NeuroImage, 167, 453–465. https://doi.org/10.1016/j.neuroimage.2017.10.037

      Rokicki, J., Wolfers, T., Nordhøy, W., Tesli, N., Quintana, D. S., Alnæs, D., Richard, G., de Lange, A.-M. G., Lund, M. J., Norbom, L., Agartz, I., Melle, I., Nærland, T., Selbæk, G., Persson, K., Nordvik, J. E., Schwarz, E., Andreassen, O. A., Kaufmann, T., & Westlye, L. T. (2021). Multimodal imaging improves brain age prediction and reveals distinct abnormalities in patients with psychiatric and neurological disorders. Human Brain Mapping, 42(6), 1714–1726. https://doi.org/10.1002/hbm.25323

      Somerville, L. H., Bookheimer, S. Y., Buckner, R. L., Burgess, G. C., Curtiss, S. W., Dapretto, M., Elam, J. S., Gaffrey, M. S., Harms, M. P., Hodge, C., Kandala, S., Kastman, E. K., Nichols, T. E., Schlaggar, B. L., Smith, S. M., Thomas, K. M., Yacoub, E., Van Essen, D. C., & Barch, D. M. (2018). The Lifespan Human Connectome Project in Development: A large-scale study of brain connectivity development in 5–21 year olds. NeuroImage, 183, 456–468. https://doi.org/10.1016/j.neuroimage.2018.08.050

      Sperling, R. A., Bates, J. F., Cocchiarella, A. J., Schacter, D. L., Rosen, B. R., & Albert, M. S. (2001). Encoding novel face-name associations: A functional MRI study. Human Brain Mapping, 14(3), 129–139. https://doi.org/10.1002/hbm.1047

      Sripada, C., Angstadt, M., Rutherford, S., Kessler, D., Kim, Y., Yee, M., & Levina, E. (2019). Basic Units of Inter-Individual Variation in Resting State Connectomes. Scientific Reports, 9(1), Article 1. https://doi.org/10.1038/s41598-018-38406-5

      Sripada, C., Angstadt, M., Rutherford, S., Taxali, A., & Shedden, K. (2020). Toward a “treadmill test” for cognition: Improved prediction of general cognitive ability from the task activated brain. Human Brain Mapping, 41(12), 3186–3197. https://doi.org/10.1002/hbm.25007

      Tetereva, A., Li, J., Deng, J. D., Stringaris, A., & Pat, N. (2022). Capturing brain‐cognition relationship: Integrating task‐based fMRI across tasks markedly boosts prediction and test‐retest reliability. NeuroImage, 263, 119588. https://doi.org/10.1016/j.neuroimage.2022.119588

      Vieira, B. H., Pamplona, G. S. P., Fachinello, K., Silva, A. K., Foss, M. P., & Salmon, C. E. G. (2022). On the prediction of human intelligence from neuroimaging: A systematic review of methods and reporting. Intelligence, 93, 101654. https://doi.org/10.1016/j.intell.2022.101654

      Vos De Wael, R., Benkarim, O., Paquola, C., Lariviere, S., Royer, J., Tavakol, S., Xu, T., Hong, S.-J., Langs, G., Valk, S., Misic, B., Milham, M., Margulies, D., Smallwood, J., & Bernhardt, B. C. (2020). BrainSpace: A toolbox for the analysis of macroscale gradients in neuroimaging and connectomics datasets. Communications Biology, 3(1), 103. https://doi.org/10.1038/s42003-020-0794-7

      Woolrich, M. W., Ripley, B. D., Brady, M., & Smith, S. M. (2001). Temporal Autocorrelation in Univariate Linear Modeling of FMRI Data. NeuroImage, 14(6), 1370–1386. https://doi.org/10.1006/nimg.2001.0931

      Zou, H., & Hastie, T. (2005). Regularization and variable selection via the elastic net. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 67(2), 301–320. https://doi.org/10.1111/j.1467-9868.2005.00503.x


      The following is the authors’ response to the previous reviews.

      eLife assessment

      This useful manuscript challenges the utility of current paradigms for estimating brain-age with magnetic resonance imaging measures, but presents inadequate evidence to support the suggestion that an alternative approach focused on predicting cognition is more useful. The paper would benefit from a clearer explication of the methods and a more critical evaluation of the conceptual basis of the different models. This work will be of interest to researchers working on brain-age and related models.

      Thank you so much for providing high-quality reviews on our manuscript. We revised the manuscript to address all of the reviewers’ comments and provided full responses to each of the comments below. Importantly, in this revision, we clarified that we did not intend to use Brain Cognition as an alternative approach. This is because, by design, the variation in fluid cognition explained by Brain Cognition should be higher than or equal to that explained by Brain Age. Here we made this point more explicit and further stated that the relationship between Brain Cognition and fluid cognition indicates the upper limit of Brain Age’s capability in capturing fluid cognition. By examining what was captured by Brain Cognition, over and above Brain Age and chronological age via the unique effects of Brain Cognition, we were able to quantify the amount of co-variation between brain MRI and fluid cognition that was missed by Brain Age. Such quantification is the third aim of this study.

      Public Reviews:

      Reviewer 1 (Public Review):

      In this paper, the authors evaluate the utility of brain-age-derived metrics for predicting cognitive decline by performing a 'commonality' analysis in a downstream regression that enables the different contribution of different predictors to be assessed. The main conclusion is that brain-age-derived metrics do not explain much additional variation in cognition over and above what is already explained by age. The authors propose to use a regression model trained to predict cognition ("brain-cognition") as an alternative suited to applications of cognitive decline. While this is less accurate overall than brain age, it explains more unique variance in the downstream regression.

      (1) I thank the authors for addressing many of my concerns with this revision. However, I do not feel they have addressed them all. In particular I think the authors could do more to address the concern I raised about the instability of the regression coefficients and about providing enough detail to determine that the stacked regression models do not overfit.

      Thank you Reviewer 1 for the comment. We addressed them in our response to Reviewer 1 Recommendations For The Authors #1 and #2 (see below).

      (2) In considering my responses to the authors revision, I also must say that I agree with Reviewer 3 about the limitations of the brain age and brain cognition methods conceptually. In particular that the regression model used to predict fluid cognition will by construction explain more variance in cognition than a brain age model that is trained to predict age. To be fair, these conceptual problems are more widespread than this paper alone, so I do not believe the authors should be penalised for that. However, I would recommend to make these concerns more explicit in the manuscript

      Thank you Reviewer 1 for the comment. We addressed them in our response to Reviewer 1 Recommendations For The Authors #3 (see below).

      Reviewer 2 (Public Review):

      In this study, the authors aimed to evaluate the contribution of brain-age indices in capturing variance in cognitive decline and proposed an alternative index, brain-cognition, for consideration.

      The study employs suitable methods and data to address the research questions, and the methods and results sections are generally clear and easy to follow.

      I appreciate the authors' efforts in significantly improving the paper, including some considerable changes, from the original submission. While not all reviewer points were tackled, the majority of them were adequately addressed. These include additional analyses, more clarity in the methods and a much richer and nuanced discussion. While recognising the merits of the revised paper, I have a few additional comments.

      (1) Perhaps it would help the reader to note that it might be expected for brain-cognition to account for a significantly larger variance (11%) in fluid cognition, in contrast to brain-age. This stems from the fact that the authors specifically trained brain-cognition to predict fluid cognition, the very variable under consideration. In line with this, the authors later recommend that researchers considering the use of brain-age should evaluate its utility using a regression approach. The latter involves including a brain index (e.g. brain-cognition) previously trained to predict the regression's target variable (e.g. fluid cognition) alongside a brain-age index (e.g., corrected brain-age gap). If the target-trained brain index outperforms the brain-age metric, it suggests that relying solely on brain-age might not be the optimal choice. Although not necessarily the case, is it surprising for the target-trained brain index to demonstrate better performance than brain-age? This harks back to the broader point raised in the initial review: while brain-age may prove useful (though sometimes with modest effect sizes) across diverse outcomes as a generally applicable metric, a brain index tailored for predicting a specific outcome, such as brain-cognition in this case, might capture a considerably larger share of variance in that specific context but could lack broader applicability. The latter aspect needs to be empirically assessed.

      Thank you so much for raising this point. Reviewer 1 (Public Review #2/Recommendations For The Authors #3) and Reviewer 3 (Recommendations for the Authors #1) made a similar observation. We now made changes to the introduction and discussion to address this concern (please see our responses to Reviewer 1 Recommendations For The Authors #3 below).

      Briefly, as in our 2nd revision, we did not intend to compare Brain Age with Brain Cognition since, by design, the variation in fluid cognition explained by Brain Cognition should be higher than or equal to that explained by Brain Age. Here we made this point more explicit and further stated that the relationship between Brain Cognition and fluid cognition indicates the upper limit of Brain Age’s capability in capturing fluid cognition. By examining what was captured by Brain Cognition, over and above Brain Age and chronological age via the unique effects of Brain Cognition, we were able to quantify the amount of co-variation between brain MRI and fluid cognition that was missed by Brain Age. Such quantification is the third aim of this study.
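
The notion of a unique effect used above can be sketched as the drop in R² when one predictor is removed from a regression that contains all predictors. The example below is a minimal illustration on simulated data and is not the analysis reported in the manuscript: the variable names, the simulated effect sizes, the seed and the small least-squares helper are all assumptions made for the sketch, and a full commonality analysis would additionally partition the shared (common) effects.

```python
import random


def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [bi] for row, bi in zip(A, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (M[r][n] - sum(M[r][c] * x[c] for c in range(r + 1, n))) / M[r][r]
    return x


def r_squared(predictors, y):
    """R^2 of an ordinary least-squares fit of y on the predictors plus an intercept."""
    rows = [[1.0] + list(p) for p in zip(*predictors)]
    k = len(rows[0])
    XtX = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    Xty = [sum(r[i] * yi for r, yi in zip(rows, y)) for i in range(k)]
    beta = solve(XtX, Xty)
    fitted = [sum(b * v for b, v in zip(beta, r)) for r in rows]
    mean_y = sum(y) / len(y)
    ss_res = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    return 1 - ss_res / ss_tot


def unique_effect(target_predictor, other_predictors, y):
    """R^2 of the full model minus R^2 of the model without the target predictor."""
    full = r_squared([target_predictor] + other_predictors, y)
    reduced = r_squared(other_predictors, y)
    return full - reduced


# Simulated data (illustration only; effect sizes are arbitrary assumptions).
random.seed(0)
n = 200
age = [random.gauss(0, 1) for _ in range(n)]
signal = [random.gauss(0, 1) for _ in range(n)]  # brain signal unrelated to age
brain_age = [a + random.gauss(0, 0.5) for a in age]
brain_cognition = [-0.5 * a + s for a, s in zip(age, signal)]
cognition = [-0.6 * a + 0.5 * s + random.gauss(0, 0.5) for a, s in zip(age, signal)]

unique_brain_cognition = unique_effect(brain_cognition, [brain_age, age], cognition)
unique_brain_age = unique_effect(brain_age, [brain_cognition, age], cognition)
print(unique_brain_cognition, unique_brain_age)
```

In this simulation the age-unrelated brain signal reaches the target only through the cognition-trained index, so that index keeps a sizeable unique effect while the age-trained index keeps almost none. This mirrors the by-design asymmetry acknowledged in the response rather than providing evidence for it.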

      (2) Furthermore, the discussion pertaining to training brain-age models on healthy populations for subsequent testing on individuals with neurological or psychological disorders seems somewhat one-sided within the broader debate. This one-sidedness might potentially confuse readers. It is worth noting that the choice to employ healthy participants in the training model is likely deliberate, serving as a norm against which atypical populations are compared. To provide a more comprehensive understanding, referencing Tim Hahn's counterargument to Bashyam's perspective could offer a more complete view (https://academic.oup.com/brain/article/144/3/e31/6214475?login=false).

      Thank you Reviewer 2 for bringing up this issue. We have now revised the paragraph in question and added nuances on the usage of Brain Age for normative vs. case-control studies. We also cited Tim Hahn’s article that explained the conceptual foundation of the use of Brain Age in case-control studies. Please see below. Additionally, we also made a statement about our study not being able to address issues about the case-control studies directly in the newly written conclusion (see Reviewer 3 Recommendations for the Authors #3).

      Discussion:

      “There is a notable difference between studies investigating the utility of Brain Age in explaining cognitive functioning, including ours and others (e.g., Butler et al., 2021; Cole, 2020, 2020; Jirsaraie et al., 2023) and those explaining neurological/psychological disorders (e.g., Bashyam et al., 2020; Rokicki et al., 2021). We consider the former as a normative type of study and the latter as a case-control type of study (Insel et al., 2010; Marquand et al., 2016). Those case-control Brain Age studies focusing on neurological/psychological disorders often build age-prediction models from MRI data of largely healthy participants (e.g., controls in a case-control design or large samples in a population-based design), apply the built age-prediction models to participants without vs. with neurological/psychological disorders and compare Brain Age indices between the two groups. On the one hand, this means that case-control studies treat Brain Age as a method to detect anomalies in the neurological/psychological group (Hahn et al., 2021). On the other hand, this also means that case-control studies have to ignore under-fitted models when applying prediction models built from largely healthy participants to participants with neurological/psychological disorders (i.e., Brain Age may predict chronological age well for the controls, but not for those with a disorder). In contrast, our study and other normative studies focusing on cognitive functioning often build age-prediction models from MRI data of largely healthy participants and apply the built age-prediction models to participants who are also largely healthy. Accordingly, the age-prediction models for explaining cognitive functioning in normative studies, while not allowing us to detect group-level anomalies, do not suffer from being under-fitted. This unfortunately might limit the generalisability of our study to just the normative type of study. Future work is still needed to test the utility of Brain Age in the case-control case.”

      (3) Overall, this paper makes a significant contribution to the field of brain-age and related brain indices and their utility.

      Thank you for the encouragement.

      Reviewer 3 (Public Review):

      The main question of this article is as follows: "To what extent does having information on brain-age improve our ability to capture declines in fluid cognition beyond knowing a person's chronological age?" This question is worthwhile, considering that there is considerable confusion in the field about the nature of brain-age.

      (1) Thank you to the authors for addressing so many of my concerns with this revision. There are a few points that I feel still need addressing/clarifying related to 1) calculating brain cognition, 2) the inevitability of their results, and 3) their continued recommendation to use brain-age metrics.

      Thank you Reviewer 3 for the comment. We addressed them in our response to Reviewer 3 Recommendations For The Authors #1-3 (see below).

      Recommendations for the authors:

      Reviewer 1 (Recommendations For The Authors):

      (1) I do not feel the authors have fully addressed the concern I raised about the stacked regression models. Despite the new figure, it is still not entirely clear what the authors are using as the training set in the final step. To be clear, the problem occurs because of the parameters, not the hyperparameters (which the authors now state that they are optimising via nested grid search). In other words, given a regression model y = X*beta, if the X are taken to be predictions from a lower level regression model, then they contain information that is derived from both the training set and the test set for the model that this was trained on. If the split is the same (i.e. the predictions are derived on the same test set as is being used at the second level), then this can lead to overfitting. It is not clear to me whether the authors have done this or not. Please provide additional detail to clarify this point.

      Thank you for allowing us an opportunity to clarify our stacked model. We wanted to confirm that we did not use test sets to build a stacked model in both lower and higher levels of the Elastic Net models. Test sets were there just for testing the performance of the models. We made additional clarification to make this clearer (see below). Let us explain what we did and provide the rationales below.

      From Methods:

      “We used nested cross-validation (CV) to build these prediction models (see Figure 7). We first split the data into five outer folds, leaving each outer fold with around 100 participants. This number of participants in each fold is to ensure the stability of the test performance across folds. In each outer-fold CV loop, one of the outer folds was treated as an outer-fold test set, and the rest was treated as an outer-fold training set. Ultimately, looping through the nested CV resulted in a) prediction models from each of the 18 sets of features as well as b) prediction models that drew information across different combinations of the 18 separate sets, known as “stacked models.” We specified eight stacked models: “All” (i.e., including all 18 sets of features), “All excluding Task FC”, “All excluding Task Contrast”, “Non-Task” (i.e., including only Rest FC and sMRI), “Resting and Task FC”, “Task Contrast and FC”, “Task Contrast” and “Task FC”. Accordingly, there were 26 prediction models in total for both Brain Age and Brain Cognition.

      To create these 26 prediction models, we applied three steps for each outer-fold loop. The first step aimed at tuning prediction models for each of 18 sets of features. This step only involved the outer-fold training set and did not involve the outer-fold test set. Here, we divided the outer-fold training set into five inner folds and applied inner-fold CV to tune hyperparameters with grid search. Specifically, in each inner-fold CV, one of the inner folds was treated as an inner-fold validation set, and the rest was treated as an inner-fold training set. Within each inner-fold CV loop, we used the inner-fold training set to estimate parameters of the prediction model with a particular set of hyperparameters and applied the estimated model to the inner-fold validation set. After looping through the inner-fold CV, we, then, chose the prediction models that led to the highest performance, reflected by coefficient of determination (R2), on average across the inner-fold validation sets. This led to 18 tuned models, one for each of the 18 sets of features, for each outer fold.

      The second step aimed at tuning stacked models. Same as the first step, the second step only involved the outer-fold training set and did not involve the outer-fold test set. Here, using the same outer-fold training set as the first step, we applied tuned models, created from the first step, one from each of the 18 sets of features, resulting in 18 predicted values for each participant. We, then, re-divided this outer-fold training set into five new inner folds. In each inner fold, we treated different combinations of the 18 predicted values from separate sets of features as features to predict the targets in separate “stacked” models. Same as the first step, in each inner-fold CV loop, we treated one out of five inner folds as an inner-fold validation set, and the rest as an inner-fold training set. Also as in the first step, we used the inner-fold training set to estimate parameters of the prediction model with a particular set of hyperparameters from our grid. We tuned the hyperparameters of stacked models using grid search by selecting the models with the highest R2 on average across the inner-fold validation sets. This led to eight tuned stacked models.

      The third step aimed at testing the predictive performance of the 18 tuned prediction models from each of the set of features, built from the first step, and eight tuned stacked models, built from the second step. Unlike the first two steps, here we applied the already tuned models to the outer-fold test set. We started by applying the 18 tuned prediction models from each of the sets of features to each observation in the outer-fold test set, resulting in 18 predicted values. We then applied the tuned stacked models to these predicted values from separate sets of features, resulting in eight predicted values.

      To demonstrate the predictive performance, we assessed the similarity between the observed values and the predicted values of each model across outer-fold test sets, using Pearson’s r, coefficient of determination (R2) and mean absolute error (MAE). Note that for R2, we used the sum of squares definition (i.e., R2 = 1 – (sum of squares residuals/total sum of squares)) per a previous recommendation (Poldrack et al., 2020). We considered the predicted values from the outer-fold test sets of models predicting age or fluid cognition, as Brain Age and Brain Cognition, respectively.”
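
      As an illustration only (this is not the code used in the study), the three performance metrics named above can be sketched in plain Python. The function names are hypothetical; the R2 function follows the sum-of-squares definition quoted in the text rather than the squared correlation.

```python
import math


def pearson_r(observed, predicted):
    """Pearson correlation between observed and predicted values."""
    n = len(observed)
    mean_o = sum(observed) / n
    mean_p = sum(predicted) / n
    cov = sum((o - mean_o) * (p - mean_p) for o, p in zip(observed, predicted))
    var_o = sum((o - mean_o) ** 2 for o in observed)
    var_p = sum((p - mean_p) ** 2 for p in predicted)
    return cov / math.sqrt(var_o * var_p)


def r2_sum_of_squares(observed, predicted):
    """R2 = 1 - (residual sum of squares / total sum of squares)."""
    mean_o = sum(observed) / len(observed)
    ss_res = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    ss_tot = sum((o - mean_o) ** 2 for o in observed)
    return 1.0 - ss_res / ss_tot


def mean_absolute_error(observed, predicted):
    """Average absolute difference between observed and predicted values."""
    return sum(abs(o - p) for o, p in zip(observed, predicted)) / len(observed)
```

      The two definitions of R2 can diverge: predictions that are shifted from the observed values by a constant still give r = 1, but the sum-of-squares R2 becomes negative because the residuals are large.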

      Author response image 1.

      Diagram of the nested cross-validation used for creating predictions for models of each set of features as well as predictions for stacked models.

      Note some previous research, including ours (Tetereva et al., 2022), splits the observations in the outer-fold training set into layer 1 and layer 2 and applies the first and second steps to layers 1 and 2, respectively. Here we decided against this approach and used the same outer-fold training set for both first and second steps in order to avoid potential bias toward the stacked models. This is because, when the data are split into two layers, predictive models built for each separate set of features only use the data from layer 1, while the stacked models use the data from both layers 1 and 2. In practice with large enough data, these two approaches might not differ much, as we demonstrated previously (Tetereva et al., 2022).
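
      The index bookkeeping behind the three steps can be sketched as follows. This is a minimal illustration under stated assumptions, not the study's pipeline: `k_fold_indices` and `nested_cv_plan` are hypothetical helpers, the sample size of 500 is assumed from "five outer folds of around 100 participants", and no models are fitted. The point it shows is that both tuning steps draw their inner folds from the outer-fold training set only, so the outer-fold test set never contributes to parameter or hyperparameter estimation.

```python
import random


def k_fold_indices(indices, k, rng):
    """Shuffle the given indices and deal them into k roughly equal folds."""
    shuffled = list(indices)
    rng.shuffle(shuffled)
    return [shuffled[i::k] for i in range(k)]


def nested_cv_plan(n_participants, k_outer=5, k_inner=5, seed=0):
    """Return, for each outer fold, the index sets used by steps 1 to 3."""
    rng = random.Random(seed)
    plan = []
    for outer_test in k_fold_indices(range(n_participants), k_outer, rng):
        held_out = set(outer_test)
        outer_train = [i for i in range(n_participants) if i not in held_out]
        plan.append({
            "outer_train": outer_train,
            # Step 1: inner folds for tuning one model per set of features.
            "step1_inner_folds": k_fold_indices(outer_train, k_inner, rng),
            # Step 2: the same outer training set, re-divided for stacked models.
            "step2_inner_folds": k_fold_indices(outer_train, k_inner, rng),
            # Step 3: tuned models are only applied to these participants.
            "outer_test": sorted(outer_test),
        })
    return plan


plan = nested_cv_plan(n_participants=500)
for fold in plan:
    # No participant is in both the outer training and outer test set.
    assert not set(fold["outer_train"]) & set(fold["outer_test"])
```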

      (2) I also do not feel the authors have fully addressed the concern I raised about stability of the regression coefficients over splits of the data. I wanted to see the regression coefficients, not the predictions. The predictions can be stable when the coefficients are not.

      The focus of this article is on the predictions. Still, as pointed out by reviewer 1, it is informative for readers to understand how stable the feature importance (i.e., Elastic Net coefficients) is. To demonstrate the stability of feature importance, we now examined the rank stability of feature importance using Spearman’s ρ (see Figure 4). Specifically, we correlated the feature importance between two prediction models of the same features, used in two different outer-fold test sets. Given that there were five outer-fold test sets, we computed 10 Spearman’s ρ for each prediction model of the same features. We found Spearman’s ρ to vary dramatically in both age-prediction (range=.31-.94) and fluid cognition-prediction (range=.16-.84) models. This means that some prediction models were much more stable in their feature importance than others. This is probably due to various factors such as a) the collinearity of features in the model, b) the number of features (e.g., 71,631 features in functional connectivity, which were further reduced to 75 PCAs, as compared to 19 features in subcortical volume based on the ASEG atlas), c) the penalisation of coefficients either with ‘Ridge’ or ‘Lasso’ methods, which resulted in reduction as a group of features or selection of a feature among correlated features, respectively, and d) the predictive performance of the models. Understanding the stability of feature importance is beyond the scope of the current article. As mentioned by Reviewer 1, “The predictions can be stable when the coefficients are not,” and we chose to focus on the prediction in the current article.

      Author response image 2.

      Stability of feature importance (i.e., Elastic Net Coefficients) of prediction models. Each dot represents rank stability (reflected by Spearman’s ρ) in the feature importance between two prediction models of the same features, used in two different outer-fold test sets. Given that there were five outer-fold test sets, there were 10 Spearman’s ρs for each prediction model. The numbers to the right of the plots indicate the mean of Spearman’s ρ for each prediction model.
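
      A minimal sketch of this rank-stability computation, in plain Python, is given below. It is an illustration under assumptions, not the authors' code: the function names are hypothetical, and `coefs_by_fold` is assumed to hold one list of Elastic Net coefficients per outer fold, with features in the same order in every list. Five folds yield the 10 pairwise Spearman’s ρ values described above.

```python
from itertools import combinations
import math


def ranks(values):
    """1-based ranks; tied values share the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        mean_rank = (pos + end) / 2 + 1
        for j in range(pos, end + 1):
            result[order[j]] = mean_rank
        pos = end + 1
    return result


def pearson(x, y):
    """Pearson correlation; undefined if either input has zero variance."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def spearman_rho(x, y):
    """Spearman's rho is the Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


def rank_stability(coefs_by_fold):
    """Spearman's rho for every pair of folds (10 pairs for five folds)."""
    return [spearman_rho(a, b) for a, b in combinations(coefs_by_fold, 2)]
```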

      (3) I also must say that I agree with Reviewer 3 about the limitations of the brain-age and brain-cognition methods conceptually. In particular that the regression model used to predict fluid cognition will by construction explain more variance in cognition than a brain-age model that is trained to predict age. This suffers from the same problem the authors raise with brain-age and I agree that this would probably disappear if the authors had a separate measure of cognition against which to validate and were then to regress this out as they do for age correction. I am aware that these conceptual problems are more widespread than this paper alone (in fact throughout the brain-age literature), so I do not believe the authors should be penalised for that. However, I do think they can make these concerns more explicit and further tone down the comments they make about the utility of brain-cognition.

      Thank you so much for raising this point. Reviewer 2 (Public Review #1) and Reviewer 3 (Recommendations for the Authors #1) made a similar observation. We have now made changes to the introduction and discussion to address this concern (see below).

      Briefly, we made it explicit that, by design, the variation in fluid cognition explained by Brain Cognition should be higher than or equal to that explained by Brain Age. That is, the relationship between Brain Cognition and fluid cognition indicates the upper limit of Brain Age’s capability in capturing fluid cognition. More importantly, by examining what was captured by Brain Cognition, over and above Brain Age and chronological age via the unique effects of Brain Cognition, we were able to quantify the amount of co-variation between brain MRI and fluid cognition that was missed by Brain Age. This is the third goal of the present study.

      From Introduction:

      “Third and finally, certain variation in fluid cognition is related to brain MRI, but to what extent does Brain Age not capture this variation? To estimate the variation in fluid cognition that is related to the brain MRI, we could build prediction models that directly predict fluid cognition (i.e., as opposed to chronological age) from brain MRI data. Previous studies found reasonable predictive performances of these cognition-prediction models, built from certain MRI modalities (Dubois et al., 2018; Pat et al., 2022; Rasero et al., 2021; Sripada et al., 2020; Tetereva et al., 2022; for review, see Vieira et al., 2022). Analogous to Brain Age, we called the predicted values from these cognition-prediction models, Brain Cognition. The strength of an out-of-sample relationship between Brain Cognition and fluid cognition reflects variation in fluid cognition that is related to the brain MRI and, therefore, indicates the upper limit of Brain Age’s capability in capturing fluid cognition. That is, by design, the variation in fluid cognition explained by Brain Cognition should be higher than or equal to that explained by Brain Age. Consequently, if we included Brain Cognition, Brain Age and chronological age in the same model to explain fluid cognition, we would be able to examine the unique effects of Brain Cognition that explain fluid cognition beyond Brain Age and chronological age. These unique effects of Brain Cognition, in turn, would indicate the amount of co-variation between brain MRI and fluid cognition that is missed by Brain Age.”

      From Discussion:

      “Third, by introducing Brain Cognition, we showed the extent to which Brain Age indices were not able to capture the variation in fluid cognition that is related to brain MRI. More specifically, using Brain Cognition allowed us to gauge the variation in fluid cognition that is related to the brain MRI, and thereby, to estimate the upper limit of what Brain Age can do. Moreover, by examining what was captured by Brain Cognition, over and above Brain Age and chronological age via the unique effects of Brain Cognition, we were able to quantify the amount of co-variation between brain MRI and fluid cognition that was missed by Brain Age.

      From our results, Brain Cognition, especially from certain cognition-prediction models such as the stacked models, has relatively good predictive performance, consistent with previous studies (Dubois et al., 2018; Pat et al., 2022; Rasero et al., 2021; Sripada et al., 2020; Tetereva et al., 2022; for review, see Vieira et al., 2022). We then examined Brain Cognition using commonality analyses (Nimon et al., 2008) in multiple regression models having a Brain Age index, chronological age and Brain Cognition as regressors to explain fluid cognition. Similar to Brain Age indices, Brain Cognition exhibited large common effects with chronological age. But more importantly, unlike Brain Age indices, Brain Cognition showed large unique effects, up to around 11%. As explained above, the unique effects of Brain Cognition indicated the amount of co-variation between brain MRI and fluid cognition that was missed by a Brain Age index and chronological age. This missing amount was relatively high, considering that Brain Age and chronological age together explained around 32% of the total variation in fluid cognition. Accordingly, if a Brain Age index was used as a biomarker along with chronological age, we would have missed an opportunity to improve the performance of the model by around one-third of the variation explained.”
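
      As a rough check on the "one-third" figure, using only the approximate percentages quoted above (unique effects of Brain Cognition of up to around 11%, against around 32% explained by Brain Age and chronological age together):

      $$
      \frac{\Delta R^2_{\text{Brain Cognition}}}{R^2_{\text{chronological age, Brain Age index}}} \approx \frac{0.11}{0.32} \approx 0.34
      $$

      Because 11% is the upper end of the reported unique effects, this ratio is best read as an upper bound on the missed share.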

      Reviewer #3 (Recommendations For The Authors):

      Thank you to the authors for addressing so many of my concerns with this revision. There are a few points that I feel still need addressing/clarifying related to: 1) calculating brain cognition, 2) the inevitability of their results, and 3) their continued recommendation to use brain age metrics.

      (1) I understand your point here. I think the distinction is that it is fine to build predictive models, but then there is no need to go through this intermediate step of "brain-cognition". Just say that brain features can predict cognition XX well, and brain-age (or some related metric) can predict cognition YY well. It creates a confusing framework for the reader that can lead them to believe that "brain-cognition" is not just a predicted value of fluid cognition from a model using brain features to predict cognition. While you clearly state that that is in fact what it is in the text, which is a huge improvement, I do not see what is added by going through brain-cognition instead of simply just obtaining a change in R2 where the first model uses brain features alone to predict cognition, and the second adds on brain-age (or related metrics), or vice versa, depending on the question. Please do this analysis, and either compare and contrast it with going through "brain-cognition" in your paper, or switch to this analysis, as it more directly addresses the question of the incremental predictive utility of brain-age above and beyond brain features.

      Thank you so much for raising this point. Reviewer 1 (Public Review #2/Recommendations For The Authors #3) and Reviewer 2 (Public Review #1) made a similar observation. We have now made changes to the introduction and discussion to address this concern (see our responses to Reviewer 1 Recommendations For The Authors #3 above).

      Briefly, as in our 2nd revision, we made it explicitly clear that we did not intend to compare Brain Age with Brain Cognition since, by design, the variation in fluid cognition explained by Brain Cognition should be higher than or equal to that explained by Brain Age. Moreover, by examining what was captured by Brain Cognition, over and above Brain Age and chronological age via the unique effects of Brain Cognition, we were able to quantify the amount of co-variation between brain MRI and fluid cognition that was missed by Brain Age.

      We have thought about changing the name Brain Cognition into something along the lines of “predicted values of prediction models predicting fluid cognition based on brain MRI.” However, this made the manuscript hard to follow, especially with the commonality analyses. For instance, the sentence, “Here, we tested Brain Cognition’s unique effects in multiple regression models with a Brain Age index, chronological age and Brain Cognition as regressors to explain fluid cognition” would become “Here, we tested predicted values of prediction models predicting fluid cognition based on brain MRI unique effects in multiple regression models with a Brain Age index, chronological age and predicted values of prediction models predicting fluid cognition based on brain MRI as regressors to explain fluid cognition.” We believe, given our additional explanation (see our responses to Reviewer 1 Recommendations For The Authors #3 above), readers should understand what Brain Cognition is, and that we did not intend to compare Brain Age and Brain Cognition directly.

      As for the suggested analysis, “obtaining a change in R2 where the first model uses brain features alone to predict cognition, and the second adds on brain-age (or related metrics), or vice versa,” we have already done this in the form of commonality analysis (Nimon et al., 2008) (see Figure 7 below). That is, to obtain unique and common effects of the regressors, we need to look at all of the possible changes in R2 when all possible subsets of regressors are excluded or included; see equations 12 and 13 below.

      From Methods:

      “Similar to the above multiple regression model, we had chronological age, each Brain Age index and Brain Cognition as the regressors for fluid cognition:

      $$
      \text{Fluid Cognition}_i = \beta_0 + \beta_1\,\text{Chronological Age}_i + \beta_2\,\text{Brain Age Index}_{i,j} + \beta_3\,\text{Brain Cognition}_i + \varepsilon_i \tag{12}
      $$

      Applying the commonality analysis here allowed us, first, to investigate the additive, unique effects of Brain Cognition, over and above chronological age and Brain Age indices. More importantly, the commonality analysis also enabled us to test the common, shared effects that Brain Cognition had with chronological age and Brain Age indices in explaining fluid cognition. We calculated the commonality analysis as follows (Nimon et al., 2017):

      $$
      \begin{aligned}
      \text{Unique Effect}_{\text{chronological age}} &= \Delta R^2_{\text{chronological age}} = R^2_{\text{chronological age, Brain Age index, Brain Cognition}} - R^2_{\text{Brain Age index, Brain Cognition}} \\
      \text{Unique Effect}_{\text{Brain Age index}} &= \Delta R^2_{\text{Brain Age index}} = R^2_{\text{chronological age, Brain Age index, Brain Cognition}} - R^2_{\text{chronological age, Brain Cognition}} \\
      \text{Unique Effect}_{\text{Brain Cognition}} &= \Delta R^2_{\text{Brain Cognition}} = R^2_{\text{chronological age, Brain Age index, Brain Cognition}} - R^2_{\text{chronological age, Brain Age index}} \\
      \text{Common Effect}_{\text{chronological age, Brain Age index}} &= R^2_{\text{chronological age, Brain Cognition}} + R^2_{\text{Brain Age index, Brain Cognition}} - R^2_{\text{Brain Cognition}} - R^2_{\text{chronological age, Brain Age index, Brain Cognition}} \\
      \text{Common Effect}_{\text{chronological age, Brain Cognition}} &= R^2_{\text{chronological age, Brain Age index}} + R^2_{\text{Brain Age index, Brain Cognition}} - R^2_{\text{Brain Age index}} - R^2_{\text{chronological age, Brain Age index, Brain Cognition}} \\
      \text{Common Effect}_{\text{Brain Age index, Brain Cognition}} &= R^2_{\text{chronological age, Brain Age index}} + R^2_{\text{chronological age, Brain Cognition}} - R^2_{\text{chronological age}} - R^2_{\text{chronological age, Brain Age index, Brain Cognition}} \\
      \text{Common Effect}_{\text{chronological age, Brain Age index, Brain Cognition}} &= R^2_{\text{chronological age}} + R^2_{\text{Brain Age index}} + R^2_{\text{Brain Cognition}} - R^2_{\text{chronological age, Brain Age index}} - R^2_{\text{chronological age, Brain Cognition}} \\
      &\quad - R^2_{\text{Brain Age index, Brain Cognition}} + R^2_{\text{chronological age, Brain Age index, Brain Cognition}}
      \end{aligned}
      \tag{13}
      $$
      ”
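
      For readers who want to see the bookkeeping, the three-regressor commonality decomposition can be sketched in plain Python. This is an illustration under assumptions, not the authors' implementation: `commonality_three` and the regressor labels are hypothetical, and the R2 values must be supplied by the user for every non-empty subset of regressors. The seven components sum to the R2 of the full model, which requires the full-model R2 to enter the three-way common effect with a positive sign.

```python
def commonality_three(r2, a="age", b="brain_age", c="brain_cog"):
    """Unique and common effects for three regressors.

    `r2` maps a frozenset of regressor names to the R2 of the model
    that contains exactly those regressors.
    """
    def R(*names):
        return r2[frozenset(names)]

    full = R(a, b, c)
    return {
        "unique_" + a: full - R(b, c),
        "unique_" + b: full - R(a, c),
        "unique_" + c: full - R(a, b),
        "common_" + a + "_" + b: R(a, c) + R(b, c) - R(c) - full,
        "common_" + a + "_" + c: R(a, b) + R(b, c) - R(b) - full,
        "common_" + b + "_" + c: R(a, b) + R(a, c) - R(a) - full,
        "common_all": (R(a) + R(b) + R(c)
                       - R(a, b) - R(a, c) - R(b, c) + full),
    }


# Made-up R2 values for illustration only; these are NOT results from the study.
example_r2 = {
    frozenset(["age"]): 0.30,
    frozenset(["brain_age"]): 0.25,
    frozenset(["brain_cog"]): 0.40,
    frozenset(["age", "brain_age"]): 0.32,
    frozenset(["age", "brain_cog"]): 0.45,
    frozenset(["brain_age", "brain_cog"]): 0.42,
    frozenset(["age", "brain_age", "brain_cog"]): 0.46,
}
effects = commonality_three(example_r2)
```

      With these made-up inputs, the unique effect of the third regressor is the full-model R2 minus the R2 of the model without it, and summing all seven entries of `effects` recovers the full-model R2.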

      (2) I agree that the solution is not to exclude age as a covariate, and that there is a big difference between inevitable and obvious. I simply think a further discussion of the inevitability of the results would be clarifying for the readers. There is a big opportunity in the brain-age literature to be as direct as possible about why you are finding what you are finding. People need to know not only what you found, but why you found what you found.

      Thank you. We agree that we needed to make this point more explicit and direct. In the revised manuscript, we added statements in both the Introduction and Discussion (see below) about the tight relationship between Brain Age and chronological age by design, which makes the small unique effects of Brain Age inevitable.

      Introduction:

      “Accordingly, by design, Brain Age is closely tied to chronological age. Because chronological age usually has a strong relationship with fluid cognition to begin with, it is unclear how much Brain Age adds to what is already captured by chronological age.”

      Discussion:

      “First, Brain Age itself did not add much more information to help us capture fluid cognition than what we had already known from a person’s chronological age. This can clearly be seen from the small unique effects of Brain Age indices in the multiple regression models having Brain Age and chronological age as the regressors. While the unique effects of some Brain Age indices from certain age-prediction models were statistically significant, they were all relatively small. Without Brain Age indices, chronological age by itself already explained around 32% of the variation in fluid cognition. Including Brain Age indices only added around 1.6% at best. We believe the small unique effects of Brain Age were inevitable because, by design, Brain Age is closely tied to chronological age. Therefore, chronological age and Brain Age captured mostly a similar variation in fluid cognition.

      Investigating the simple regression models and the commonality analysis between each Brain Age index and chronological age provided additional insights….”

      (3) I believe it is very important to critically examine the use of brain-age and related metrics. As part of this process, I think we should be asking ourselves the following questions (among others): Why go through age prediction? Wouldn't the predictions of cognition (or another variable) using the same set of brain features always be as good or better? You still have not justified the use of brain-age. As I said before, if you are going to continue to recommend the use of brain-age, you need a very strong argument for why you are recommending this. What does it truly add? Otherwise, temper your statements to indicate possible better paths forward.

      Thank you Reviewer 3 for making an argument against the use of Brain Age. We largely agree with you. However, our work only focuses on one phenotype, fluid cognition, and on the normative situation (i.e., not having a case vs control group). As Reviewer 2 pointed out, Brain Age might still have utility in other cases, not studied here. Still, future studies that focus on other phenotypes may consider using our approach as a template to test the utility of Brain Age in other situations. We added the conclusion statement to reflect this.

      From Discussion:

      “Altogether, we examined the utility of Brain Age as a biomarker for fluid cognition. Here are the three conclusions. First, Brain Age failed to add substantially more information over and above chronological age. Second, a higher ability to predict chronological age did not correspond to a higher utility to capture fluid cognition. Third, Brain Age missed up to around one-third of the variation in fluid cognition that could have been explained by brain MRI. Yet, given our focus on fluid cognition, future empirical research is needed to test the utility of Brain Age on other phenotypes, especially when Brain Age is used for anomaly detection in case-control studies (e.g., Bashyam et al., 2020; Rokicki et al., 2021). We hope that future studies may consider applying our approach (i.e., using the commonality analysis that includes predicted values from a model that directly predicts the phenotype of interest) to test the utility of Brain Age as a biomarker for other phenotypes.”

      References

      Bashyam, V. M., Erus, G., Doshi, J., Habes, M., Nasrallah, I. M., Truelove-Hill, M., Srinivasan, D., Mamourian, L., Pomponio, R., Fan, Y., Launer, L. J., Masters, C. L., Maruff, P., Zhuo, C., Völzke, H., Johnson, S. C., Fripp, J., Koutsouleris, N., Satterthwaite, T. D., … on behalf of the ISTAGING Consortium, the P. A. disease C., ADNI, and CARDIA studies. (2020). MRI signatures of brain age and disease over the lifespan based on a deep brain network and 14 468 individuals worldwide. Brain, 143(7), 2312–2324. https://doi.org/10.1093/brain/awaa160

      Butler, E. R., Chen, A., Ramadan, R., Le, T. T., Ruparel, K., Moore, T. M., Satterthwaite, T. D., Zhang, F., Shou, H., Gur, R. C., Nichols, T. E., & Shinohara, R. T. (2021). Pitfalls in brain age analyses. Human Brain Mapping, 42(13), 4092–4101. https://doi.org/10.1002/hbm.25533

      Cole, J. H. (2020). Multimodality neuroimaging brain-age in UK biobank: Relationship to biomedical, lifestyle, and cognitive factors. Neurobiology of Aging, 92, 34–42. https://doi.org/10.1016/j.neurobiolaging.2020.03.014

      Dubois, J., Galdi, P., Paul, L. K., & Adolphs, R. (2018). A distributed brain network predicts general intelligence from resting-state human neuroimaging data. Philosophical Transactions of the Royal Society B: Biological Sciences, 373(1756), 20170284. https://doi.org/10.1098/rstb.2017.0284

      Hahn, T., Fisch, L., Ernsting, J., Winter, N. R., Leenings, R., Sarink, K., Emden, D., Kircher, T., Berger, K., & Dannlowski, U. (2021). From ‘loose fitting’ to high-performance, uncertainty-aware brain-age modelling. Brain, 144(3), e31–e31. https://doi.org/10.1093/brain/awaa454

      Insel, T., Cuthbert, B., Garvey, M., Heinssen, R., Pine, D. S., Quinn, K., Sanislow, C., & Wang, P. (2010). Research Domain Criteria (RDoC): Toward a New Classification Framework for Research on Mental Disorders. American Journal of Psychiatry, 167(7), 748–751. https://doi.org/10.1176/appi.ajp.2010.09091379

      Jirsaraie, R. J., Kaufmann, T., Bashyam, V., Erus, G., Luby, J. L., Westlye, L. T., Davatzikos, C., Barch, D. M., & Sotiras, A. (2023). Benchmarking the generalizability of brain age models: Challenges posed by scanner variance and prediction bias. Human Brain Mapping, 44(3), 1118–1128. https://doi.org/10.1002/hbm.26144

      Marquand, A. F., Rezek, I., Buitelaar, J., & Beckmann, C. F. (2016). Understanding Heterogeneity in Clinical Cohorts Using Normative Models: Beyond Case-Control Studies. Biological Psychiatry, 80(7), 552–561. https://doi.org/10.1016/j.biopsych.2015.12.023

      Nimon, K., Lewis, M., Kane, R., & Haynes, R. M. (2008). An R package to compute commonality coefficients in the multiple regression case: An introduction to the package and a practical example. Behavior Research Methods, 40(2), 457–466. https://doi.org/10.3758/BRM.40.2.457

      Pat, N., Wang, Y., Anney, R., Riglin, L., Thapar, A., & Stringaris, A. (2022). Longitudinally stable, brain‐based predictive models mediate the relationships between childhood cognition and socio‐demographic, psychological and genetic factors. Human Brain Mapping, hbm.26027. https://doi.org/10.1002/hbm.26027

      Poldrack, R. A., Huckins, G., & Varoquaux, G. (2020). Establishment of Best Practices for Evidence for Prediction: A Review. JAMA Psychiatry, 77(5), 534–540. https://doi.org/10.1001/jamapsychiatry.2019.3671

      Rasero, J., Sentis, A. I., Yeh, F.-C., & Verstynen, T. (2021). Integrating across neuroimaging modalities boosts prediction accuracy of cognitive ability. PLOS Computational Biology, 17(3), e1008347. https://doi.org/10.1371/journal.pcbi.1008347

      Rokicki, J., Wolfers, T., Nordhøy, W., Tesli, N., Quintana, D. S., Alnæs, D., Richard, G., de Lange, A.-M. G., Lund, M. J., Norbom, L., Agartz, I., Melle, I., Nærland, T., Selbæk, G., Persson, K., Nordvik, J. E., Schwarz, E., Andreassen, O. A., Kaufmann, T., & Westlye, L. T. (2021). Multimodal imaging improves brain age prediction and reveals distinct abnormalities in patients with psychiatric and neurological disorders. Human Brain Mapping, 42(6), 1714–1726. https://doi.org/10.1002/hbm.25323

      Sripada, C., Angstadt, M., Rutherford, S., Taxali, A., & Shedden, K. (2020). Toward a “treadmill test” for cognition: Improved prediction of general cognitive ability from the task activated brain. Human Brain Mapping, 41(12), 3186–3197. https://doi.org/10.1002/hbm.25007

      Tetereva, A., Li, J., Deng, J. D., Stringaris, A., & Pat, N. (2022). Capturing brain‐cognition relationship: Integrating task‐based fMRI across tasks markedly boosts prediction and test‐retest reliability. NeuroImage, 263, 119588. https://doi.org/10.1016/j.neuroimage.2022.119588

      Vieira, B. H., Pamplona, G. S. P., Fachinello, K., Silva, A. K., Foss, M. P., & Salmon, C. E. G. (2022). On the prediction of human intelligence from neuroimaging: A systematic review of methods and reporting. Intelligence, 93, 101654. https://doi.org/10.1016/j.intell.2022.101654

    1. Author response:

      The following is the authors’ response to the previous reviews.

      eLife assessment

      This study presents valuable data on the antigenic properties of neuraminidase proteins of human A/H3N2 influenza viruses sampled between 2009 and 2017. The antigenic properties are found to be generally concordant with genetic groups. Additional analyses have strengthened the revised manuscript, and the evidence supporting the claims is solid.

      Public Reviews:

      Reviewer #1 (Public Review):

      Summary

      The authors investigated the antigenic diversity of recent (2009-2017) A/H3N2 influenza neuraminidases (NAs), the second major antigenic protein after haemagglutinin. They used 27 viruses and 43 ferret sera and performed NA inhibition. This work was supported by a subset of mouse sera. Clustering analysis determined 4 antigenic clusters, mostly in concordance with the genetic groupings. Association analysis was used to estimate important amino acid positions, which were shown to be more likely close to the catalytic site. Antigenic distances were calculated and a random forest model used to determine potential important sites.

      This revision has addressed many of my concerns of inconsistencies in the methods, results and presentation. There are still some remaining weaknesses in the computational work.

      Strengths

      (1) The data cover recent NA evolution and a substantial number (43) of ferret (and mouse) sera were generated and titrated against 27 viruses. This is laborious experimental work and is the largest publicly available neuraminidase inhibition dataset that I am aware of. As such, it will prove a useful resource for the influenza community.

      (2) A variety of computational methods were used to analyse the data, which give a rounded picture of the antigenic and genetic relationships and link between sequence, structure and phenotype.

      (3) Issues raised in the previous review have been thoroughly addressed.

      Weaknesses

      (1) Some inconsistencies and missing data in experimental methods

      Two ferret sera were boosted with H1N2, while recombinant NA protein was used for the others. This, and the underlying reason, are clearly explained in the manuscript. The authors note that boosting with live virus did not increase titres. Additionally, one homologous serum (A/Kansas/14/2017) was not generated, although this would not necessarily have impacted the results.

      We agree with the reviewer and this point was addressed in the previous rebuttal.

      (2) Inconsistency in experimental results

      Clustering of the NA inhibition results identifies three viruses which do not cluster with their phylogenetic group. Again, this is clearly pointed out in the paper and is consistent with the two replicate ferret sera. Additionally, A/Kansas/14/2017 is in a different cluster based on the antigenic cartography vs the clustering of the titres.

      We agree with the reviewer and this point was addressed in the previous rebuttal.

      (3) Antigenic cartography plot would benefit from documentation of the parameters and supporting analyses

      a. The number of optimisations used

      We used 500 optimizations. This information is now included in the Methods section.

      b. The final stress and the difference between the stress of the lowest few (e.g. 5) optimisations, or alternatively a graph of the stress of all the optimisations. Information on the stress per titre and per point, and whether any of these were outliers

      The stress was obtained from 1, 5, 500, or even 5000 optimizations (resulting in stress values of 1366.47, 1366.47, 2908.60, and 3031.41, respectively). Apart from limited variation or non-convergence of the stress values after optimization, the maps obtained were consistent across multiple runs. The final map was obtained by keeping the best optimization (stress value 1366.47, selected using the keepBestOptimization() function).

      Author response image 1.

      The stress per point is presented in the heat map below.

      The heat map indicates stress per serum (x-axis) and strain (y-axis) in blue to red scale.
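
      As an illustration of what a per-titre stress value represents, the sketch below computes stress in the way it is commonly defined for antigenic maps: each log2 titre is converted to a target distance (the serum's highest log2 titre minus the observed one), and stress is the squared difference between that target and the Euclidean distance on the map. This is a minimal Python sketch rather than the authors' R workflow; the function names, the toy titres and the coordinates are hypothetical, and thresholded titres (e.g. "<10") are ignored here.

```python
import math


def table_distances(log_titres):
    """Turn a log2 titre table (rows = antigens, columns = sera) into
    target distances: the column maximum minus each titre.
    None marks a titre that was not measured."""
    n_sera = len(log_titres[0])
    col_max = [
        max(row[j] for row in log_titres if row[j] is not None)
        for j in range(n_sera)
    ]
    return [
        [None if row[j] is None else col_max[j] - row[j] for j in range(n_sera)]
        for row in log_titres
    ]


def stress_per_titre(log_titres, antigen_xy, serum_xy):
    """Return (per_titre, total), where per_titre[i][j] is the squared
    difference between the target distance and the map distance for
    antigen i and serum j, and total is the sum over measured titres."""
    target = table_distances(log_titres)
    per_titre = []
    total = 0.0
    for i, row in enumerate(target):
        out = []
        for j, d in enumerate(row):
            if d is None:
                out.append(None)  # unmeasured titres contribute no stress
                continue
            map_d = math.dist(antigen_xy[i], serum_xy[j])
            s = (d - map_d) ** 2
            out.append(s)
            total += s
        per_titre.append(out)
    return per_titre, total


if __name__ == "__main__":
    # Hypothetical 2 x 2 example: log2 titres and 2-D map coordinates.
    titres = [[8, 5], [6, 7]]
    antigens = [(0.0, 0.0), (2.0, 0.0)]
    sera = [(0.0, 0.0), (3.0, 0.0)]
    cells, total = stress_per_titre(titres, antigens, sera)
    print(cells, total)
```

      Summing the per-titre values over a row or a column gives the stress attributable to one strain or one serum, which is the kind of quantity a heat map like the one above displays.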

      c. A measure of uncertainty in position (e.g. from bootstrapping)

      Bootstrap was performed using 1000 repeats and 100 optimizations per repeat. The uncertainty is represented in the blob plot below.

      Author response image 2.

      (4) Random forest

      The full dataset was used for the random forest model, including tuning the hyperparameters. It is more robust to have a training and test set to be able to evaluate overfitting (there are 25 features to classify 43 sera).

      Explicit cross-validation is not necessary for random forests, as the out-of-bag process across multiple trees implicitly covers cross-validation. In the randomForest function in R, each tree is grown on a bootstrap sample of the data, drawn with replacement (the same observation can be sampled multiple times), and the mtry argument sets the number of variables randomly sampled as candidates at each split. RF then automatically uses the observations that were not drawn for a given tree as the test set for that tree. Overfitting may happen when all data are used for training, but the RF method implicitly does use a test set and does not use all data for training each tree.

      Code:

      library(randomForest)
      rf <- randomForest(X, y = Y, ntree = 1500, mtry = 25, keep.forest = TRUE, importance = TRUE)
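
      To make the out-of-bag argument concrete, the following sketch simulates the resampling step on its own. It is written in Python with the standard library rather than in R, and it does not reproduce the randomForest internals: it only draws bootstrap samples of the observations and counts how many are left out of each. The sample size (43) and the number of trees (1500) are taken from the text above; the function names and the seed are hypothetical. With 25 features and mtry = 25, every variable is a candidate at each split, so the held-out data come from this resampling of observations and not from mtry.

```python
import random


def expected_oob_fraction(n_samples):
    """Probability that a given observation is never drawn in n_samples
    draws with replacement: (1 - 1/n)^n, which approaches 1/e (about 0.368)."""
    return (1 - 1 / n_samples) ** n_samples


def oob_fraction(n_samples, n_trees, seed=0):
    """Simulate one bootstrap sample per tree and return the mean fraction
    of observations that are out-of-bag (never drawn) for a tree."""
    rng = random.Random(seed)
    fractions = []
    for _ in range(n_trees):
        # Indices drawn with replacement; the set keeps the unique ones.
        in_bag = {rng.randrange(n_samples) for _ in range(n_samples)}
        fractions.append(1 - len(in_bag) / n_samples)
    return sum(fractions) / n_trees


if __name__ == "__main__":
    print(expected_oob_fraction(43))
    print(oob_fraction(43, 1500))
```

      Roughly one-third of the observations are therefore unseen by any given tree, and the out-of-bag error is computed for each observation only from the trees that did not see it.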

      Reviewer #2 (Public Review):

      Summary:

      The authors characterized the antigenicity of N2 protein of 43 selected A(H3N2) influenza A viruses isolated from 2009-2017 using ferret and mice immune sera. Four antigenic groups were identified, which the authors claimed to be correlated with their respective phylogenetic/genetic groups. Among 102 amino acids that differed among the 44 selected N2 proteins, the authors identified residues that differentiate the antigenicity of the four groups and constructed a machine-learning model that provides antigenic distance estimation. Three recent A(H3N2) vaccine strains were tested in the model but there was no experimental data to confirm the model prediction results.

      Strengths:

      This study used N2 protein of 44 selected A(H3N2) influenza A viruses isolated from 2009-2017 and generated corresponding panels of ferret and mouse sera to react with the selected strains. The amount of experimental data for N2 antigenicity characterization is large enough for model building.

      Weaknesses:

      The main weakness is that the strategy of selecting 43 A(H3N2) viruses from 2009-2017 was not explained. It is not clear if they represent the overall genetic diversity of human A(H3N2) viruses circulating during this time. In response to the reviewer's comment, the authors have provided a N2 phylogenetic tree using 180 randomly selected N2 sequences from human A(H3N2) viruses from 2009-2017. While the 43 strains seem to scatter across the N2 tree, the four antigenic groups described by the authors did not correlate with their respective phylogenetic/genetic groups as shown in Fig. 2. The authors should show the N2 phylogenetic tree together with Fig. 2 and discuss the discrepancy observed.

      The discrepancies between Figure 2 and the provided N2 phylogenetic tree built from 180 selected N2 sequences were primarily due to visualization. In the tree presented in Figure 2, the phylogeny was ordered by decreasing branch length. Further, the tree presented in the rebuttal was built with PhyML 3.0 using the JTT substitution model, while the tree in Figure 2 was built in CLC Workbench 21.0.5 using the Bishop-Friday substitution model. The tree below was built using the same methodology as Figure 2, including branch size ordering. No discrepancies are observed.

      Phylogenetic tree representing relatedness of N2 head domain. N2 NA sequences were ordered according to the branch length and phylogenetic clusters are colored as follows: G1: orange, G2: green, G3: blue, and G4: purple. NA sequences that were retained in the breadth panel are named according to the corresponding H3N2 influenza viruses. The other NA sequences are coded.

      Author response image 3.

      The second weakness is the use of double-immune ferret sera (post-infection plus immunization with recombinant NA protein) or mouse sera (immunized twice with recombinant NA protein) to characterize the antigenicity of the selected A(H3N2) viruses. Conventionally, NA antigenicity is characterized using ferret sera after a single infection. Repeated influenza exposure in ferrets has been shown to enhance antibody binding affinity and may affect the cross-reactivity to heterologous strains (PMID: 29672713). The increased cross-reactivity is supported by the NAI titers shown in Table S3, as many of the double-immune ferret sera showed the highest reactivity not against their own homologous virus but against heterologous strains. In response to the reviewer's comment, the authors agreed that the use of double-immune ferret sera may be a limitation of the study. It would be helpful if the authors could discuss in the manuscript the potential effect of the use of double-immune ferret sera on antigenicity characterization.

      Our study was designed to understand the breadth of the anti-NA response after the incorporation of NA as a vaccine antigen. Our data do not allow us to conclude whether the increased breadth of protection is merely due to increased antibody titers or whether an NA boost immunization was able to induce antibody responses against epitopes that were not previously recognized by the primary response to infection. However, we now mention this possibility in the discussion and cite Kosikova et al. CID 2018 in this context.

      Another weakness is that the authors used the newly constructed model to predict the antigenic distance of three recent A(H3N2) viruses but there is no experimental data to validate their prediction (e.g. whether these viruses are indeed antigenically deviating from group 2 strains as concluded by the authors). In response to the comment, the authors have taken two strains out of the dataset and used them for validation. The results are shown in Fig. R7. However, it may be useful to include this in the main manuscript to support the validity of the model.

      The removal of 2 strains was performed to illustrate the predictive performance of the RF modeling. However, Random Forest does not require cross-validation. The reason is that RF modeling already uses an out-of-bag evaluation which, in short, consists of using only a fraction of the data (about 2/3) for the creation of each decision tree, obviating the need for a set-aside test set:

      “…In each bootstrap training set, about one-third of the instances are left out. Therefore, the out-of-bag estimates are based on combining only about one-third as many classifiers as in the ongoing main combination. Since the error rate decreases as the number of combinations increases, the out-of-bag estimates will tend to overestimate the current error rate. To get unbiased out-of-bag estimates, it is necessary to run past the point where the test set error converges. But unlike cross-validation, where bias is present but its extent unknown, the out-of-bag estimates are unbiased…” from https://www.stat.berkeley.edu/%7Ebreiman/randomforest2001.pdf

      Reviewer #3 (Public Review):

      Summary:

      This paper by Portela Catani et al examines the antigenic relationships (measured using monotypic ferret and mouse sera) across a panel of N2 genes from the past 14 years, along with the underlying sequence differences and phylogenetic relationships. This is a highly significant topic given the recent increased appreciation of the importance of NA as a vaccine target, and the relative lack of information about NA antigenic evolution compared with what is known about HA. Thus, these data will be of interest to those studying the antigenic evolution of influenza viruses. The methods used are generally quite sound, though there are a few addressable concerns that limit the confidence with which conclusions can be drawn from the data/analyses.

      Strengths:

      • The significance of the work, and the (general) soundness of the methods.

      • Explicit comparison of results obtained with mouse and ferret sera.

      Weaknesses:

      • Approach for assessing influence of individual polymorphisms on antigenicity does not account for potential effects of epistasis (this point is acknowledged by the authors).

      We agree with the reviewer and this point was addressed in the previous rebuttal.

      • Machine learning analyses neither experimentally validated nor shown to be better than simple, phylogenetic-based inference.

      We respectfully disagree with the reviewer. This point was addressed in the previous rebuttal as follows.

      “This is a valid remark and indeed we have found a clear correlation between NAI cross-reactivity and phylogenetic relatedness. However, besides achieving good prediction of the experimental data (as shown in Figure 5 and in Figure R7), machine learning analysis has the potential to rank or indicate major antigenic divergences based on available sequences before they have consolidated as a new clade. ML can also support the selection and design of more broadly reactive antigens.”

      Recommendations for the authors:

      Reviewer #2 (Recommendations For The Authors):

      (1) Discuss the discrepancy between Fig. 2 and the newly constructed N2 phylogenetic tree with 180 randomly selected N2 sequences of A(H3N2) viruses from 2009-2017. Specifically, please explain why the antigenic vs. phylogenetic relationship observed in Fig. 2 was not observed in the large N2 phylogenetic tree.

      The discrepancies were due to differences in method and visualization. A new tree was provided.

      (2) Include a sentence to discuss the potential effect of the use of double-immune ferret sera on antigenic characterization.

      We prefer not to speculate on this.

      (3) Include the results of the exercise run (with the use of Swe17 and HK17) in the manuscript as a way to validate the model.

      The exercise was performed to illustrate the predictive potential of the RF modeling to the reviewer. However, cross-validation is not a usual requirement for random forests, since they use out-of-bag calculations. We prefer not to include the exercise runs within the main manuscript.

    1. Author Response

      The following is the authors’ response to the original reviews.

      Reviewer #1 (Recommendations for The Authors):

      To help support the conclusions of the manuscript more strongly, I am including a series of concerns regarding the experiments, as well as some recommendations that could be followed to address these issues:

      (1) The Q-nMT bundle is largely unaffected by the nocodazole treatment in most phases during its formation. However, cells were only treated with nocodazole for a very short period of time (15 min). Have the authors analyzed Q-nMT stability after longer nocodazole exposures? Is a similar treatment enough to depolymerize the mitotic spindle? This result could be further substantiated by treatment with other MT-depolymerizing agents. Furthermore, the dynamicity of the Q-nMT bundle could be ideally also assessed by other techniques, such as FRAP.

      The experiments suggested by the reviewer have been published in our previous paper (Laporte et al, JCB 2013). In this previous study, we presented data demonstrating the resistance of the Q-nMT bundle to several MT poisons: TBZ, benomyl, MBC (Sup Fig 2D) and to an increasing amount of nocodazole after a 90 min treatment (Sup Fig2E). These published figures are provided below.

      Author response image 1.

      The nMT array contains highly stable MTs. (A) Variation of nuclear MT length as a function of time (seconds) in proliferating cells. Cells express GFP-Tub1 (green) and Nup2-RFP (red). Bars, 2 µm. N = 1, n is indicated. (B) Variation of the nMT array length as a function of time measured for Bim1-GFP-expressing cells (n = 16), for 6-d-old Dad2-GFP-expressing cells (n = 17), for Stu2-GFP-expressing cells (n = 17), and for 6-d-old Nuf2-GFP-expressing cells (n = 17). Examples of corresponding time lapses are shown. Time is in minutes. Bar, 2 µm. (C) Nuf2-GFP dots detected along the nMT array (arrow) are immobile. Several time-lapse images of cells are shown. Time is in minutes. Bar, 2 µm. (D) MT organizations in proliferating cells and 4-d-old quiescent cells before and after a 90-min treatment with the indicated drugs. Bar, 2 µm. (E) MT organizations in 5-d-old quiescent cells before and after a 90-min treatment with increasing concentrations of nocodazole.

      In the same article, we showed that Q-nMT bundles resist a 3h nocodazole treatment, while all MT structures assembled in proliferating cells, including mitotic spindle, vanished (see Fig 2E below). In addition, in our previous article, FRAP experiments were provided in Fig 2D.

      Author response image 2.

      The nuclear array is composed of stable MTs. Variation of the length as a function of time of (A) aMTs in proliferating cells, (B) the nMT array in quiescent cells (7 d), and (C) the two MT structures in early quiescent cells (4 d). White arrows point to dynamic aMTs. In A-C, N = 2, n is indicated. (D) FRAP on 7-d-old quiescent cells. White arrows point to bleached areas. Error bars are SEM. In A-D, time is in seconds. (E) The nMT array is not affected by nocodazole treatment. Before and at various times after carbon exhaustion (red dashed line), cells were incubated for 3 h with 22.5 µg/µL nocodazole and then imaged. The corresponding control experiment is shown in Fig 1 A. In all panels, cells expressing GFP-Tub1 (green) and Nup2-RFP (red) are shown; bars, 2 µm.

      This previous study was mentioned in the introduction and is now re-cited at the beginning of the results section (line 107-108).

      As expected from our previous study, when proliferating cells were treated with Noc (30 µg/ml) in the same conditions as in Fig1, most of the short and the long mitotic spindles vanished after a 15 min treatment as shown in the graph below.

      Author response image 3.

      Proliferating cells expressing Nuf2-GFP and mTQZ-Tub1 (OD ~2) were treated or not with Noc (30 µg/ml) for 15 min. The % of cells with detectable MTs and representative cells are shown. Khi-test values are indicated. Bar: 2 µm.

      (2) The graph in Figure 1B is somewhat confusing. Is the X-axis really displaying the length of the MTs as stated in the legend? If so, one would expect to see a displacement of the average MT length of the population as cells progress from phase II to phase III, as previously demonstrated in Figure 1A. Likewise, no data points would be anticipated for those phases in which the MT length is 0 or close to 0. Moreover, when the length of half pre-anaphase mitotic spindle was measured as a control, how can one get MT lengths that are equal or close to 0 in these cells? The length of the pre-anaphase spindle is between 2-4 µm, so MT length values should range from 1 to 2 µm if half the spindle is measured.

      The graph in Fig1B represents the fluorescence intensity (a proxy for the Q-nMT bundle thickness) along the Q-nMT bundle length.

      Fluorescence intensity is measured along a “virtual line” that starts 0.5 µm before the extremity of the Q-nMT bundle that is in contact with the SPB. In other words, we aligned all intensity measurements at the onset of the fluorescence increase on the SPB side. We arbitrarily set the ‘zero’ 0.5 µm before this onset. That is why the fluorescence intensity is zero between 0 and 0.5 µm. The X-axis represents this virtual line, the 0 being set 0.5 µm before the Q-nMT bundle extremity on the SPB side. This virtual line allows us to standardize our “thickness” measurements for all Q-nMT bundles.
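
      The alignment described above could be sketched as follows. This is a hypothetical Python illustration, not the authors' analysis script: the function name, the pixel size and the use of a simple intensity threshold to detect the onset are all assumptions, since the response does not state how the onset was determined.

```python
def align_profile(intensity, pixel_size_um, threshold, offset_um=0.5):
    """Shift an intensity profile so that x = 0 lies offset_um before the
    first sample whose intensity exceeds threshold (the assumed onset).

    Returns (positions_um, values); the onset sample lands at offset_um.
    """
    onset = next((i for i, v in enumerate(intensity) if v > threshold), None)
    if onset is None:
        raise ValueError("no sample exceeds the threshold")
    # Number of samples that span the offset before the onset.
    n_offset = round(offset_um / pixel_size_um)
    start = onset - n_offset
    if start >= 0:
        values = list(intensity[start:])
    else:
        # Profile starts too close to the onset: pad with zeros in front.
        values = [0.0] * (-start) + list(intensity)
    positions = [k * pixel_size_um for k in range(len(values))]
    return positions, values


if __name__ == "__main__":
    # Hypothetical line scan sampled every 0.1 µm.
    scan = [0, 0, 0, 0, 0, 0, 0, 5, 9, 9, 4, 0]
    x, y = align_profile(scan, pixel_size_um=0.1, threshold=1)
    print(list(zip(x, y)))
```

      Once every profile is shifted in this way, intensities from different bundles can be averaged position by position, and the position at which each averaged curve returns to baseline reflects bundle length.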

      Using this standardization, it is clear that the length of the Q-nMT bundles increased from phase II to III (see the red arrow). Yet, because Q-nMT bundles are not yet stable in phase II, they are shorter after a Noc treatment than in untreated phase II cells (compare the end of the orange line and the end of the blue line in phase II).

      Author response image 4.

      This is now explained in detail in the Material and Methods section (line 539-545).

      The same applies to the inset of Fig 1B and to Sup Fig 1A, in which we measured fluorescence intensity along the half mitotic spindle just as we did for the MT bundle. The X-axis represents a virtual line along the mitotic spindle, starting 0.5 µm before the SPB spindle extremity.

      Author response image 5.

      (3) Microtubules seem to locate next to or to extend beyond the nucleus in the control cells (DMSO) in Figure 1H. Since both nuclear MTs and cytoplasmic MTs emanate from the SPBs, it would have been desirable to display the morphology of the nucleus when possible. Moreover, since the nucleus is a tridimensional structure, it would also be advisable to image different Z-sections.

      Analyses demonstrating that Q-nMT bundles are located inside the nucleus have been provided in our previous paper (Laporte et al, JCB 2013). In that article, most of the images are maximal projections of Z-stacks in which the nuclear envelope is visualized via Nup2-RFP (see Fig 1 of Laporte et al, JCB 2013 as an example below).

      Author response image 6.

      MTs are organized as a nuclear array in quiescent cells. (A) MT reorganization upon quiescence entry. Cells expressing GFP-Tub1 (green) and Nup2-RFP (red) are shown. Glucose exhaustion is indicated as a red dashed line. Quiescent cells expressing Tub1-RFP and either Spc72-GFP,

      In Laporte et al, JCB 2013, we also provided EM analysis both in cryo and immune-gold (Fig 1E below).

      Author response image 7.

      (top) or coexpressed with Tub1-RFP (bottom). Arrows point to dots along the nMT array. Bars: (A-C) 2 µm. (E) nMT array visualized in WT cells by EM. Yellow arrows, MTs; red arrowheads, nuclear membrane; pink arrow, SPB. Insets: nMT cut transversally. Bar, 100 nm.

      (4) Movies depicting the process of Q-nMT bundle formation in live cells would have been really informative to more precisely evaluate the MT dynamics. Likewise, together with still images (Fig 1D and Supp. Fig. 1D), movies depicting the changes in the localization of Nuf2-GFP would have further facilitated the analysis of this process.

      In a new Sup Fig 1E, we now provide images of Q-nMT bundle formation initiation in phase I, in which it can be observed that Nuf2-GFP accompanies the growth of MT (mTQZ-TUB1) at the onset of Q-nMT bundle formation. Unfortunately, it is technically very challenging to follow the entire process of Q-nMT bundle formation in individual cells, as it takes > 48h. Indeed, for movies longer than 24h, on both microscope pads or specific microfluidic devices (Jacquel, et al, eLife 2021), phototoxicity and oxygen availability become problematic and affect cells’ viability.

      (5) Western blot images displaying the relative protein levels for mTQZ-Tub1 and of the ADH2 promoter-driven mRuby-Tub1 at the different time points should be included to more strongly support the conclusion that new tubulin molecules are introduced in the Q-nMT bundle only after phase I. It is worth noting, in this sense, that the percentage of cells with 2 colors Q-nMT bundle is analyzed only 1 hour after expression of mRuby-Tub1 was induced for phase I cells, but after 24 hours for phase II cells.

      We have modified Fig 1F and now provide images of cells 3, 6 and 24 h after glucose exhaustion, together with the corresponding percentage of cells displaying a Q-nMT bundle with the two colors. We also now provide a western blot in Sup Fig 1H using specific antibodies against mTQZ (anti-GFP) and mRuby (anti-RFP).

      (6) In order to demonstrate that Q-nMT formation is an active process induced by a transient signal and that the Q-nMT bundle is required for cell survival, the authors treated cells with nocodazole for 24 h (Fig 1H and Supp Fig 1K). Both events, however, could be associated with the toxic effects of the extremely prolonged nocodazole treatment leading to cell death.

      We have treated 5-day-old cells for 24 h with 30 µg/ml Noc. We then washed out the drug and transferred the cells into a glucose-free medium. We then followed both cell survival, using methylene blue, and the cells' capacity to form a colony after refeeding. In these conditions, we did not observe any toxic effect of the nocodazole. This result is now provided in Sup Fig 1L and discussed in lines 172-176.

      (7) The "Tub1-only" mutant displays shorter but stable Q-nMT bundles in phase II, although they are thinner than in wild-type cells. What happens in the "Tub3-only" mutant, which also has beta-tubulin levels similar to wild-type cells (Supp. Fig. 2B)?

      In order to measure Q-nMT bundle length and thickness, we used Tub1 fused to GFP. This cannot be done in a Tub3-only mutant. Yet, we have measured Q-nMT bundle length in Tub3-only cells using Bim1-3GFP as a MT marker (as in Laporte et al, JCB 2013). As shown in the figure below, Q-nMT bundles were shorter in Tub3-only cells than in WT cells whatever the phase.

      Author response image 8.

      We do not know if this effect is directly linked to the absence of Tub1 or if it is very indirect and for example due to the fact that Tub1 and Tub3 interact differently with Bim1 or other proteins that are involved in Q-nMT bundle stabilization. As we cannot give a clear interpretation for that result, we decided not to present those data in our manuscript.

      (8) Why were wild-type and ndc80-1 cells imaged after a 20 min nocodazole treatment to evaluate the role of KT-MT attachments in Q-nMT bundle formation (Fig 3A)? Importantly, this experiment is also missing a control in which Q-nMT length is analyzed in both wild-type and ndc80-1 cells at 25ºC instead of 37ºC.

      In this experiment, we used nocodazole to test both the formation and the stability of the Q-nMT bundle. Fig 3A shows MT length distribution in WT (grey) and ndc80-1 (violet) cells expressing mTQZ-Tub1 (green) and Nuf2-GFP (red), shifted to 37 °C at the onset of glucose exhaustion and kept at this non-permissive temperature for 12 or 96 h then treated with Noc. The control experiment was provided in Sup Fig 3B. Indeed, this figure shows MT length in WT (grey) and ndc80-1 (violet) cells expressing mTQZ-Tub1 (green) and Nuf2-GFP (red) grown for 4 d (96 h) at 25 °C, and treated or not with Noc. This is now indicated in the text line 216 and in the figure legend line 976.

      Author response image 9.

      (9) As a general comment linked to the previous concern, it is striking that in many instances, Q-nMT bundle length is measured after nocodazole treatment without any evident reason to do this and without displaying the results in untreated cells as a control. If nocodazole is used, the authors should explicitly indicate it and state the reason for it.

      We provide control experiments without nocodazole for all of the figures. For the sake of figure clarity, for Fig.3A the control without the drug is in Sup. Fig. 3B, for Fig. 3B it is shown in Sup. Fig. 3D, for Fig. 4B, it is shown in Sup. Fig 4A. This is now stated in the text and in the figure legend: for Fig. 3A: line 216 and in the figure legend line 976; for Fig. 3B: line 222 and figure legend line 984; for Fig. 4B: line 280 and in the figure legend line 1017.

      The only figure in which untreated cells are not shown is Fig 1D, since the goal of the experiment is to make dynamic MTs shorten.

      In Fig. 5C and Sup. Fig. 5D to F, we used nocodazole to get rid of dynamic cytoplasmic MTs that form upon quiescence exit in order to facilitate Q-nMT bundle measurement. This was explained in our previous study (Laporte et al, JCB 2013). We now mention it in the figure legends, see for example Fig. 5 legend line 1054.

      (10) Ipl1 inactivation using the ipl1-1 thermosensitive allele impedes Q-nMT bundle formation. The inhibitor-sensitive ipl1-as1 allele could have been further used to show whether this depends on its kinase activity, also avoiding the need to increase the temperature, which affects MT dynamics.

      As suggested, we have used the ipl1-as5 allele. We have thus modified Fig 3B and now show that it is indeed the Ipl1 kinase activity that is required for Q-nMT bundle formation initiation (line 222).

      In any case, it is surprising that deletion of SLI15 does not affect Q-nMT formation (in fact, MT length is even larger), despite the fact that Sli15, which localizes and activates Ipl1, is present at the Q-nMT (Fig 3C). Likewise, deletion of BIR1 has barely any effect on MT length after 4 days in quiescence (Fig 3D). Do the previous observations mean that the role of Ipl1 is CPC-independent? Does the lack of Sli15 or Bir1 aggravate the defect in Q-nMT formation of ipl1-1 cells at non-permissive or semi-permissive temperature?

      Thanks to the Reviewer’s comments, we have re-checked our sli15Δ strain and found that it was accumulating suppressors very rapidly. To circumvent this problem, we utilized the previously described sli15-3 strain (Kim et al, JCB 1999). We found that sli15-3 was synthetic lethal with both ipl1-1, ipl1-2 (as described in Kim et al, JCB 1999) and with ipl1-as5, preventing us from addressing the CPC dependence of the Ipl1 effect asked by the Reviewer. However, using the sli15-3 strain, we now show that inactivation of Sli15 upon glucose exhaustion does prevent Q-nMT bundle formation (See new Sup Fig 3F and the text line 226-227).

      (11) Lack of both Bir1 and Bim1 act in a synergistic way with regard to the defect in Q-nMT bundle formation. Although the absence of both Sli15 and Bim1 is proposed to lead to a similar defect, this is not sustained by the data provided, particularly in the absence of nocodazole treatment (Supp. Fig 3E).

      Deletion of BIR1 alone has only a subtle effect on Q-nMT bundle length in the absence of Noc, yet in bir1Δ cells, Q-nMT bundles are sensitive to Noc. Deletion of BIM1 (bim1Δ) aggravates this phenotype (Fig. 3D). As mentioned above, Q-nMT bundle formation is impaired in sli15-3 cells. In our hands, and as expected from (Zimniak et al, Curr Biol 2012), this allele is synthetic lethal with bim1Δ.

      On the other hand, the simultaneous lack of Bir1 and Bim1 drastically reduces the viability of cells in quiescence and this is proposed to be evidence supporting that KT-MT attachments are critical for Q-nMT bundle assembly (Supp Fig 3G). However, similarly to what was indicated previously for the 24 h nocodazole treatment, here again, the lack of viability could arise from other causes that are associated with the lack of Bir1 and Bim1 and not necessarily with problems in Q-nMT formation. In fact, the viability defect of cells lacking Bir1 and Bim1 is similar to that of cells only lacking Bir1 (Supp Fig 3G).

      We have previously shown that many mutants impaired for Q-nMT bundle formation (dyn1Δ, nip100Δ etc) have a reduced viability in quiescence (Laporte et al, JCB 2013). In the current study, a very strong phenotype is observed for other mutants impaired for Q-nMT bundle formation such as bim1Δ bir1Δ cells, but also for slk19Δ bim1Δ.

      Importantly, as shown in the new Sup Fig 1L, WT cells treated with Noc upon entry into quiescence, a treatment that prevents Q-nMT formation, showed reduced viability, while a Noc treatment that does not affect Q-nMT bundle formation, i.e. a treatment in late quiescence, has no effect on cell survival. This solid set of data points to a clear correlation between the ability of cells to assemble a Q-nMT bundle and their ability to survive in quiescence. Yet, of course, we cannot formally exclude that in all these mutants, the reduction of cell viability in quiescence is due to another reason.

      (12) Both Mam1 and Spo13 are, to my knowledge, meiosis-specific proteins. It is therefore surprising that mutants in these proteins have an effect on MT bundle formation (Fig 3G-H, Supp. Fig. 3G). Are Mam1 and Spo13 also expressed during quiescence? Transcription of MAM1 or SPO13 does not seem to be induced by glucose depletion in previously published microarray experiments, but if Mam1 and Spo13 are expressed in quiescent cells, the authors should show this together with their results.

      Indeed, it is interesting to note that Mam1 and Spo13 are involved in both meiosis and Q-nMT bundle formation. As suggested by the Reviewer, we have performed western blots to assess the expression of those proteins in proliferation and quiescence (4d). We tagged Spo13 with either GFP, HA or Myc but none of the fusion proteins were functional. Yet, as shown in the new Sup Fig 3I, Mam1-GFP, Csm1-GFP and Lrs4-GFP were expressed both in proliferation and quiescence.

      (13) In the laser ablation experiments that demonstrate that KT-MT attachments are not needed in order to maintain Q-nMT bundles once formed, anaphase spindles of proliferating cells were cut as a control (Supp. Fig 3I). However, late anaphase cells have already segregated the chromosomes, which lie next to the SPBs (this can be evidenced by looking at Dad2-GFP localization in Supp. Fig 3I), so that only interpolar MTs are severed in these experiments. The authors should have instead used metaphase cells as a control, since chromosomes are maintained at the spindle midzone and the length and width of the metaphase spindle is more similar to that of the Q-nMT bundle.

      We have tried to “cut” short metaphase spindles, but as they are < 1 µm, after the laser pulse, it is difficult to verify that spindles are indeed cut and not solely “bleached”. Furthermore, after the cut, the remaining MT structure that is detectable is very short, and we are not confident in our length measurements. Yet, this type of experiment has been done in S. pombe (Khodjakov et al, Cur Biol 2004 and Zareiesfandabadi et al, Biophys. J. 2022). In these articles the authors have demonstrated that after a cut, metaphase spindles are unstable and rapidly shrink through the action of Kinesin14 and dynein. This is now mentioned in the text line 265.

      (14) In the experiment that shows that cycloheximide prevents Q-nMT disassembly after quiescence exit, and therefore that this process requires de novo protein synthesis (Fig. 5A), cells are indicated to express only Spc42-RFP and Nuf2-GFP. However, Stu2-GFP images are also shown next to the graph and, according to the figure legend, it was indeed Stu2-GFP that was used to measure individual Q-nMT bundles in cells treated with cycloheximide. In the graph, additionally, time t=0 represents the onset of MT bundle depolymerization, but Q-nMT bundle disassembly does not take place after cycloheximide treatment. The authors should clarify these aspects of the experiment.

      Following the Reviewer’s suggestion, to clarify these aspects we have split Fig. 5A into 2 panels.

      Finally, some minor issues are:

      (1) The text should be checked for proper spelling and grammar.

      We have done our best.

      (2) In some instances, there is no indication of how many cells were imaged and analyzed.

      We now provide all these details either in the figure itself or in the figure legend.

      (3) Besides the Q-nMT bundle, it is sometimes noticeable an additional strong cytoplasmic fluorescent signal in cells that express mTQZ-Tub1 and/or mRuby-Tub1 (e.g., Figs 1F, 1H and, particularly, Supp Fig 1H). What is the nature of these cytoplasmic MT structures?

      We did mention this observation in the materials and methods section (see line 526-528). This signal is a background fluorescence signal detected with our long pass GFP filter. It is not GFP, as it appears “yellowish” when viewed through the microscope oculars. This background signal can also be observed in quiescent WT cells that do not express any GFP. We do not know what molecule is at the origin of that signal, but it may be a derivative of an adenylic metabolite that accumulates in quiescence and could fluoresce at around 550 nm; this is pure speculation.

      (4) It is remarkable that a 20-30% decrease in tubulin levels had such a strong impact on the assembly of the Q-nMT bundle (Supp. Fig. 2). Can this phenotype be recovered by increasing the amount of tubulin in the mutants impaired for tubulin folding?

      Yes, this is astonishing, but we believe our data are very solid since we observed this in both tub3Δ cells and all the tubulin folding mutants we have tested (see Sup. Fig. 2). To answer the Reviewer’s question, we would need to increase the amount of properly folded tubulin in a tubulin folding mutant. One way to try to do that would be to find suppressors of GIM mutations, but this is a lengthy process that we feel would not add much strength to this conclusion.

      (5) The graphs displaying the length of the Q-nMT bundle in several mutants in microtubule motors throughout a time course are presented in a different manner than in previous experiments, with data points for individual cells being only shown for the most extreme values (Fig 4C, 4H). It would be advisable, for the sake of comparison, to unify the way to represent the data.

      We have now unified the way we present our figures.

      (6) How was the exit from quiescence established in the experiments evaluating Q-nMT disassembly? How synchronous is quiescence exit in the whole population of cells once they are transferred to a rich medium?

      We set the “zero” time at cell refeeding with new medium. In fact, quiescence exit is NOT synchronous. We have reported this in previous publications, with the best description of this phenomenon being in Laporte et al, MIC 2017.

      The graphs below show the same data: in the left graph, the kinetics are aligned on the onset of SPB separation, while in the right graph (Fig 5A), they are aligned on the onset of MT shrinking.

      Author response image 10.

      We can add this piece of data in a Sup Figure if the Reviewer believes it is important.

      Reviewer #2 (Recommendations For The Authors):

      General:

      • In general, more precise language that accurately describes the experiments would improve the text.

      We have tried to do our best to improve the text.

      • The authors should clearly define what they mean by an active process and provide context to support this statement regarding the Q-nMT.

      We have strived to clarify this point in the text (see paragraph from line 146 to 178).

      • It is reasonable to assume that structures composed of microtubules are dynamic during the assembly process. The authors should clarify what they mean by "stable by default i.e., intrinsically stable." Do they mean that when Q-nMT assembly starts, it will proceed to completion regardless of a change in condition?

      We mean that in phase I the Q-nMT bundle is stabilized as it grows and that stabilization is concomitant with polymerization. By contrast, MTs polymerized during phase II are not stabilized upon elongation beyond the phase I polymer, and get stabilized later, in a separate phase (i.e. in phase III). We hope to have clarified this point in the text (see line 108-110).

      • In lines 33-34, the authors claim that the Q-nMT bundle functions as a "sort of checkpoint for cell cycle resumption." This wording is imprecise, and more significantly the authors do not provide evidence supporting a direct role for Q-nMT in a quiescence checkpoint that inhibits re-entry into the cell cycle.

      We have softened and clarified the text in the abstract (see line 29-30), in the introduction (line 101-104), in the results section (line 331-332) and in the discussion (line 426-430).

      • Many statements are qualitative and subjective. Quantitative statements supported by the results should be used where possible, and if not possible restated or removed.

      We provide statistical data analysis for all the figures.

      • The number of hours after glucose exhaustion used for each phase varies between assays. This is likely a logistical issue but should be explained.

      This is indeed a logistical issue and when pertinent, it is explained in the text.

      • It would be interesting to address how this process occurs in diploids. Do they form a Q-nMT? How does this relate to the decision to enter meiosis?

      Diploid cells enter meiosis when they are starved for nitrogen. Upon glucose exhaustion diploids do form a Q-nMT bundle. This is shown and measured in the new Sup Fig1C. In fact, in diploids, Q-nMT bundles are thicker than in haploid cells.

      • It would be interesting to address how the timescale of this process compares to the types of nutrient stress yeast would be exposed to in the environment.

      We have transferred proliferating yeast cells to water, to try to mimic what could happen when yeast cells face rain in the wild. As shown below, they do form a Q-nMT bundle that becomes nocodazole resistant after 30h. This data is now provided in the new Sup Fig 1D.

      • It is recommended that the authors use FRAP experiments to directly measure the stability of the Q-nMT bundles.

      This experiment was published in (Laporte et al, 2013). Please see response to Reviewer #1.

      • In many cases, the description of the experimental methods lacks sufficient detail to evaluate the approach or for independent verification of results.

      We have strived to provide a more detailed materials and methods section, as well as more detailed figure legends and statistical information.

      Specific comments on figures:

      • In Figure 1 c), what do the polygons represent? They do not contain all the points of the associated colour.

      The polygon represented the area of distribution of 90% of the data points. As they did not significantly add to the data presentation they have been removed.

      • In Figure 2 a), is the use of two different sets of markers to control for the effect of the markers on microtubule dynamics?

      Yes, we are always concerned about the influence of GFP on our results, so very often we replicate our experiments with different fluorescent proteins or even with different proteins tagged with GFP. This is now mentioned in the text (line 184-186).

      • Is it accurate to say (line 201, figure 3 a)) that no Q-nMT bundles were detected in ndc80-1 cells shifted to 37 degrees, or are they just shorter?

      As shown in Fig 3A, in ndc80-1 cells, most of the MT structures that we measured are below 0.5 µm. This has been rephrased in the text (line 214-215).

      • Lines 265-269, figure 4 b), how can the phenotype observed in cin8∆ cells be explained given the low abundance of Cin8 that is detected in quiescent cells?

      A faint fluorescence signal is not synonymous with an absence of function. As shown in Sup Fig 4B, we do detect Cin8-GFP in quiescent cells.

      • Quantification is needed in Figure 4 panels c) and h).

      Fig 4C and 4H have been changed and quantifications are provided in the figure legend.

      Reviewer #3 (Recommendations For The Authors):

      A few points should be addressed for clarity:

      (1) Sup. Fig. 1K: are only viable cells used for the colony-forming assay? How were these selected? If not, the assay would just measure survival (as in the viability assay).

      Yes, only viable cells were selected for the colony forming assay. We used methylene blue to stain dead cells. Then, we used a micromanipulation instrument (Singer Spore Play) that is commonly used for tetrad dissection to select “non blue cells” and position them on a plate (as we do with spores). Each micromanipulated cell is then allowed to grow on the plate and we count colonies (see picture in Sup Fig 1L right panel). This was described in Laporte et al, JCB 2011. We have added that piece of information in the legend (line 1129-1130) and in the M&M section (line 580-586).

      (2) Could Tub3 have a role in phase I? It is not clear why the authors conclude involvement only in phase II.

      As can be seen in Fig 2D, MT bundle length and thickness are quite similar in WT and Tub1-only cells in phase I, indicating that the absence of Tub3 has no effect in phase I. In Tub1-only cells, MT bundles are thinner in both phase II and phase III, yet they get fully stabilized in phase III. Thus, the effect of Tub3 is largely specific to the nucleation/elongation of phase II MTs. We hope to have clarified that point in the text (line 203-207).

      (3) Quantifications, statistics: for all quantifications, the authors should clearly state the number of experiments (replicates), and number of cells used in each, and what number was used for statistics. For all quantifications in cells, it seems that the values from the total number of cells across different experiments were plotted and used for statistics. This is not very useful and results in extremely small p values. I assume that the values for individual cells were obtained from multiple, independent experiments. Unless there are technical limitations that allow only a very small sample size (not the case here for most experiments), for experiments involving treatments the authors should determine values for each experiment and show statistics for comparison between experiments rather than individual cells pooled from multiple experiments.

      All the experiments have been done at least in replicate. In the new Fig. 1A, we now display each independent experiment with a specific color code. For Fig 2B and 2C we now provide the data obtained for each separate experiment in Sup Fig 2C. Additional details about quantifications and statistics are provided in the M&M section or in the specific figure legends.
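
      The Reviewer's suggestion to compare experiments rather than pooled cells can be sketched as follows. This is an illustration only: the data values, the function names and the choice of Welch's t statistic are assumptions, not the analysis used in the manuscript.

```python
from math import sqrt
from statistics import mean, stdev


def per_experiment_means(cells):
    """Collapse (experiment_id, value) cell measurements to one mean per experiment."""
    by_exp = {}
    for exp_id, value in cells:
        by_exp.setdefault(exp_id, []).append(value)
    return [mean(values) for _, values in sorted(by_exp.items())]


def welch_t(a, b):
    """Welch's t statistic computed on two samples of per-experiment means."""
    var_a, var_b = stdev(a) ** 2, stdev(b) ** 2
    return (mean(a) - mean(b)) / sqrt(var_a / len(a) + var_b / len(b))


# Hypothetical bundle lengths (µm): three independent experiments per condition.
control = [(1, 2.0), (1, 2.2), (2, 1.9), (2, 2.1), (3, 2.3), (3, 2.1)]
treated = [(1, 1.0), (1, 1.2), (2, 0.9), (2, 1.1), (3, 1.3), (3, 1.1)]

control_means = per_experiment_means(control)  # n = 3 experiments, not 6 cells
treated_means = per_experiment_means(treated)
t_stat = welch_t(control_means, treated_means)
```

      The sample size entering the test is then the number of independent experiments, which avoids the very small p values produced by pooling hundreds of cells.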

    1. Author response:

      The following is the authors’ response to the previous reviews.

      Reviewer #1:

      I am satisfied with all clarifications and additional analyses performed by the authors. 

      The only concern I have is about changes in running after [AM+VM] mismatches. 

      The authors reported that they "found no evidence of a change in running speed or pupil diameter following [AM + VM] mismatch (Figures S5A)" (line 197). 

      Nevertheless, it seems that there is a clear increase in running speed for the [AM+VM] condition (S5A). Could this be more specifically quantified? I am concerned that part of the [AM+VM] could stem from this change in running behavior. Could one factor out the running contribution? 

      Our apologies, this was unintentionally omitted. We have added the quantification to Table S1 and included the results of the significance test in Fig S2A, Fig S4A and Fig S5A. The increase in running speed upon MM presentation (0.5 – 1 s), compared to the baseline running speed in the time window preceding MM presentation (-0.5 – 0 s), was not significant in any of the tested conditions.

      In the process of adding the statistics, we noticed an unfortunate inconsistency in our figures that relates to Figure S5A. The data shown in all other Figures is aligned to the onset of audiomotor mismatch. In Figure S5A, however, the data were aligned to the onset of the visuomotor mismatch. As there is a differential delay in the closed loop coupling of auditory and visual feedback of approximately 170 ms (as described in the methods), visuomotor mismatch onset is slightly before audiomotor mismatch onset. We have corrected this now in the manuscript but have done the statistical analysis for both old and new versions of the figure. In neither case do we find evidence of a running speed response.
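
      The realignment and the window comparison described above can be sketched as follows. The trace, the sampling rate and all function names are hypothetical; only the approximate 170 ms delay and the two windows (-0.5 to 0 s and 0.5 to 1 s) come from the text.

```python
def realign(times, delay_s):
    """Shift a time axis so that 0 marks an onset that occurs delay_s later."""
    return [t - delay_s for t in times]


def window_mean(trace, times, start, stop):
    """Mean of the samples whose time (s, relative to onset) lies in [start, stop)."""
    samples = [v for v, t in zip(trace, times) if start <= t < stop]
    return sum(samples) / len(samples)


# Hypothetical running-speed trace (cm/s) sampled at 10 Hz,
# with time 0 at visuomotor mismatch onset.
sample_rate_hz = 10
times_vm = [i / sample_rate_hz for i in range(-10, 20)]  # -1.0 s to 1.9 s
speed = [20.0] * len(times_vm)                           # flat trace: no response

AM_DELAY_S = 0.17  # approximate lag of audiomotor onset behind visuomotor onset
times_am = realign(times_vm, AM_DELAY_S)  # time 0 now at audiomotor mismatch onset

baseline = window_mean(speed, times_am, -0.5, 0.0)
response = window_mean(speed, times_am, 0.5, 1.0)
change = response - baseline
```

      Because the shift is small relative to the 0.5 s windows, the same comparison can be run on both alignments, as was done for the old and new versions of the figure.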

      The authors thoroughly addressed the concerns raised. In my opinion, this has substantially strengthened the manuscript, enabling much clearer interpretation of the results reported. I commend the authors for the response to review. Overall, I find the experiments elegantly designed, and the results robust, providing compelling evidence for non-hierarchical interactions across neocortical areas and more specifically for the exchange of sensorimotor prediction error signals across modalities. 

      We are happy to hear this!

      Reviewer #2:

      The incorporation of the analysis of the animal's running speed and the pupil size upon sound interruption improves the interpretation of the data. The authors can now conclude that responses to the mismatch are not due to behavioral effects. 

      The issue of the relationship between mismatch responses and offset responses remains uncommented. The auditory system is sensitive to transitions, also to silence. See the work of the Linden or the Barkat labs (including the work of the first author of this manuscript) on offset responses, and also that of the Mesgarani lab (Khalighinejad et al., 2019) on responses to transitions 'to clean' (Figure 1c) in human auditory cortex. Offset responses, as the first author knows well, are modulated by intensity and stimulus length (after adaptation?). That responses to the interruption of the sound are similar in quality, if not quantity, in the closed and open loop conditions suggest that offset response might modulate the mismatch response. A mismatch response that reflects a break in predictability would presumably be less modulated by the exact details of the sensory input than an offset response. Therefore, what is the relationship between the mismatch response and the mean sound amplitude prior to the sound interruption (for example during the preceding 1 second)? And between the mismatch response and the mean firing rate over the same period? 

      Finally, how do visual stimuli modulate sound responses in the absence of a mismatch? Is the multimodal response potentiation specific to a mismatch?

      There are probably two points important to clarify before answering the question – just to make sure there is no semantic misunderstanding. 

      (1) In the jargon of predictive processing, a prediction error is a deviation from a predictable relationship. This can be sensorimotor coupling (as in audio- and visuomotor mismatch), stimulus history (as in oddball, or sound offset responses), surround sensory input (as in endstopping response and center-surround effects in visual processing), etc. A sound offset perceived by an animal in an open loop condition is thus a negative prediction error based on stimulus history (this assumes the animal has no way to predict the time of offset – as is the case in our experiments). We are primarily interested in our work here in characterizing negative prediction errors that result from motor-related predictions – hence the comparison we use is unpredictable sound offset in closed-loop coupling vs. unpredictable sound offset in open-loop coupling. The first is a mixture of an audiomotor prediction error and a stimulus history prediction error. The second is just a stimulus history prediction error. Thus, we compare the two types of responses to isolate the component that can only be attributed to audiomotor prediction errors. 

      (2) Audiomotor mismatch responses can of course be explained in a large variety of ways. For example, one could consider a sound offset a sensory stimulus. One could further assume that locomotion increases sensory responses. If so, one could explain audiomotor mismatch responses as a locomotion related gain of a sensory offset response. However, we need to further postulate that this locomotion related gain is stimulus specific, as for sound onset responses there is no detectable difference between locomotion and sitting. Thus, we are left with a model that explains audiomotor mismatch responses as a “stimulus specific locomotion gain of sensory responses”. This is correct – it is just not very satisfying, has no computational basis, and makes no useful predictions (see e.g. https://pubmed.ncbi.nlm.nih.gov/36821437/ for an extended treatise of exactly this point for visuomotor mismatch responses).

      That responses to the interruption of the sound are similar in quality, if not quantity, in the closed and open loop conditions suggest that offset response might modulate the mismatch response.

      Conceptually both a “sound offset” and an “audiomotor mismatch” are negative prediction errors. Could one describe the effect we see as an audiomotor mismatch modulating a sound offset? Certainly. But if the reviewer means modulate in the neuromodulatory sense, we are not aware of any neuromodulatory response that would be fast enough, or strong enough, to have these effects (we have looked into ACh, NA, and Ser; unpublished, no MM response). Alternatively, the two responses could simply add linearly (as predictive processing would predict). Given that AM mismatch responses are likely computed in auditory cortex, we see no reason to speculate that anything more complicated is happening than a linear summation of different prediction error responses.
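
      Under the linear summation assumption, the closed-loop versus open-loop comparison amounts to a subtraction. The sketch below uses invented response values and a hypothetical function name, purely to make the logic explicit.

```python
def isolate_audiomotor(closed_loop, open_loop):
    """Subtract the open-loop response from the closed-loop response, sample by sample."""
    return [c - o for c, o in zip(closed_loop, open_loop)]


# Hypothetical trial-averaged responses (dF/F, %) at four time points after sound offset.
stimulus_history_pe = [0.0, 0.5, 1.0, 0.5]  # present in both conditions
audiomotor_pe = [0.0, 1.0, 2.0, 1.0]        # present only under closed-loop coupling

# Assumption: the two prediction error responses add linearly.
open_loop = stimulus_history_pe
closed_loop = [s + a for s, a in zip(stimulus_history_pe, audiomotor_pe)]

recovered = isolate_audiomotor(closed_loop, open_loop)
```

      If the responses did not add linearly, the difference would no longer equal the audiomotor component, which is why the linearity assumption matters for this comparison.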

      A mismatch response that reflects a break in predictability would presumably be less modulated by the exact details of the sensory input than an offset response. Therefore, what is the relationship between the mismatch response and the mean sound amplitude prior to the sound interruption (for example during the preceding 1 second)? And between the mismatch response and the mean firing rate over the same period? 

      The reviewer’s intuition here – that mismatch responses have a lower resolution than what one thinks of as sensory responses (or sound offset responses) – is probably not warranted. Experiments that quantify the resolution of mismatch responses are relatively data intense – and to the best of our knowledge this has only been done once in the visual system for visuomotor mismatch responses (Zmarz and Keller, 2016). Here we found that visuomotor mismatch responses exhibited matched spatial (in visual space) resolution to that of visual responses. 

      Regarding the suggested analyses: In a closed loop session, the sound amplitude preceding the mismatch is directly related to the running speed of the mouse. In visual cortex, the amplitude of visuomotor mismatch responses linearly scales with running speed (and consequently visual flow speed) prior to the mismatch – as predicted by predictive processing. See e.g. figure 4B in (Zmarz and Keller, 2016). We have tried this analysis for audiomotor mismatches in the previous round of reviews, but we fear we do not have sufficient data to address this question properly. If we look at how mismatch responses change as a function of locomotion speed (sound amplitude) across the entire population of neurons, we have no evidence of a systematic change (and the effects are highly variable as a function of speed bins we choose). However, just looking at the most audiomotor mismatch responsive neurons, we find a trend for increased responses with increasing running speed (Author response image 1). We analyzed the top 5% of cells that showed the strongest response to mismatch (MM) and divided the MM trials into three groups based on running speed: slow (10-20 cm/s), middle (20-30 cm/s), and fast (>30 cm/s). Given the fact that we have on average 14 mismatch events in total per neuron, the analysis when split by running speed is under-powered.  

      Author response image 1.

      The average response of strongest AM MM responders to AM mismatches as a function of running speed (data are from 51 cells, 11 fields of view, 6 mice).
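
      The grouping of mismatch trials by running speed can be sketched as below. The bin edges follow the text (10-20, 20-30 and >30 cm/s); the function names, the example speeds, the treatment of exact boundary values and the exclusion of trials below 10 cm/s are assumptions.

```python
def speed_bin(speed_cm_s):
    """Assign a trial to a running-speed group; returns None below 10 cm/s."""
    if 10 <= speed_cm_s < 20:
        return "slow"
    if 20 <= speed_cm_s < 30:
        return "middle"
    if speed_cm_s >= 30:  # assumption: exactly 30 cm/s is counted as fast
        return "fast"
    return None


def group_trials(speeds):
    """Return trial indices per speed group, skipping trials that fall in no group."""
    groups = {"slow": [], "middle": [], "fast": []}
    for trial, speed in enumerate(speeds):
        label = speed_bin(speed)
        if label is not None:
            groups[label].append(trial)
    return groups


# Hypothetical running speeds (cm/s) for 14 mismatch trials of one neuron.
speeds = [5, 12, 18, 22, 25, 31, 40, 15, 28, 35, 9, 21, 33, 19]
groups = group_trials(speeds)
counts = {label: len(trials) for label, trials in groups.items()}
```

      With about 14 mismatch events per neuron, each group holds only a handful of trials, which illustrates why the split analysis is under-powered.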

      Regarding the relationship between mismatch response and firing rate prior to mismatch, we are not sure we understand the intuition. Does the reviewer mean the average firing rate of the mismatch neuron, or the population mean? The first is likely uninterpretable as it is bound to be confounded by regression-to-the-mean artefacts. But in either case, we would have no prediction of what to expect.

    1. Author response:

      The following is the authors’ response to the previous reviews

      Reviewer #1:

      Comment:

      The authors quantified information in gesture and speech, and investigated the neural processing of speech and gestures in pMTG and LIFG, depending on their informational content, in 8 different time-windows, and using three different methods (EEG, HD-tDCS and TMS). They found that there is a time-sensitive and staged progression of neural engagement that is correlated with the informational content of the signal (speech/gesture).

      Strengths:

      A strength of the paper is that the authors attempted to combine three different methods to investigate speech-gesture processing.

      We sincerely appreciate the reviewer’s recognition of our efforts in employing a multi-method approach, which integrates three complementary experimental paradigms, each leveraging distinct neurophysiological techniques to provide converging evidence.

      In Experiment 1, we found that the degree of inhibition in the pMTG and LIFG was strongly associated with the overlap in gesture-speech representations, as quantified by mutual information. Experiment 2 revealed the time-sensitive dynamics of the pMTG-LIFG circuit in processing both unisensory (gesture or speech) and multisensory information. Experiment 3, utilizing high-temporal-resolution EEG, independently replicated the temporal dynamics of gesture-speech integration observed in Experiment 2, further validating our findings.

      The striking convergence across these methodologically independent approaches significantly bolsters the robustness and generalizability of our conclusions regarding the neural mechanisms underlying multisensory integration.

      Comment 1: I thank the authors for their careful responses to my comments. However, I remain not convinced by their argumentation regarding the specificity of their spatial targeting and the time-windows that they used.

      The authors write that since they included a sham TMS condition, that the TMS selectively disrupted the IFG-pMTG interaction during specific time windows of the task related to gesture-speech semantic congruency. This to me does not show anything about the specificity of the time-windows itself, nor the selectivity of targeting in the TMS condition.

      (1) Selection of brain regions (IFG/pMTG)

      We thank the reviewer for their thoughtful consideration. The choice of the left IFG and pMTG as regions of interest (ROIs) was informed by a meta-analysis of fMRI studies on gesture-speech integration, which consistently identified these regions as critical hubs (see Author response table 1 for detailed studies and coordinates).

      Author response table 1.

      Meta-analysis of previous studies on gesture-speech integration.

      Based on the meta-analysis of previous studies, we selected the IFG and pMTG as ROIs for gesture-speech integration. The rationale for selecting these brain regions is outlined in the introduction in Lines 63-66: “Empirical studies have investigated the semantic integration between gesture and speech by manipulating their semantic relationship[15-18] and revealed a mutual interaction between them[19-21] as reflected by the N400 latency and amplitude[14] as well as common neural underpinnings in the left inferior frontal gyrus (IFG) and posterior middle temporal gyrus (pMTG)[15,22,23].”

      It is further described in Lines 77-78: “Experiment 1 employed high-definition transcranial direct current stimulation (HD-tDCS) to administer Anodal, Cathodal and Sham stimulation to either the IFG or the pMTG”, and in Lines 85-88: “Given the differential involvement of the IFG and pMTG in gesture-speech integration, shaped by top-down gesture predictions and bottom-up speech processing [23], Experiment 2 was designed to assess whether the activity of these regions was associated with relevant informational matrices”.

      In the Methods section, we clarified the selection of coordinates in Lines 194-200: “Building on a meta-analysis of prior fMRI studies examining gesture-speech integration[22], we targeted Montreal Neurological Institute (MNI) coordinates for the left IFG at (-62, 16, 22) and the pMTG at (-50, -56, 10). In the stimulation protocol for HD-tDCS, the IFG was targeted using electrode F7 as the optimal cortical projection site[36], with four return electrodes placed at AF7, FC5, F9, and FT9. For the pMTG, TP7 was selected as the cortical projection site[36], with return electrodes positioned at C5, P5, T9, and P9.”

      The selection of IFG or pMTG as integration hubs for gesture and speech has also been validated in our previous studies. Specifically, Zhao et al. (2018, J. Neurosci) applied TMS to both areas. Results demonstrated that disrupting neural activity in the IFG or pMTG via TMS selectively impaired the semantic congruency effect (reaction time costs due to semantic incongruence), while leaving the gender congruency effect unaffected.

      These findings identified the IFG and pMTG as crucial hubs for gesture-speech integration, guiding the selection of brain regions for our subsequent studies.

      (2) Selection of time windows

      The five key time windows (TWs) analyzed in this study were derived from our previous TMS work (Zhao et al., 2021, J. Neurosci), where we segmented the gesture-speech integration period (0–320 ms post-speech onset) into eight 40-ms windows. This interval aligns with established literature on gesture-speech integration, particularly the 200–300 ms window noted by the reviewer. As detailed in Lines 776-779: “Procedure of Experiment 2. Eight time windows (TWs, duration = 40 ms) were segmented in relative to the speech IP. Among the eight TWs, five (TW1, TW2, TW3, TW6, and TW7) were chosen based on the significant results in our prior study[23]. Double-pulse TMS was delivered over each of the TW of either the pMTG or the IFG”.
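
      The segmentation can be made concrete with a short sketch. The window count, the 40 ms width and the five selected TWs come from the text; the function name is hypothetical, and the 0 ms origin follows the description above (the quoted legend defines the windows relative to the speech IP, so the start is left as a parameter).

```python
def time_windows(start_ms=0, width_ms=40, count=8):
    """Return consecutive, non-overlapping windows as {name: (onset_ms, offset_ms)}."""
    return {
        f"TW{i + 1}": (start_ms + i * width_ms, start_ms + (i + 1) * width_ms)
        for i in range(count)
    }


windows = time_windows()

# The five windows carried over from the prior study.
selected_names = ("TW1", "TW2", "TW3", "TW6", "TW7")
selected = {name: windows[name] for name in selected_names}
```

      Changing `start_ms` shifts all eight windows together, so the same code covers an origin at speech onset or at the speech IP.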

      In our prior work (Zhao et al., 2021, J. Neurosci), we employed a carefully controlled experimental design incorporating two key factors: (1) gesture-speech semantic congruency (serving as our primary measure of integration) and (2) gesture-speech gender congruency (implemented as a matched control factor). Using a time-locked, double-pulse TMS protocol, we systematically targeted each of the eight predefined time windows (TWs) within the left IFG, left pMTG, or vertex (serving as a sham control condition). Our results demonstrated a TW-selective disruption of gesture-speech integration, indexed by the semantic congruency effect (i.e., a reaction time cost due to semantic conflict), when stimulating the left pMTG in TW1, TW2, and TW7, and when stimulating the left IFG in TW3 and TW6. Crucially, no significant effects were observed during sham stimulation or for the gender congruency control factor (Figure 3 from Zhao et al., 2021, J. Neurosci).

      This triple dissociation - showing effects only for semantic integration, only in active stimulation, and only at specific time points - provides compelling causal evidence that IFG-pMTG connectivity plays a temporally precise role in gesture-speech integration.

      Note that this work underwent rigorous peer review by two independent experts, both of whom endorsed our methodological approach. Their original evaluations are provided below:

      Reviewer 1: “significance: Using chronometric TMS-stimulation the data of this experiment suggests a feedforward information flow from left pMTG to left IFG followed by an information flow from left IFG back to the left pMTG.  The study is the first to provide causal evidence for the temporal dynamics of the left pMTG and left IFG found during gesture-speech integration.”

      Reviewer 2: “Beyond the new results the manuscript provides regarding the chronometrical interaction of the left inferior frontal gyrus and middle temporal gyrus in gesture-speech interaction, the study more basically shows the possibility of unfolding temporal stages of cognitive processing within domain-specific cortical networks using short-time interval double-pulse TMS. Although this method also has its limitations, a careful study planning as shown here and an appropiate discussion of the results can provide unique insights into cognitive processing.”

      References:

      Willems, R.M., Ozyurek, A., and Hagoort, P. (2009). Differential roles for left inferior frontal and superior temporal cortex in multimodal integration of action and language. Neuroimage 47, 1992-2004. 10.1016/j.neuroimage.2009.05.066.

      Drijvers, L., Jensen, O., and Spaak, E. (2021). Rapid invisible frequency tagging reveals nonlinear integration of auditory and visual information. Human Brain Mapping 42, 1138-1152. 10.1002/hbm.25282.

      Drijvers, L., and Ozyurek, A. (2018). Native language status of the listener modulates the neural integration of speech and iconic gestures in clear and adverse listening conditions. Brain and Language 177, 7-17. 10.1016/j.bandl.2018.01.003.

      Drijvers, L., van der Plas, M., Ozyurek, A., and Jensen, O. (2019). Native and non-native listeners show similar yet distinct oscillatory dynamics when using gestures to access speech in noise. Neuroimage 194, 55-67. 10.1016/j.neuroimage.2019.03.032.

      Holle, H., and Gunter, T.C. (2007). The role of iconic gestures in speech disambiguation: ERP evidence. J Cognitive Neurosci 19, 1175-1192. 10.1162/jocn.2007.19.7.1175.

      Kita, S., and Ozyurek, A. (2003). What does cross-linguistic variation in semantic coordination of speech and gesture reveal?: Evidence for an interface representation of spatial thinking and speaking. J Mem Lang 48, 16-32. 10.1016/S0749-596x(02)00505-3.

      Bernardis, P., and Gentilucci, M. (2006). Speech and gesture share the same communication system. Neuropsychologia 44, 178-190. 10.1016/j.neuropsychologia.2005.05.007.

      Zhao, W.Y., Riggs, K., Schindler, I., and Holle, H. (2018). Transcranial magnetic stimulation over left inferior frontal and posterior temporal cortex disrupts gesture-speech integration. Journal of Neuroscience 38, 1891-1900. 10.1523/Jneurosci.1748-17.2017.

      Zhao, W., Li, Y., and Du, Y. (2021). TMS reveals dynamic interaction between inferior frontal gyrus and posterior middle temporal gyrus in gesture-speech semantic integration. The Journal of Neuroscience, 10356-10364. 10.1523/jneurosci.1355-21.2021.

      Hartwigsen, G., Bzdok, D., Klein, M., Wawrzyniak, M., Stockert, A., Wrede, K., Classen, J., and Saur, D. (2017). Rapid short-term reorganization in the language network. Elife 6. 10.7554/eLife.25964.

      Jackson, R.L., Hoffman, P., Pobric, G., and Ralph, M.A.L. (2016). The semantic network at work and rest: Differential connectivity of anterior temporal lobe subregions. Journal of Neuroscience 36, 1490-1501. 10.1523/JNEUROSCI.2999-15.2016.

      Humphreys, G. F., Lambon Ralph, M. A., & Simons, J. S. (2021). A Unifying Account of Angular Gyrus Contributions to Episodic and Semantic Cognition. Trends in neurosciences, 44(6), 452–463. https://doi.org/10.1016/j.tins.2021.01.006

      Bonner, M. F., & Price, A. R. (2013). Where is the anterior temporal lobe and what does it do?. The Journal of neuroscience : the official journal of the Society for Neuroscience, 33(10), 4213–4215. https://doi.org/10.1523/JNEUROSCI.0041-13.2013

      Comment 2: It could still equally well be the case that other regions or networks relevant for gesture-speech integration are targeted, and it can still be the case that these timewindows are not specific, and effects bleed into other time periods. There seems to be no experimental evidence here that this is not the case.

      The selection of IFG and pMTG as regions of interest was rigorously justified through multiple lines of evidence. First, a comprehensive meta-analysis of fMRI studies on gesture-speech integration consistently identified these regions as central nodes (see response to comment 1). Second, our own previous work (Zhao et al., 2018, JN; 2021, JN) provided direct empirical validation of their involvement. Third, by employing the same experimental paradigm, we minimized the likelihood of engaging alternative networks. Fourth, even if other regions connected to IFG or pMTG might be affected by TMS, the distinct engagement of specific time windows of IFG and pMTG minimizes the likelihood of consistent influence from other regions.

      Regarding temporal specificity, our 2021 study (Zhao et al., 2021, JN, see details in response to comment 1) systematically examined the entire 0-320ms integration window and found that only select time windows showed significant effects for gesture-speech semantic congruency, while remaining unaffected during gender congruency processing. This double dissociation (significant effects for semantic integration but not gender processing in specific windows) rules out broad temporal spillover.

      Comment 3: To be more specific, the authors write that double-pulse TMS has been widely used in previous studies (as found in their table). However, the studies cited in the table do not necessarily demonstrate the level of spatial and temporal specificity required to disentangle the contributions of tightly-coupled brain regions like the IFG and pMTG during the speech-gesture integration process. pMTG and IFG are located in very close proximity, and are known to be functionally and structurally interconnected, something that is not necessarily the case for the relatively large and/or anatomically distinct areas that the authors mention in their table.

      Our methodological approach is strongly supported by an established body of research employing double-pulse TMS (dpTMS) to investigate neural dynamics across both primary motor and higher-order cognitive regions. As documented in Author response table 1, multiple studies have successfully applied this technique to: (1) primary motor areas (tongue and lip representations in M1), and (2) semantic processing regions (including pMTG, PFC, and ATL). Particularly relevant precedents include:

      (1) Teige et al. (2018, Cortex): Demonstrated precise spatial and temporal specificity by applying 40ms-interval dpTMS to ATL, pMTG, and mid-MTG across multiple time windows (0-40ms, 125-165ms, 250-290ms, 450-490ms), revealing distinct functional contributions from ATL versus pMTG.

      (2) Vernet et al. (2015, Cortex): Successfully dissociated functional contributions of right IPS and DLPFC using 40ms-interval dpTMS, despite their anatomical proximity and functional connectivity.

      These studies confirm double-pulse TMS can discriminate interconnected nodes at short timescales. Our 2021 study further validated this for IFG-pMTG.

      Author response table 2.

      Double-pulse TMS studies on brain regions over 3-60 ms time interval

      References:

      Teige, C., Mollo, G., Millman, R., Savill, N., Smallwood, J., Cornelissen, P. L., & Jefferies, E. (2018). Dynamic semantic cognition: Characterising coherent and controlled conceptual retrieval through time using magnetoencephalography and chronometric transcranial magnetic stimulation. Cortex, 103, 329-349.

      Vernet, M., Brem, A. K., Farzan, F., & Pascual-Leone, A. (2015). Synchronous and opposite roles of the parietal and prefrontal cortices in bistable perception: a double-coil TMS–EEG study. Cortex, 64, 78-88.

      Comment 4: But also more in general: The mere fact that these methods have been used in other contexts does not necessarily mean they are appropriate or sufficient for investigating the current research question. Likewise, the cognitive processes involved in these studies are quite different from the complex, multimodal integration of gesture and speech. The authors have not provided a strong theoretical justification for why the temporal dynamics observed in these previous studies should generalize to the specific mechanisms of gesture-speech integration.

      The neurophysiological mechanisms underlying double-pulse TMS (dpTMS) are well-characterized. While it is established that single-pulse TMS can produce brief artifacts (typically within 0–10 ms) due to transient cortical depolarization (Romero et al., 2019, NC), the dynamics of double-pulse TMS (dpTMS) involve more intricate inhibitory interactions. Specifically, the first pulse increases membrane conductance via GABAergic shunting inhibition, effectively lowering membrane resistance and attenuating the excitatory impact of the second pulse. This results in a measurable reduction in cortical excitability at the paired-pulse interval, as evidenced by suppressed motor evoked potentials (MEPs) (Paulus & Rothwell, 2016, J Physiol). Importantly, this neurophysiological mechanism is independent of cognitive domain and has been robustly demonstrated across multiple functional paradigms.

      In our study, we did not rely on previously reported timing parameters but instead employed a dpTMS protocol using a 40-ms inter-pulse interval. Based on the inhibitory dynamics of this protocol, we designed a sliding temporal window sufficiently broad to encompass the integration period of interest. This approach enabled us to capture and localize the critical temporal window associated with ongoing integrative processing in the targeted brain region.

      We acknowledge that the previous phrasing may have been ambiguous. A clearer and more detailed description of the dpTMS protocol has now been provided in Lines 88-92: “To this end, we employed chronometric double-pulse transcranial magnetic stimulation, which is known to transiently reduce cortical excitability at the inter-pulse interval[27]. Within a temporal period broad enough to capture the full duration of gesture–speech integration[28], we targeted specific timepoints previously implicated in integrative processing within IFG and pMTG [23].”

      References:

      Romero, M.C., Davare, M., Armendariz, M. et al. Neural effects of transcranial magnetic stimulation at the single-cell level. Nat Commun 10, 2642 (2019). https://doi.org/10.1038/s41467-019-10638-7

      Paulus W, Rothwell JC. Membrane resistance and shunting inhibition: where biophysics meets state-dependent human neurophysiology. J Physiol. 2016 May 15;594(10):2719-28. doi: 10.1113/JP271452. PMID: 26940751; PMCID: PMC4865581.

      Obermeier, C., & Gunter, T. C. (2015). Multisensory Integration: The Case of a Time Window of Gesture-Speech Integration. Journal of Cognitive Neuroscience, 27(2), 292-307. https://doi.org/10.1162/jocn_a_00688

      Comment 5: Moreover, the studies cited in the table provided by the authors have used a wide range of interpulse intervals, from 20 ms to 100 ms, suggesting that the temporal precision required to capture the dynamics of gesture-speech integration (which is believed to occur within 200-300 ms; Obermeier & Gunter, 2015) may not even be achievable with their 40 ms time windows.

      Double-pulse TMS has been empirically validated across neurocognitive studies as an effective method for establishing causal temporal relationships in cortical networks, with demonstrated sensitivity at timescales spanning 3-60 ms. Our selection of a 40-ms interpulse interval represents a compromise between temporal precision and physiological feasibility, as evidenced by its successful application in dissociating functional contributions of interconnected regions including ATL/pMTG (Teige et al., 2018) and IPS/DLPFC (Vernet et al., 2015). This methodological approach combines established experimental rigor with demonstrated empirical validity for investigating the precisely timed IFG-pMTG dynamics underlying gesture-speech integration, as shown in our current findings and prior work (Zhao et al., 2021).

      Our experimental design comprehensively sampled the 0-320 ms post-stimulus period, fully encompassing the critical 200-300 ms window associated with gesture-speech integration, as raised by the reviewer. Notably, our results revealed temporally distinct causal dynamics within this period: the significantly reduced semantic congruency effect emerged at IFG at 200-240ms, followed by feedback projections from IFG to pMTG at 240-280ms. This precisely timed interaction provides direct neurophysiological evidence for the proposed architecture of gesture-speech integration, demonstrating how these interconnected regions sequentially contribute to multisensory semantic integration.

      Comment 6: I do appreciate the extra analyses that the authors mention. However, my 5th comment is still unanswered: why not use entropy scores as a continuous measure?

      Analyses with MI and entropy as continuous variables were conducted using Representational Similarity Analysis (RSA) (Popal et al., 2019). These analyses aimed to build a model that predicts neural responses from these feature metrics.

      To capture dynamic temporal features indicative of different stages of multisensory integration, we segmented the EEG data into overlapping time windows (40 ms in duration with a 10 ms step size). The 40 ms window was chosen based on the TMS protocol used in Experiment 2, which also employed a 40 ms time window. The 10 ms step size (equivalent to 5 time points) was used to detect subtle shifts in neural responses that might not be captured by larger time windows, allowing for a more granular analysis of the temporal dynamics of neural activity.
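      The segmentation described above can be sketched as follows. This is a minimal illustration, not the analysis code used for the study: the 500 Hz sampling rate is inferred from the statement that a 10 ms step equals 5 time points, and the 1000 ms epoch length is an assumption chosen because it reproduces the 97 windows reported below. The function name and parameters are hypothetical.

```python
def sliding_windows(epoch_ms, win_ms=40, step_ms=10, sfreq_hz=500):
    """Return overlapping (start_ms, stop_ms) windows plus samples per window and per step.

    sfreq_hz=500 is an assumption inferred from "10 ms = 5 time points".
    """
    ms_per_sample = 1000 / sfreq_hz
    # Window onsets run from 0 to the last onset that still fits inside the epoch.
    starts = range(0, epoch_ms - win_ms + 1, step_ms)
    windows = [(start, start + win_ms) for start in starts]
    return windows, int(win_ms / ms_per_sample), int(step_ms / ms_per_sample)


# Assumed epoch length of 1000 ms (not stated in the text).
windows, samples_per_window, samples_per_step = sliding_windows(epoch_ms=1000)
print(len(windows), samples_per_window, samples_per_step)  # prints "97 20 5"
```

      Under these assumptions, each window holds 20 samples and consecutive windows overlap by 30 ms, which is what allows shifts smaller than the window length to be tracked.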

      Following segmentation, the EEG data were reshaped into a four-dimensional matrix (42 channels × 20 time points × 97 time windows × 20 features). To construct a neural similarity matrix, we averaged the EEG data across time points within each channel and each time window. The resulting matrix was then processed using the pdist function to compute pairwise distances between data points. This allowed us to calculate correlations between the neural matrix and three feature similarity matrices, which were constructed in a similar manner. These three matrices corresponded to (1) gesture entropy, (2) speech entropy, and (3) mutual information (MI). This approach enabled us to quantify how well the neural responses corresponded to the semantic dimensions of gesture and speech stimuli at each time window.

      To determine the significance of the correlations between neural activity and feature matrices, we conducted permutation tests with 1000 permutations. In this procedure, we randomized the data or feature matrices and recalculated the correlations repeatedly, generating a null distribution against which the observed correlation values were compared. A correlation was considered significant if the observed value exceeded the null distribution threshold (p < 0.05). This permutation approach helps mitigate the risk of spurious correlations between the neural data and feature matrices.
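      For a single channel and time window, the RSA and permutation steps described above can be sketched as below. This is an illustrative sketch under stated assumptions rather than the original implementation: the dissimilarity measure (absolute difference, which is what a Euclidean pairwise distance reduces to for one value per stimulus), the use of Pearson correlation, the choice to shuffle the feature values rather than the neural data, and the synthetic data are all assumptions, and every function name is hypothetical.

```python
import itertools
import math
import random


def pearson(x, y):
    """Pearson correlation between two equal-length sequences."""
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)


def dissimilarity_vector(values):
    """Condensed dissimilarity matrix: absolute difference for every stimulus pair."""
    return [abs(a - b) for a, b in itertools.combinations(values, 2)]


def rsa_permutation(neural, feature, n_perm=1000, seed=0):
    """Correlate neural and feature dissimilarities; build the null by shuffling stimuli."""
    rng = random.Random(seed)
    neural_rdm = dissimilarity_vector(neural)
    observed = pearson(neural_rdm, dissimilarity_vector(feature))
    shuffled = list(feature)
    n_exceed = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)  # break the stimulus-to-feature mapping
        if pearson(neural_rdm, dissimilarity_vector(shuffled)) >= observed:
            n_exceed += 1
    # Add-one correction so the estimated p-value is never exactly zero.
    return observed, (n_exceed + 1) / (n_perm + 1)


# Synthetic example: 20 stimuli whose mean amplitude tracks an entropy-like feature.
data_rng = random.Random(1)
feature = [0.1 * i for i in range(20)]
neural = [2.0 * f + data_rng.gauss(0, 0.05) for f in feature]
observed_r, p_value = rsa_permutation(neural, feature)
print(round(observed_r, 3), p_value)
```

      In a full analysis this routine would be repeated for every channel and time window, and the resulting p-values would then feed the clustering step.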

      Finally, significant correlations were subjected to clustering analysis, which grouped similar neural response patterns across time windows and channels. This clustering allowed us to identify temporal and spatial patterns in the neural data that consistently aligned with the semantic features of gesture and speech stimuli, thus revealing the dynamic integration of these multisensory modalities across time. Results are as follows:

      (1)  Two significant clusters were identified for gesture entropy (Figure 1 left). The first cluster was observed between 60-110 ms (channels F1 and F3), with correlation coefficients (r) ranging from 0.207 to 0.236 (p < 0.001). The second cluster was found between 210-280 ms (channel O1), with r-values ranging from 0.244 to 0.313 (p < 0.001).

      (2)  For speech entropy (Figure 1 middle), significant clusters were detected in both early and late time windows. In the early time windows, the largest significant cluster was found between 10-170 ms (channels F2, F4, F6, FC2, FC4, FC6, C4, C6, CP4, and CP6), with r-values ranging from 0.151 to 0.340 (p = 0.013), corresponding to the P1 component (0-100 ms). In the late time windows, the largest significant cluster was observed between 560-920 ms (across the whole brain, all channels), with r-values ranging from 0.152 to 0.619 (p = 0.013).

      (3)  For mutual information (MI) (Figure 1 right), a significant cluster was found between 270-380 ms (channels FC1, FC2, FC3, FC5, C1, C2, C3, C5, CP1, CP2, CP3, CP5, FCz, Cz, and CPz), with r-values ranging from 0.198 to 0.372 (p = 0.001).

      Author response image 1.

      Results of RSA analysis.

      These additional findings suggest that, even with a different modeling approach, neural responses indexed by entropy and mutual information are temporally aligned with the distinct ERP components and ERP clusters reported in the current manuscript. This alignment further consolidates the results and reinforces our conclusions. Considering the length of the manuscript, we did not include these results in it.

      Reference:

      Popal, H., Wang, Y., & Olson, I. R. (2019). A guide to representational similarity analysis for social neuroscience. Social cognitive and affective neuroscience, 14(11), 1243-1253.

      Comment 7: In light of these concerns, I do not believe the authors have adequately demonstrated the spatial and temporal specificity required to disentangle the contributions of the IFG and pMTG during the gesture-speech integration process. While the authors have made a sincere effort to address the concerns raised by the reviewers, and have done so with a lot of new analyses, I remain doubtful that the current methodological approach is sufficient to draw conclusions about the causal roles of the IFG and pMTG in gesture-speech integration.

      To sum up:

      (1) Empirical validation from our prior work (Zhao et al., 2018, 2021, JN): The selection of IFG and pMTG as target regions was informed by both: (1) a comprehensive meta-analysis of fMRI studies on gesture-speech integration, and (2) our own prior causal evidence from Zhao et al. (2018, J Neurosci), with detailed stereotactic coordinates provided in the attached Response to Editors and Reviewers letter. The temporal parameters were similarly grounded in empirical data from Zhao et al. (2021, J Neurosci), where we systematically examined eight consecutive 40-ms windows spanning the full integration period (0-320 ms). This study revealed a triple dissociation of effects - occurring exclusively during: (i) semantic integration (but not control tasks), (ii) active stimulation (but not sham), and (iii) specific time windows (but not all time windows) - providing robust causal evidence for the spatiotemporal specificity of IFG-pMTG interactions in gesture-speech processing. Notably, all reviewers recognized the methodological strength of this dpTMS approach in their evaluations (see attached JN assessment for details).

      (2) Convergent evidence from Experiment 3: Our study employed a multi-method approach incorporating three complementary experimental paradigms, each utilizing distinct neurophysiological techniques to provide converging evidence. Specifically, Experiment 3 implemented high-temporal-resolution EEG, which independently replicated the time-sensitive dynamics of gesture-speech integration observed in our double-pulse TMS experiments. The convergence between these methodologically independent approaches - demonstrating consistent temporal staging of IFG-pMTG interactions across both causal (TMS) and correlational (EEG) measures - strengthens the validity and generalizability of our conclusions regarding the neural mechanisms underlying multisensory integration.

      (3) Established precedents in double-pulse TMS literature: The double-pulse TMS methodology employed in our study is firmly grounded in established neuroscience research. As documented in our detailed Response to Editors and Reviewers letter (citing 11 representative studies), dpTMS has been extensively validated for investigating causal temporal dynamics in cortical networks, with demonstrated sensitivity at timescales ranging from 3-60 ms. Particularly relevant precedents include: 1. Teige et al. (2018, Cortex) successfully dissociated functional contributions of anatomically proximal regions (ATL vs. pMTG vs. mid-MTG) using 40-ms-interval double-pulse TMS; 2. Vernet et al. (2015, Cortex) effectively distinguished neural processing in interconnected frontoparietal regions (right IPS vs. DLPFC) using 40-ms double-pulse TMS parameters. The 40-ms inter-pulse interval used in both studies is identical to that employed in our current study.

      (4) Neurophysiological plausibility: The neurophysiological basis for the transient double-pulse TMS effects is well-established through mechanistic studies of TMS-induced cortical inhibition (Romero et al., 2019; Paulus & Rothwell, 2016).

      Taken together, we respectfully submit that our methodology provides robust support for our conclusions.

    1. Author response:

      The following is the authors’ response to the previous reviews

      Reviewer #1 (Public review):

      Summary:

      The authors tried to identify the relationships between gut microbiota, lipid metabolites and the host in type 2 diabetes (T2DM) by using spontaneously developed T2DM in macaques, considered among the best human models.

      Strengths:

      The authors comprehensively compared the gut microbiota and plasma fatty acids between spontaneous T2DM and control macaques, and tried to verify the macaque results in a high-fat diet-fed mouse model.

      Weaknesses:

      The observed multi-omics on macaques can be done on humans, which weakens the conclusion of the manuscript, unless the observation/data on macaques could cover during the onset of T2DM that would be difficult to obtain from humans.

      Regarding the metabolomic analysis of fatty acids, the authors did not include the results obtained from the macaque fecal samples, which should be important considering that the authors claimed the importance of gut microbiota in the pathogenesis of T2DM. Instead, the authors measured palmitic acid in the mouse model and tried to validate their conclusions with that.

      In the murine experiments, palmitic acid-containing diets were fed to mice to induce a diabetic condition, but this does not mimic spontaneous T2DM in macaques, since the authors did not measure (or at least did not show the data for) palmitic acid or other fatty acids in macaque feces; instead, they assumed from blood metabolome data that palmitic acid would be absorbed from the intestine to affect the host metabolism, and added palmitic acid to the diet in the mouse experiments. This involves a probable leap of logic in supporting their conclusions and the title of the study.

      In addition, the authors measured omics data after, but not before, the onset of spontaneous T2DM in macaques. This can reveal microbiota dysbiosis driven purely by disease progression, but does not support the causative effect of gut microbiota on T2DM development that the authors claim.

      We apologize for misunderstanding your point and for failing to address your question regarding macaque fecal metabolomics in our previous response. We performed untargeted metabolomics on macaque feces and did detect palmitic acid (PA), whose level was lower in T2DM macaques than in controls (Table 1). However, the difference in PA level between the two groups was not significant (p = 0.17), which may be attributed to the limitations of untargeted metabolomics for absolute quantification. In addition, we found that many other long-chain fatty acids were down-regulated in the feces of T2DM macaques (Table 1). These results are consistent with our observations in the murine experiments. We examined PA levels in the feces, ileum, and serum of mice and found that the PA level was significantly decreased in fecal samples but increased in the ileum and serum. These findings indicated that, without transplantation of the gut microbiota, the ileum could not absorb PA effectively even at a high concentration of ingested PA. Only mice that received fecal microbiota transplants from T2DM macaques and were fed a high-PA diet showed a significant increase in PA in the ileum and serum alongside a decrease in fecal PA concentration. Both the macaque metabolomics and the mouse experiments suggest that the gut microbiota mediates the absorption of excess PA in the ileum, leading to the accumulation of PA in the serum. In the revised manuscript, we have added the results for all differential metabolites in Table S2.

      Author response table 1.

      Table 1. Differential analysis of palmitic acid and other fatty acids from fecal untargeted metabolomics in macaques.

      Regarding the causative effect of gut microbiota on T2DM development, we agree with the reviewer that, because the omics data were obtained after, but not before, the onset of spontaneous T2DM in macaques, the microbiota dysbiosis is probably driven by disease progression. For this reason, we have changed the title of our manuscript and some of our conclusions, as described in our response below.

      Reviewer #1 (Recommendations for the authors):

      As described above, the data presented does not support the notion that gut microbiota change in T2DM macaques promote the disease - rather it showed the outcome of the disease progression. In addition, the involvement of palmitic acid absorption was only shown in mice but not in macaques. Therefore, the authors should change their title and conclusions to more precisely reflect their observation.

      According to your suggestion, we changed the title and the conclusion to make them more precise and avoid emphasizing the causative effect of gut microbiota on T2DM. The new title is “Multi-omics investigation of spontaneous T2DM macaque emphasizes gut microbiota could up-regulate the absorption of excess palmitic acid in the T2DM progression”. We also revised the wording of the results and conclusions to acknowledge the limitation of our study, “We also revealed the specific structure of gut microbiota that promoted T2DM development by regulating the absorption of excess PA in mice, providing experimental evidence for the functional role of gut microbiota in T2DM pathogenesis.” (Lines 122-125), “In particular, concentrations of PA, palmitoleic acid, and oleic acid were significantly higher in the T2DM group than control group (p<0.05 and VIP>1). The concentration of PA in the plasma of T2DM macaques increased, while the concentration of palmitic acid in the feces decreased (Figures 3F and G, Table S2).” (Lines 228-233), and “Our study confirms the functional role of gut microbiota and PA in the T2DM progression. The microbiota composition, specifically higher abundance of R. gnavus (current name: M. gnavus) and Coprococcus sp., and lower abundance of Treponema, F. succinogenes, Christensenellaceae, and F16, promoted the absorption of excess PA which is important for the development of T2DM. However, in this study, such microbial alterations were detected in macaques after they had developed the disease of T2DM instead of before or onset of T2DM, the causative effect of gut microbiota and their action mechanism on the development of T2DM is worth further investigation.” (Lines 450-458).

    1. Author response:

      The following is the authors’ response to the current reviews.

      Reviewer #2 (Public Review):

      The authors responded that they would lose statistical power by studying RTE subfamilies with limited microarray probes, which is a fair point. However, the suggested analysis could have been conducted using the RNA-seq data they explored in the second round of revision. Choosing not to leverage RNA-seq to increase the granularity of their analysis is a matter of choice. In my opinion, however, the authors could have acknowledged in the discussion that some smaller yet potentially influential RTE species may be masked by their global approach.

      We will add one sentence addressing this in the Version of Record.


      The following is the authors’ response to the original reviews.

      We thank Reviewer #1 for their constructive comments.

      Public Reviews: 

      Reviewer #1 (Public Review): 

      Tsai and Seymen et al. investigate associations between RTE expression and methylation and age and inflammation, using multiple public datasets. Compared to the previous round of review, the text of the manuscript has been polished and the phrasing of several findings has been made clearer and more precise. The authors also provided ample discussion to the prior reviewer comments in their rebuttal, including new analyses. All these changes are in the correct direction, however, I believe that part of the content of the rebuttal should be incorporated in the main text, for reasons that I will outline below. 

      Both reviewers found the reliance on microarray expression data to detract from the study. The authors argued that their choices are supported by existing publications which performed a similar quantification of TE expression using microarray data. It could still be argued that (as far as I can tell) Reichmann et al. used a substantially larger number of probes than this study, as a consequence of starting from different arrays, however, this is a minor point which the authors do not need to address. It is still undeniable that including the validation with RNA-seq data performed in the rebuttal would strengthen the manuscript. I especially believe that many readers would want to see this analysis be prominent in the manuscript, considering that both reviewers independently converged on the issue with microarray expression data. Personally, I would have included an RNA-seq dataset next to the microarray data in the main figures, however, I understand that this would require considerable restructuring and that placing RNAseq data besides array data might be misleading. Instead, I would ask that the authors include their rebuttal figures R1 and R2 as supplementary figures. 

      I would suggest introducing a new paragraph, between the section dedicated to expression data and the one dedicated to DNA methylation, mentioning the issues with microarray data (some of which were mentioned by the reviewers and others by the authors in the discussion and introduction) to then introduce the validation with RNA-seq data. 

      We appreciate the reviewer’s understanding and detailed feedback. As suggested, Author response images 1 and 2 were added as supplementary figures to the manuscript, and one paragraph was added to the section investigating the correlation between RTE expression and chronological age. We have also added new descriptions to the introduction, discussion, and BAR analysis sections.

      Author response image 3 is also a good addition and should be expanded to include the GTP and MESA study and possibly mentioned in the paragraph titled "RTE expression positively correlates with BAR gene signature scores except for SINEs." 

      We have updated Author response image 3 (now Author response image 1) to include the GTP and MESA cohorts in the analysis. As shown in Author response image 1, except for the IFN-I and senescence scores in the MESA cohort, which positively correlate with chronological ageing, the remaining gene signatures display no positive correlation with chronological ageing.

      Author response image 1 was originally created to separate the effect of chronological age and RTE expression on BAR gene signature scores. As it was meant to discriminate between BAR and chronological age, it doesn't provide additional information regarding the positive correlation between RTE expression and BAR gene signature that was not already present in the manuscript. Therefore, we did not add it to the manuscript.

      Author response image 1.

      Generalized linear model (GLM) analysis (BAR gene signature scores ~ RTE expression + chronological age). We performed a separate GLM for each RTE family. Age (RTE family) indicates the chronological age term when used in the design formula for that specific RTE family.
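
      As a minimal illustration of the design formula in this caption (not the analysis code used in the study), the sketch below fits score ~ RTE expression + age by ordinary least squares. It assumes a Gaussian family with identity link, which the response does not state; the function name, variable names, and all numbers are hypothetical.

```python
def fit_linear_model(y, predictors):
    """Ordinary least squares for y ~ intercept + predictors.

    Equivalent to a GLM with Gaussian family and identity link (an assumption).
    predictors: list of equal-length columns.
    Returns [intercept, coef_1, coef_2, ...].
    """
    rows = [[1.0] + [col[i] for col in predictors] for i in range(len(y))]
    k = len(rows[0])
    # Normal equations (X^T X) beta = X^T y, stored as an augmented matrix.
    aug = [
        [sum(r[i] * r[j] for r in rows) for j in range(k)]
        + [sum(r[i] * yi for r, yi in zip(rows, y))]
        for i in range(k)
    ]
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(k):
        pivot = max(range(col, k), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(k):
            if r != col:
                factor = aug[r][col] / aug[col][col]
                aug[r] = [a - factor * b for a, b in zip(aug[r], aug[col])]
    return [aug[i][k] / aug[i][i] for i in range(k)]


# Hypothetical data: one value per individual.
rte_expression = [0.2, 1.1, 0.7, 1.9, 1.4, 0.5]
age = [45.0, 52.0, 61.0, 48.0, 70.0, 66.0]
# Synthetic score built from known coefficients so the fit can be checked.
bar_score = [1.0 + 2.0 * r + 0.05 * a for r, a in zip(rte_expression, age)]

intercept, beta_rte, beta_age = fit_linear_model(bar_score, [rte_expression, age])
```

      Because age enters the model as its own term, beta_rte reflects the association with RTE expression after accounting for age, which is the separation the caption describes.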

      "In this study, we did not compare MESA with GTP etc. We have analysed each dataset separately based on the available data for that dataset. Therefore, sacrificing one analysis because of the lack of information from the other does not make sense. We would do that if we were after comparing different datasets. Moreover, the datasets are not comparable because they were collected from different types of blood samples." 

      Indeed, the datasets are not compared directly, but the associations between age, BAR and TE expression for each dataset are plotted and discussed right next to each other. It is therefore natural to wonder if the differences between datasets are due to differences in the type of blood sample or if they are a consequence of the different probe sets. Using a common set of probes would help answer that question.

      We understand that the reviewer is proposing a method to eliminate possible causes of differences across datasets. However, incorporating such a change would compromise the statistical power of the MESA and GARP cohorts, change the structure of our analysis, and digress from our main focus. Hence, we have chosen not to use an identical set of probes for all three cohorts.

    1. Author response:

      The following is the authors’ response to the previous reviews.

      We thank you for the time you took to review our work and for your feedback! 

      The major changes to the manuscript are:

      (1) We have added visual flow speed and locomotion velocity traces to Figure 5 as suggested.

      (2) We have rephrased the abstract to more clearly indicate that our statement regarding acetylcholine enabling faster switching of internal representations in layer 5 is speculative.

      (3) We have further clarified the positioning of our findings regarding the basal forebrain cholinergic signal in visual cortex in the introduction.

      (4) We have added a video (Video S1) to illustrate different mouse running speeds covered by our data.

      A detailed point-by-point response to all reviewer concerns is provided below.

      Reviewer #1 (Recommendations For The Authors):

      The authors have addressed most of the concerns raised in the initial review. While the paper has been improved, there are still some points of concern in the revised version. 

      Major comments

      (1) Page 1, Line 21: The authors claim, "Our results suggest that acetylcholine augments the responsiveness of layer 5 neurons to inputs from outside of the local network, enabling faster switching between internal representations during locomotion." However, it is not clear which specific data or results support the claim of "switching between internal representations." ... 

      Authors' response: "... That acetylcholine enables a faster switching between internal representations in layer 5 is a speculation. We have attempted to make this clearer in the discussion. ..." 

      In the revised version, there is no new data added to directly support the claim - "Our results suggest acetylcholine ..., enabling faster switching between internal representations during locomotion" (in the abstract). The authors themselves acknowledge that this statement is speculative. The present data only demonstrate that ACh reduces the response latency of L5 neurons to visual stimuli, but not that ACh facilitates quicker transitions in neuronal responses from one visual stimulus to another. To maintain scientific rigor and clarity, I recommend the authors amend this sentence to more accurately reflect the findings. 

      This might be a semantic disagreement? We would argue both a gray screen and a grating are visual stimuli. Hence, we are not sure we understand what the reviewer means by “but not that ACh facilitates quicker transitions in neuronal responses from one visual stimulus to another”. We concur, our data only address one of many possible transitions, but it is a switch between distinct visual stimuli that is sped up by ACh. Nevertheless, we have rephrased the sentence in question by changing “our data suggest” to “based on this we speculate” - but are not sure whether this addresses the reviewer’s concern.  

      (2) Page 4, Line 103: "..., a direct measurement of the activity of cholinergic projection from basal forebrain to the visual cortex during locomotion has not been made." This statement is incorrect. An earlier study by Reimer et al. indeed imaged cholinergic axons in the visual cortex of mice running on a wheel. 

      Authors' response: "We have clarified this as suggested. However, we disagree slightly with the reviewer here. The key question is whether the cholinergic axons imaged originate in basal forebrain. While Reimer et al. 2016 did set out to do this, we believe a number of methodological considerations prevent this conclusion: ... Collins et al. 2023 inject more laterally and thus characterize cholinergic input to S1 and A1, ..."

      The authors pointed out some methodological caveats in previous studies that measured the BF input in V1, and I agree with them on several points. Nonetheless, the statement that "a direct measurement of the activity of cholinergic projection from basal forebrain to visual cortex during locomotion has not been made. ... Prior measurements of the activity of cholinergic axons in visual cortex have all relied on data from a cross of ChAT-Cre mice with a reporter line ..." (Page 4, Line 103) seems to be an oversimplification. In fact, contrary to what the authors noted, Collins et al. (2023) conducted direct imaging of BF cholinergic axons in V1 (Fig. 1) - "Selected axon segments were chosen from putative retrosplenial, somatosensory, primary and secondary motor, and visual cortices". They used a viral approach to express GCaMP in BF axons to bypass the limitations associated with the use of a GCaMP reporter mouse line - "Viral injections were used for BF- ACh studies to avoid imaging axons or dendrites from cholinergic projections not arising from the BF (e.g. cortical cholinergic interneurons)." The authors should reconsider the text. 

      The reason we think that our statement here was – while simplified – accurate, is that Collins et al. do record from cholinergic axons in V1, but they don’t show these data (they only show pooled data across all recording sites). By superimposing the recording locations of the Collins paper on the Allen mouse brain atlas (Figure R1), we estimate that of the approximately 50 recording sites, most are in somatosensory and somatomotor areas of cortex, and only 1 appears to be in V1, something that is often missed as it is not really highlighted in that paper. If this is indeed correct, we would argue that the data in the Collins et al. paper are not representative of cholinergic activity in visual cortex (we fear only the authors would know for sure). Nevertheless, we have rephrased again.

      Author response image 1.

      Overlay of the Collins et al. imaging sites (red dots, black outline and dashed circle) on the Allen mouse brain atlas (green shading). Very few (we estimate that it was only 1) of the recording sites appear to be in V1 (the lightest green area), and maybe an additional 4 appear to be in secondary visual areas.  

      Minor comments

      (1) It is unclear which BF subregion(s) were targeted in this study. 

      Authors' response: Thanks for pointing this out. We targeted the entire basal forebrain (medial septum, vertical and horizontal limbs of the diagonal band, and nucleus basalis) with our viral injections. ... We have now added the labels for basal forebrain subregions targeted next to the injection coordinates in the manuscript. 

      The authors provided the coordinates for their virus injections targeting the BF subregions - "(AP, ML, DV (in mm): ... ; +0.6, +0.6, -4.9 (nucleus basalis) ..." Is this the right coordinates for the nucleus basalis? 

      Thank you for catching this - this was indeed incorrect. The coordinates were correct, but our annotation of brain region was not (as the reviewer correctly points out, these coordinates are in the horizontal limb of the diagonal band, not the nucleus basalis). We have corrected this.

      Reviewer #2 (Recommendations For The Authors):

      Thank you for addressing most of the points raised in my original review. I still have some concerns relating to the analysis of the data.

      (1) I appreciate the authors' point that getting mice to run reliably during head-fixed recordings can require training. Since mice in this study were not trained to run, their low speed of locomotion limits the interpretation of the results. I think this is an important potential caveat and I have retained it in the public review.

      This might be a misunderstanding. The Jordan paper was a bit of an outlier in that we needed mice to run at very high rates due to the fact that our recording times were only minutes. Mice were chosen such that they would more or less continuously run, to maximize the likelihood that they would run during the intracellular recordings. This was what we tried to convey in our previous response. The speed range covered by the analysis in this paper is 0 cm/s to 36 cm/s. 36 cm/s is not far away from the top speed mice can reach on this treadmill (30 cm/s is 1 revolution of the treadmill per second). In our data, the top speed we measured across all mice was 36 cm/s. In the Jordan paper, the peak running speed across the entire dataset was 44 cm/s. Based on the reviewer’s comment, we suspect that the reviewer may be under the impression that 30 cm/s is a relatively slow running speed. To illustrate what this looks like, we have added a video (Video S1) showing different running speeds.

      (2) The majority of the analyses in the revised manuscript focus on grand average responses, which may mask heterogeneity in the underlying neural populations. This could be addressed by analysing the magnitude and latency of responses for individual neurons. For example, if I understand correctly, the analyses include all neurons, whether or not they are activated, inhibited, or unaffected by visual stimulation and locomotion. For example, while on average layer 2/3 neurons are suppressed by the grating stimulus (Figure 4A), presumably a subset are activated. Evaluating the effects of optogenetic stimulation and locomotion without analyzing them at the level of individual neurons could result in misleading conclusions. This could be presented in the form of a scatter plot, depicting the magnitude of neuronal responses in locomotion vs stationary condition, and opto+ vs no opto conditions.

      We might be misunderstanding. The first part of the comment is a bit too unspecific to address directly. In cases in which we find the variability is relevant to our conclusions, we do show this for individual cells (e.g. the latencies to running onset are shown as histograms for all cells and axons in Figure S1). It is also unclear to us what the reviewer means by “Evaluating the effects of optogenetic stimulation and locomotion without analyzing them at the level of individual neurons could result in misleading conclusions”. Our conclusions relate to the average responses in L2/3, consistent with the analysis shown. All data will be freely available for anyone to perform follow-up analysis of things we may have missed. For example, the scatter plot of the data in Figure 4 that the reviewer specifically suggests is shown below (Figure R2). This is something we had looked at but found not to be relevant to our conclusions. The problem with this analysis is that it is difficult to estimate how much the different sources of variability contribute to the total variability observed in the data, and no interesting pattern is clearly apparent. All relevant and clear conclusions are already captured by the mean differences shown in Figure 4.

      Author response image 2.

      Optogenetic activation of cholinergic axons in visual cortex primarily enhances responses of layer 5, but not layer 2/3 neurons. Related to Figure 4. (A) Average calcium response of layer 2/3 neurons in visual cortex to full field drifting grating in the absence or presence of locomotion. Each dot is the average calcium activity of an individual neuron during the two conditions. (B) As in A, but for layer 5 neurons. (C) As in A, but comparing the average response while the mice were stationary, to that while cholinergic axons were optogenetically stimulated. (D) As in C, but for layer 5 neurons. (E) Average calcium response of layer 2/3 neurons in visual cortex to visuomotor mismatch, without and with optogenetic stimulation of cholinergic axons in visual cortex. (F) As in E, but for layer 5 neurons. (G) Average calcium response of layer 2/3 neurons in visual cortex to locomotion onset in closed loop, without and with optogenetic stimulation of cholinergic axons in visual cortex. (H) As in G, but for layer 5 neurons.

      (3) To help the reader understand the experimental conditions in open loop experiments, please include average visual flow speed traces for each condition in Figure 5. 

      We have added the locomotion velocity and visual flow speeds to the corresponding conditions in Figure

    1. Author response:

      The following is the authors’ response to the original reviews.

      Public Reviews:

      Reviewer #1:

      Summary:

      The work by Combrisson and colleagues investigates the degree to which reward and punishment learning signals overlap in the human brain using intracranial EEG recordings. The authors used information theory approaches to show that local field potential signals in the anterior insula and the three sub regions of the prefrontal cortex encode both reward and punishment prediction errors, albeit to different degrees. Specifically, the authors found that all four regions have electrodes that can selectively encode either the reward or the punishment prediction errors. Additionally, the authors analyzed the neural dynamics across pairs of brain regions and found that the anterior insula to dorsolateral prefrontal cortex neural interactions were specific for punishment prediction errors whereas the ventromedial prefrontal cortex to lateral orbitofrontal cortex interactions were specific to reward prediction errors. This work contributes to the ongoing efforts in both systems neuroscience and learning theory by demonstrating how two differing behavioral signals can be differentiated to a greater extent by analyzing neural interactions between regions as opposed to studying neural signals within one region.

      Strengths:

      The experimental paradigm incorporates both a reward and punishment component that enables investigating both types of learning in the same group of subjects allowing direct comparisons.

      The use of intracranial EEG signals provides much needed insight into the timing of when reward and punishment prediction error signals emerge in the studied brain regions.

      Information theory methods provide important insight into the interregional dynamics associated with reward and punishment learning and allow the authors to assess that reward versus punishment learning can be better dissociated based on interregional dynamics over local activity alone.

      We thank the reviewer for this accurate summary. Please find below our answers to the weaknesses raised by the reviewer.

      Weaknesses:

      The analysis presented in the manuscript focuses solely on gamma band activity. The presence and potential relevance of other frequency bands is not discussed. It is possible that slow oscillations, which are thought to be important for coordinating neural activity across brain regions could provide additional insight.

      We thank the reviewer for pointing us to this missing discussion in the first version of the manuscript. We have now made this point clearer in the Methods sections entitled “iEEG data analysis” and “Estimate of single-trial gamma-band activity”:

      “Here, we focused solely on broadband gamma for three main reasons. First, it has been shown that the gamma band activity correlates with both spiking activity and the BOLD fMRI signals (Lachaux et al., 2007; Mukamel et al., 2004; Niessing et al., 2005; Nir et al., 2007), and it is commonly used in MEG and iEEG studies to map task-related brain regions (Brovelli et al., 2005; Crone et al., 2006; Vidal et al., 2006; Ball et al., 2008; Jerbi et al., 2009; Darvas et al., 2010; Lachaux et al., 2012; Cheyne and Ferrari, 2013; Ko et al., 2013). Therefore, focusing on the gamma band facilitates linking our results with the fMRI and spiking literatures on probabilistic learning. Second, single-trial and time-resolved high-gamma activity can be exploited for the analysis of cortico-cortical interactions in humans using MEG and iEEG techniques (Brovelli et al., 2015; 2017; Combrisson et al., 2022). Finally, while previous analyses of the current dataset (Gueguen et al., 2021) reported an encoding of PE signals at different frequency bands, the power in lower frequency bands was shown to carry redundant information compared to the gamma band power.”

      The data is averaged across all electrodes which could introduce biases if some subjects had many more electrodes than others. Controlling for this variation in electrode number across subjects would ensure that the results are not driven by a small subset of subjects with more electrodes.

      We thank the reviewer for raising this important issue. We would like to point out that neither the gamma activity nor the measures of connectivity were averaged across bipolar recordings within an area. Instead, we used a statistical approach proposed in a previous paper that combines non-parametric permutations with measures of information (Combrisson et al., 2022). As we explain in the “Statistical analysis” section, mutual information (MI) is estimated between PE signals and single-trial modulations in gamma activity separately for each contact (or for each pair of contacts). Then, a one-sample t-test is computed across all of the recordings of all subjects to form the effect size at the group-level. We address the point about the number of electrodes in our answer below.

      The potential variation in reward versus punishment learning across subjects is not included in the manuscript. While the time course of reward versus punishment prediction errors is symmetrical at the group level, it is possible that some subjects show faster learning for one versus the other type which can bias the group average. Subject level behavioral data along with subject level electrode numbers would provide more convincing evidence that the observed effects are not arising from these potential confounds.

      We thank the reviewer for the two points raised. We performed additional analyses at the single-participant level to address the issues raised by the reviewer. We should note, however, that these results are descriptive and cannot be generalized to account for population-level effects. As suggested by the reviewer, we prepared two new figures. The first supplementary figure summarizes the number of participants that had iEEG contacts per brain region and pair of brain regions (Fig. S1A in the Appendix). It can be seen that the number of participants sampled in different brain regions is relatively constant (left panel) and the number of participants with pairs of contacts across brain regions is relatively homogeneous, ranging from 7 to 11 (right panel). Fig. S1B shows the number of bipolar derivations per subject and per brain region.

      Author response image 1.

      Single-subject anatomical repartition. (A) Number of unique subjects per brain region and per pair of brain regions. (B) Number of bipolar derivations per subject and per brain region.

      The second supplementary figure describes the estimated prediction error for rewarding and punishing trials for each subject (Fig. S2). The single-subject error bars represent the 95th percentile confidence interval estimated using a bootstrap approach across the different pairs of stimuli presented during the three to six sessions. As the reviewer anticipated, there are indeed variations across subjects, but we observe that RPE and PPE are relatively symmetrical, even at the subject level, and tend toward zero around trial number 10. These results therefore corroborate the patterns observed at the group-level.

      Author response image 2.

      Single-subject estimation of prediction errors. Single-subject trial-wise reward PE (RPE - blue) and punishment PE (PPE - red), ± 95% confidence interval.
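
      For readers unfamiliar with bootstrap confidence intervals such as the one in this figure, the following is a minimal sketch of a percentile bootstrap for a mean. It is not the code used for the figure; the function name, the number of resamples, and the example values are hypothetical.

```python
import random
from statistics import mean


def bootstrap_ci(values, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for the mean of `values`."""
    rng = random.Random(seed)
    n = len(values)
    # Resample with replacement and record the mean of each resample.
    boot_means = sorted(mean(rng.choices(values, k=n)) for _ in range(n_boot))
    lower = boot_means[int(round((alpha / 2) * n_boot))]
    upper = boot_means[int(round((1 - alpha / 2) * n_boot)) - 1]
    return lower, upper


# Hypothetical prediction errors at one trial position,
# one value per stimulus pair and session for a single subject.
pe_at_trial = [0.42, 0.35, 0.51, 0.28, 0.47, 0.39, 0.44, 0.31]
low, high = bootstrap_ci(pe_at_trial)
```

      Repeating this for every trial position would give a trial-wise band around each single-subject curve.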

      Finally, to assess the variability of local encoding of prediction errors across participants, we quantified the proportion of subjects having at least one significant bipolar derivation encoding either the RPE or PPE (Fig. S4). As expected, we found various proportions of unique subjects with significant R/PPE encoding per region. The lowest proportion was achieved in the ventromedial prefrontal cortex (vmPFC) and lateral orbitofrontal cortex (lOFC) for encoding PPE and RPE, respectively, with approximately 30% of the subjects having the effect. Conversely, we found highly reproducible encodings in the anterior insula (aINS) and dorsolateral prefrontal cortex (dlPFC) with a maximum of 100% of the 9 subjects having at least one bipolar derivation encoding PPE in the dlPFC.

      Author response image 3.

      Taken together, we acknowledge a certain variability per region and per condition. Nevertheless, the results presented in the supplementary figures suggest that the main results do not arise from a minority of subjects.

      We would like to point out that in order to assess across-subject variability, a much larger number of participants would have been needed, given the low signal-to-noise ratios observed at the single-participant level. We thus prefer to add these results as supplementary material in the Appendix, rather than in the main text.

      It is unclear if the findings in Figures 3 and 4 truly reflect the differential interregional dynamics in reward versus punishment learning or if these results arise as a statistical byproduct of the reward vs punishment bias observed within each region. For instance, the authors show that information transfer from anterior insula to dorsolateral prefrontal cortex is specific to punishment prediction error. However, both anterior insula and dorsolateral prefrontal cortex have higher prevalence of punishment prediction error selective electrodes to begin with. Therefore the findings in Fig 3 may simply be reflecting the prevalence of punishment specificity in these two regions above and beyond a punishment specific neural interaction between the two regions. Either mathematical or analytical evidence that assesses if the interaction effect is simply reflecting the local dynamics would be important to make this result convincing.

      This is an important point that we partly addressed in the manuscript. More precisely, we investigated whether the synergistic effects observed between the dlPFC and vmPFC encoding global PEs (Fig. 5) could be explained by their respective local specificity. Indeed, since we reported larger proportions of recordings encoding the PPE in the dlPFC and the RPE in the vmPFC (Fig. 2B), we checked whether the synergy between dlPFC and vmPFC could be mainly due to complementary roles where the dlPFC brings information about the PPE only and the vmPFC brings information about the RPE only. To address this point, we selected PPE-specific bipolar derivations from the dlPFC and RPE-specific ones from the vmPFC and, as the reviewer predicted, we found synergistic II between the two regions, probably mainly because of their respective specificity. In addition, we included the II estimated between non-selective bipolar derivations (i.e. recordings with significant encoding for both RPE and PPE) and we observed synergistic interactions (Fig. 5C and Fig. S9). Taken together, the local specificity certainly plays a role, but it is not the only factor defining the type of interactions.

      Concerning the interaction information results (II, Fig. 3), several lines of evidence suggest that local specificity alone cannot account for the II effects. For example, the local specificity for PPE is observed across all four areas (Fig. 2A) and the percentage of bipolar derivations displaying an effect is large (equal to or above 10%) for three brain regions (aINS, dlPFC and lOFC). If the local specificity were the main driving cause, we would have observed significant redundancy between all pairs of brain regions. On the other hand, the interaction between the aINS and lOFC displayed no significant redundant effect (Fig. 3B). Another example is the result observed in lOFC: approximately 30% of bipolar derivations display a selectivity for PPE (Fig. 2B, third panel from the left), but do not show clear signs of redundant encoding at the level of within-area interactions (Fig. 3A, bottom-left panel). Similarly, the local encoding for RPE is observed across all four brain regions (Fig. 2A) and the percentage of bipolar derivations displaying an effect is large (equal to or above 10%) for three brain regions (aINS, dlPFC and vmPFC). Nevertheless, significant between-region interactions have been observed only between the lOFC and vmPFC (Fig. 3B bottom right panel).

      To further support the reasoning, we performed a simulation to show that it is possible to observe synergistic interactions between two regions with the same specificity. As an example, we may consider one region locally encoding early trials of RPE and a second region encoding the late trials of the RPE. Combining the two with the II would lead to synergistic interactions, because each one of them carries information that is not carried by the other. To illustrate this point, we simulated the data of two regions (x and y). To simulate redundant interactions (first row), each region receives a copy of the prediction (one-to-all) and for the synergy (second row), x and y receive early and late PE trials, respectively (all-to-one). This toy example illustrates that the local specificity is not the only factor determining the type of their interactions. We added the following result to the Appendix.

      Author response image 4.

      Local specificity does not fully determine the type of interactions. Within-area local encoding of PE using the mutual information (MI, in bits) for regions X and Y and between-area interaction information (II, in bits) leading to (A) redundant interactions and (B) synergistic interactions about the PE
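
      The distinction between redundant and synergistic interactions can also be illustrated with a textbook discrete example. The sketch below is not the simulation shown in the figure: it uses binary variables and a plug-in estimator, and it assumes the sign convention II = I(X,Y;S) - I(X;S) - I(Y;S), under which negative values indicate redundancy and positive values indicate synergy. All names are hypothetical.

```python
from collections import Counter
from itertools import product
from math import log2


def mutual_information(pairs):
    """Plug-in mutual information (bits) from a list of equiprobable (a, b) samples."""
    n = len(pairs)
    joint = Counter(pairs)
    count_a = Counter(a for a, _ in pairs)
    count_b = Counter(b for _, b in pairs)
    return sum(
        (c / n) * log2(c * n / (count_a[a] * count_b[b]))
        for (a, b), c in joint.items()
    )


def interaction_information(samples):
    """II = I(X,Y;S) - I(X;S) - I(Y;S) for a list of (x, y, s) samples."""
    mi_joint = mutual_information([((x, y), s) for x, y, s in samples])
    mi_x = mutual_information([(x, s) for x, _, s in samples])
    mi_y = mutual_information([(y, s) for _, y, s in samples])
    return mi_joint - mi_x - mi_y


# Redundancy: both regions carry an identical copy of the signal s.
redundant = [(s, s, s) for s in (0, 1)]
# Synergy: s is only recoverable from x and y together (s = x XOR y).
synergistic = [(x, y, x ^ y) for x, y in product((0, 1), repeat=2)]

ii_redundant = interaction_information(redundant)
ii_synergistic = interaction_information(synergistic)
```

      In the XOR case each variable alone carries no information about s, yet the pair determines it fully, which is the defining feature of a synergistic interaction.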

      Regarding the information transfer results (Fig. 4), similar arguments hold and suggest that the prevalence is not the main factor explaining the transfer entropy arising between the anterior insula (aINS) and dorsolateral prefrontal cortex (dlPFC). Indeed, the lOFC has a strong local specificity for PPE, but the transfer entropy between the lOFC and aINS (or dlPFC), shown in Fig. S7, does not show significant differences in encoding between PPE and RPE.

      Indeed, such transfer can only be found when there is a delay between the gamma activity of the two regions. In this example, the transfer entropy quantifies the amount of information shared between the past activity of the aINS and the present activity of the dlPFC conditioned on the past activity of the dlPFC. The conditioning ensures that the present activity of the dlPFC is not only explained by its own past. Consequently, if both regions exhibit various prevalences toward reward and punishment but without delay (i.e. at the same timing), the transfer entropy would be null because of the conditioning. In fact, between 10 and 20% of bipolar recordings show a selectivity to the reward PE (represented by a proportion of 40-60% of subjects, Fig. S4). However, the transfer entropy estimated from the aINS to the dlPFC across rewarding trials is flat and clearly non-significant. If the transfer entropy were a byproduct of the local specificity, we should observe an increase, which is not the case here.
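
      The role of the delay and of the conditioning can be made concrete with a toy computation. The sketch below is only an illustration under simplifying assumptions (binary signals, histories of a single time step, a plug-in estimator) and is not the estimator used in the manuscript; all names are hypothetical.

```python
import random
from collections import Counter
from math import log2


def transfer_entropy(source, target):
    """Plug-in transfer entropy (bits) with one-step histories.

    Computes I(target_t ; source_{t-1} | target_{t-1}).
    """
    triples = list(zip(target[1:], source[:-1], target[:-1]))
    n = len(triples)
    c_abc = Counter(triples)
    c_ac = Counter((a, c) for a, _, c in triples)
    c_bc = Counter((b, c) for _, b, c in triples)
    c_c = Counter(c for _, _, c in triples)
    return sum(
        (k / n) * log2(k * c_c[c] / (c_ac[(a, c)] * c_bc[(b, c)]))
        for (a, b, c), k in c_abc.items()
    )


rng = random.Random(0)
x = [rng.randint(0, 1) for _ in range(5000)]

# Delayed coupling: y copies x with a lag of one sample.
y_delayed = [0] + x[:-1]
# Simultaneous coupling: y equals x at the same time point.
y_simultaneous = list(x)

te_delayed = transfer_entropy(x, y_delayed)
te_simultaneous = transfer_entropy(x, y_simultaneous)
```

      With the lag, the past of x predicts the present of y beyond what the past of y already explains, so the estimate is close to 1 bit. Without the lag, the past of x is identical to the past of y, and the conditioning removes all shared information, giving exactly zero.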

      Reviewer #2:

      Summary:

      Reward and punishment learning have long been seen as emerging from separate networks of frontal and subcortical areas, often studied separately. Nevertheless, both systems are complementary and distributed representations of rewards and punishments have been repeatedly observed within multiple areas. This raised the unsolved question of the possible mechanisms by which both systems might interact, which this manuscript went after. The authors skillfully leveraged intracranial recordings in epileptic patients performing a probabilistic learning task combined with model-based information theoretical analyses of gamma activities to reveal that information about reward and punishment was not only distributed across multiple prefrontal and insular regions, but that each system showed specific redundant interactions. The reward subsystem was characterized by redundant interactions between orbitofrontal and ventromedial prefrontal cortex, while the punishment subsystem relied on insular and dorsolateral redundant interactions. Finally, the authors revealed a way by which the two systems might interact, through synergistic interaction between ventromedial and dorsolateral prefrontal cortex.

      Strengths:

      Here, the authors performed an excellent reanalysis of a unique dataset using innovative approaches, pushing our understanding on the interaction at play between prefrontal and insular cortex regions during learning. Importantly, the description of the methods and results is truly made accessible, making it an excellent resource to the community.

      This manuscript goes beyond what is classically performed using intracranial EEG dataset, by not only reporting where a given information, like reward and punishment prediction errors, is represented but also by characterizing the functional interactions that might underlie such representations. The authors highlight the distributed nature of frontal cortex representations and propose new ways by which the information specifically flows between nodes. This work is well placed to unify our understanding of the complementarity and specificity of the reward and punishment learning systems.

      We thank the reviewer for the positive feedback. Please find below our answers to the weaknesses raised by the reviewer.

      Weaknesses:

      The conclusions of this paper are mostly supported by the data, but whether the findings are entirely generalizable would require further information/analyses.

      First, the authors found that prediction errors very quickly converge toward 0 (less than 10 trials) while subjects performed the task for sets of 96 trials. Considering all trials, and therefore having a non-uniform distribution of prediction errors, could potentially bias the various estimates the authors are extracting. Separating trials between learning (at the start of a set) and exploiting periods could prove that the observed functional interactions are specific to the learning stages, which would strengthen the results.

      We thank the reviewer for this question. We would like to note that the probabilistic nature of the learning task does not allow a strict distinction between the exploration and exploitation phases. Indeed, the probability of obtaining the less rewarding outcome was 25% (i.e., for 0€ gain in the reward learning condition and -1€ loss in the punishment learning condition). Thus, participants tended to explore even during the last set of trials in each session. This is evident from the average learning curves shown in Fig. 1B of (Gueguen et al., 2021). Learning curves show rates of correct choice (75% chance of 1€ gain) in the reward condition (blue curves) and incorrect choice (75% chance of 1€ loss) in the punishment condition (red curves).

      Regarding the evolution of the PEs, as suggested by reviewer #1, we added a new figure showing the single-subject estimates of the R/PPE (Fig. S2). Here, the confidence interval is obtained across all pairs of stimuli presented during the different sessions. We recovered the general trend of the R/PPE converging toward zero within around 10 trials. Although both average reward and punishment prediction errors converge toward zero in approximately 10 trials, single-participant curves display large variability, including at the end of each session. As a reminder, the 96 trials represent the total number of trials in one session across the four pairs of stimuli, and the number of trials for each pair was only 24.

      Author response image 5.

      Single-subject estimation of prediction errors. Single-subject trial-wise reward PE (RPE - blue) and punishment PE (PPE - red), ± 95% confidence interval

      However, the convergence of the R/PPE is due to the average across the pairs of stimuli. In the figure below, we superimposed the estimated R/PPE, per pair of stimuli, for each subject. It becomes very clear that high values of PE can be reached, even for late trials. Therefore, we believe that the split into early/late trials because of the convergence of PE is far from being trivial.

      Author response image 6.

      Single-subject estimation of prediction errors per pair of stimuli. Single-subject trial-wise reward PE (RPE - blue) and punishment PE (PPE - red)

      Consequently, nonzero RPE and PPE occur throughout the whole session, and separating trials between learning (at the start of a set) and exploiting periods, as suggested by the reviewer, does not allow a strict dissociation between learning and no learning. Nevertheless, we tested the analysis proposed by the reviewer at the local level. We split the 24 trials of each pair of stimuli into early, middle and late trials (8 trials each). We then reproduced Fig. 2 by computing the mutual information between the gamma activity and the R/PPE for subsets of trials: early (first row) and late trials (second row). We found significant encoding of both R/PPE in the aINS, dlPFC and lOFC in both early and late trials. The vmPFC also showed significant encoding of both during early trials. The only difference emerged in the late trials of the vmPFC, where we found a strong encoding of the RPE only. It should also be noted that, since we are sub-selecting the trials here, the statistical analyses are performed using only a third of the trials.
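      The trial split described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `split_trials` and the use of plain trial indices are hypothetical and are not taken from the analysis code of the study.

```python
def split_trials(trials, n_blocks=3):
    """Split an ordered list of trials into consecutive blocks of equal size.

    Hypothetical helper for illustration. If the number of trials is not a
    multiple of n_blocks, the trailing remainder is dropped.
    """
    size = len(trials) // n_blocks
    return [trials[i * size:(i + 1) * size] for i in range(n_blocks)]


# 24 trials for one pair of stimuli, indexed in presentation order
early, middle, late = split_trials(list(range(24)))
```

      With 24 trials per pair, each block holds 8 trials, so any statistic computed on the early or late block alone relies on a third of the data.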

      Taken together, the combination of high values of PE achieved even for late trials and the fact that most of the findings are reproduced even with a third of the trials does not justify the split into early and late trials here. Crucially, this latest analysis confirms that the neural correlates of learning that we observed reflect PE signals rather than early versus late trials in the session.

      Author response image 7.

      MI between gamma activity and R/PPE using early and late trials. Time courses of MI estimated between the gamma power and both RPE (blue) and PPE (red) using either early or late trials (first and second row, respectively). Horizontal thick lines represent significant clusters of information (p<0.05, cluster-based correction, non-parametric randomization across epochs).

      Importantly, it is unclear whether the results described are a common feature observed across subjects or the results of a minority of them. The authors should report and assess the reliability of each result across subjects. For example, the authors found RPE-specific interactions between vmPFC and lOFC, even though less than 10% of sites represent RPE or both RPE/PPE in lOFC. It is questionable whether such a low proportion of sites might come from different subjects, and therefore whether the interactions observed are truly observed in multiple subjects. The nature of the dataset obviously precludes from requiring all subjects to show all effects (given the known limits inherent to intracerebral recording in patients), but it should be proven that the effects were reproducibly seen across multiple subjects.

      We thank the reviewer for this remark, which was also raised by the first reviewer. We added a supplementary figure describing the number of unique subjects per brain region and per pair of brain regions (Fig. S1A) as well as the number of bipolar derivations per region and per subject (Fig. S1B).

      Author response image 8.

      Single subject anatomical repartition. (A) Number of unique subjects per brain region and per pair of brain regions (B) Number of bipolar derivations per subject and per brain region

      Regarding the reproducibility of the results across subjects for the local analysis (Fig. 2), we also added the instantaneous proportion of subjects having at least one bipolar derivation showing a significant encoding of the RPE and PPE (Fig. S4). We found a minimum proportion of approximately 30% of unique subjects having the effect in the lOFC for the RPE and in the vmPFC for the PPE. On the other hand, in both the aINS and dlPFC, between 50 and 100% of the subjects showed the effect. Therefore, local encoding of RPE and PPE was never driven by a single subject.
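      As a rough illustration of how such an instantaneous proportion can be computed, the Python sketch below counts, at each time point, the fraction of subjects with at least one significant bipolar derivation. The data layout (a dictionary of per-derivation boolean time courses) and the function name are assumptions made for the example, not the structure used in the study.

```python
def proportion_of_subjects(significant):
    """Fraction of subjects with at least one significant site per time point.

    significant: {subject_id: [site_0, site_1, ...]}, where each site is a
    list of booleans (one per time point) marking significant encoding.
    This layout is hypothetical and chosen for illustration only.
    """
    subjects = list(significant)
    n_times = len(significant[subjects[0]][0])
    proportions = []
    for t in range(n_times):
        # A subject counts once, however many of its sites are significant
        n_with_effect = sum(
            any(site[t] for site in significant[s]) for s in subjects
        )
        proportions.append(n_with_effect / len(subjects))
    return proportions
```

      Counting each subject at most once prevents a participant with many recording sites in a region from inflating the proportion.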

      Author response image 9.

      Similarly, we performed statistical analysis on interaction information at the single-subject level and counted the proportion of unique subjects having at least one pair of recordings with significant redundant and synergistic interactions about the RPE and PPE (Fig. S5). Consistent with the results shown in Fig. 3, the proportions of significant redundant and synergistic interactions are displayed as negative and positive values, respectively. For the within-region interactions, approximately 60% of the subjects have redundant interactions about the R/PPE in the aINS and about the PPE in the dlPFC, and 40% about the RPE in the vmPFC. For the across-region interactions, 60% of the subjects have redundant interactions between the aINS-dlPFC and dlPFC-lOFC about the PPE, and 30% have redundant interactions between lOFC-vmPFC about the RPE. Overall, we reproduced the main results shown in Fig. 3.

      Author response image 10.

      Inter-subjects reproducibility of redundant interactions about PE signals. Time-courses of proportion of subjects having at least one pair of bipolar derivation with a significant interaction information (p<0.05, cluster-based correction, non-parametric randomization across epochs) about the RPE (blue) or PPE (red). Data are aligned to the outcome presentation (vertical line at 0 seconds). Proportion of subjects with redundant (solid) and synergistic (dashed) interactions are respectively going downward and upward.
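      For readers less familiar with the sign convention, the following Python sketch computes interaction information on toy discrete variables, with negative values indicating redundancy and positive values indicating synergy. It is a minimal illustration that assumes discrete data and a plug-in entropy estimate; the study works with continuous gamma power and a different estimator, and all function names here are hypothetical.

```python
import math
from collections import Counter


def entropy(samples):
    """Plug-in Shannon entropy (bits) of a sequence of hashable symbols."""
    n = len(samples)
    return -sum(c / n * math.log2(c / n) for c in Counter(samples).values())


def mutual_info(a, b):
    """I(A; B) = H(A) + H(B) - H(A, B), in bits."""
    return entropy(a) + entropy(b) - entropy(list(zip(a, b)))


def interaction_information(x1, x2, y):
    """II = I(X1, X2; Y) - I(X1; Y) - I(X2; Y).

    Negative: the two sources carry overlapping (redundant) information
    about y. Positive: they carry more information jointly than separately
    (synergy).
    """
    joint = list(zip(x1, x2))
    return mutual_info(joint, y) - mutual_info(x1, y) - mutual_info(x2, y)


# Toy cases: y as the XOR of two sources (synergy) or a copy of both (redundancy)
x1, x2 = [0, 0, 1, 1], [0, 1, 0, 1]
ii_synergy = interaction_information(x1, x2, [a ^ b for a, b in zip(x1, x2)])
ii_redundancy = interaction_information(x1, x1, x1)
```

      In the XOR case neither source alone says anything about the target, yet together they determine it, which gives a positive value; when both sources are copies of the target, the value is negative.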

      Finally, the timings of the observed interactions between areas preclude one of the authors' main conclusions. Specifically, the authors repeatedly concluded that the encoding of RPE/PPE signals are "emerging" from redundancy-dominated prefrontal-insular interactions. However, the between-region information and transfer entropy between vmPFC and lOFC for example is observed almost 500ms after the encoding of RPE/PPE in these regions, questioning how it could possibly lead to the encoding of RPE/PPE. It is also noteworthy that the two information measures, interaction information and transfer entropy, between these areas happened at non overlapping time windows, questioning the underlying mechanism of the communication at play (see Figures 3/4). As an aside, when assessing the direction of information flow, the authors also found delays between pairs of signals peaking at 176ms, far beyond what would be expected for direct communication between nodes. Discussing this aspect might also be of importance as it raises the possibility of third-party involvement.

      The local encoding of RPE in the vmPFC and lOFC is observed in a time interval ranging from approximately 0.2-0.4s to 1.2-1.4s after outcome presentation (blue bars in Fig. 2A). The encoding of RPE by interaction information covers a time interval from approximately 1.1s to 1.5s (blue bars in Fig. 3B, bottom right panel). Similarly, significant TE modulations between the vmPFC and lOFC specific for RPE occur mainly in the 0.7s-1.1s range. Thus, it seems that the local encoding of RPE precedes the effects observed at the level of the neural interactions (II and TE). On the other hand, the modulations in MI, II and TE related to PPE co-occur in a time window from 0.2s to 0.7s after outcome presentation. Thus, we agree with the reviewer that a generic conclusion about the potential mechanisms relating the three levels of analysis cannot be drawn. We therefore replaced the term “emerge from”, which may be misinterpreted as hinting at a potential mechanism, with “occur with” in the manuscript. We nevertheless concluded that the three levels of analysis (and phenomena) co-occur in time, hinting at a potential across-scales interaction whose study is beyond the scope of the current work.

      Regarding the delay for the conditioning of the transfer entropy, the value of 176 ms reflects the delay at which we observed a maximum of transfer entropy. However, we did not use a single delay for conditioning; we used every possible delay between [116, 236] ms, as explained in the Methods section. We would like to stress that transfer entropy is a directed metric of functional connectivity, and it can only be interpreted as quantifying statistical causality defined in terms of predictability according to the Wiener-Granger principle, as detailed in the Methods. Thus, it cannot be interpreted in Pearl’s causal terms or as indexing any type of direct communication between nodes. This is a known limitation of the method, which has been stressed in past literature and which we believe does not need to be addressed here.

      To account for this, we revised the discussion to make sure this issue is addressed in the following paragraph:

      “Here, we quantified directional relationships between regions using the transfer entropy (Schreiber, 2000), which is a functional connectivity measure based on the Granger-Wiener causality principle. Tract tracing studies in the macaque have revealed strong interconnections between the lOFC and vmPFC in the macaque (Carmichael and Price, 1996; Öngür and Price, 2000). In humans, cortico-cortical anatomical connections have mainly been investigated using diffusion magnetic resonance imaging (dMRI). Several studies found strong probabilities of structural connectivity between the anterior insula with the orbitofrontal cortex and dorsolateral part of the prefrontal cortex (Cloutman et al., 2012; Ghaziri et al., 2017), and between the lOFC and vmPFC (Heather Hsu et al., 2020). In addition, the statistical dependency (e.g. coherence) between the LFP of distant areas could be potentially explained by direct anatomical connections (Schneider et al., 2021; Vinck et al., 2023). Taken together, the existence of an information transfer might rely on both direct or indirect structural connectivity. However, here we also reported differences of TE between rewarding and punishing trials given the same backbone anatomical connectivity (Fig. 4). [...] “

      Reviewer #3:

      Summary:

      The authors investigated whether learning processes that rely on distinct reward or punishment outcomes in a probabilistic instrumental learning task involve functional interactions between two different cortico-cortical gamma-band modulations, suggesting that learning signals such as reward or punishment prediction errors are processed by two dominant interactions (lOFC-vmPFC and aINS-dlPFC) and later integrated to support switching between reward and punishment learning. Using the well-known analyses of mutual information, interaction information, and transfer entropy, the authors reached this conclusion by identifying directional task information flow between redundancy-dominated and synergy-dominated interactions. This integrative concept also provides a unifying view of how functionally distributed reward and/or punishment information is segregated and integrated across cortical areas.

      Strengths:

      The dataset used in this manuscript may come from previously published works (Gueguen et al., 2021) or from the same grant project due to the methods. Previous works have shown strong evidence about why gamma-band activities and those 4 areas are important. For further analyses, the current manuscript moved the ideas forward to examine how reward/punishment information is transferred between recorded areas according to the task conditions. The standard measurements, such as mutual information, interaction information, and transfer entropy, showed time-series activities at the millisecond level and allowed us to learn the directional information flow during a certain window. In addition, the diagram in Figure 6 summarized the results and proposed an integral concept with functional heterogeneities in cortical areas. The findings in this manuscript will support the ideas from human fMRI studies and add new insight to electrophysiological studies in non-human primates.

      We thank the reviewer for the summary as well as for highlighting the strengths. Please find below our answers regarding the weaknesses of the manuscript.

      Weaknesses:

      After reading through the manuscript, the term "non-selective" in the abstract confused me and I did not actually know what it meant and how it fits the conclusion. If I learned the methods correctly, the 4 areas were studied in this manuscript because of their selective responses to the RPE and PPE signals (Figure 2). The redundancy- and synergy-dominated subsystems indicated that two areas shared similar and complementary information, respectively, due to the negative and positive value of interaction information (Page 6). For me, it doesn't mean they are "non-selective", especially in redundancy-dominated subsystem. I may miss something about how you calculate the mutual information or interaction information. Could you elaborate this and explain what the "non-selective" means?

      In the study performed by Gueguen et al. in 2021, the authors used a general linear model (GLM) to link the gamma activity to both the reward and punishment prediction errors, and they looked for differences between the two conditions. Here, we reproduced this analysis, except that we used a measure from information theory (mutual information) that is able to capture both linear and non-linear (although monotonic) relationships between the gamma activity and the prediction errors. The clusters we reported reflect significant encoding of the RPE and/or the PPE. From Fig. 2, it can be seen that the four regions have a gamma activity that is modulated according to both reward and punishment PE. We used the term “non-selective” because the regions did not encode exclusively one or the other; instead, each contained various proportions of bipolar derivations encoding either one or both of them.
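      To make the remark about monotonic non-linear relationships concrete, the Python sketch below estimates mutual information after a rank-based (Gaussian-copula) normalization of each variable. Using this particular estimator is an assumption made for illustration, as the response above only states that the measure captures linear and monotonic non-linear dependencies. The function names are hypothetical, ties are not handled, and no bias correction is applied.

```python
import math
from statistics import NormalDist

_STD_NORMAL = NormalDist()


def copula_normalize(x):
    """Replace each sample by the standard-normal score of its rank."""
    n = len(x)
    order = sorted(range(n), key=lambda i: x[i])
    z = [0.0] * n
    for rank, i in enumerate(order, start=1):
        z[i] = _STD_NORMAL.inv_cdf(rank / (n + 1))
    return z


def pearson(a, b):
    """Pearson correlation of two equal-length sequences."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((u - ma) * (v - mb) for u, v in zip(a, b))
    var_a = sum((u - ma) ** 2 for u in a)
    var_b = sum((v - mb) ** 2 for v in b)
    return cov / math.sqrt(var_a * var_b)


def gaussian_copula_mi(x, y):
    """MI in bits between two 1-D variables after rank normalization."""
    r = pearson(copula_normalize(x), copula_normalize(y))
    if r * r >= 1.0:
        return math.inf  # perfectly monotonic relation
    return -0.5 * math.log2(1.0 - r * r)
```

      Because the estimate depends only on ranks, any strictly increasing transformation of either variable (for example, cubing positive values) leaves it unchanged, whereas a non-monotonic dependency such as a U-shape could go undetected.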

      The directional information flows identified in this manuscript were evidenced by the recording contacts of iEEG with levels of concurrent neural activities to the task conditions. However, are the conclusions well supported by the anatomical connections? Is it possible that the information was transferred to the target via another area? These questions may remain to be elucidated by using other approaches or animal models. It would be great to point this out here for further investigation.

      We thank the reviewer for this interesting question. We added the following paragraph to the discussion to clarify the current limitations of the transfer entropy and the link with anatomical connections:

      “Here, we quantified directional relationships between regions using the transfer entropy (Schreiber, 2000), which is a functional connectivity measure based on the Granger-Wiener causality principle. Tract tracing studies in the macaque have revealed strong interconnections between the lOFC and vmPFC in the macaque (Carmichael and Price, 1996; Öngür and Price, 2000). In humans, cortico-cortical anatomical connections have mainly been investigated using diffusion magnetic resonance imaging (dMRI). Several studies found strong probabilities of structural connectivity between the anterior insula with the orbitofrontal cortex and dorsolateral part of the prefrontal cortex (Cloutman et al., 2012; Ghaziri et al., 2017), and between the lOFC and vmPFC (Heather Hsu et al., 2020). In addition, the statistical dependency (e.g. coherence) between the LFP of distant areas could be potentially explained by direct anatomical connections (Schneider et al., 2021). Taken together, the existence of an information transfer might rely on both direct or indirect structural connectivity. However, here we also reported differences of TE between rewarding and punishing trials given the same backbone anatomical connectivity (Fig. 4). Our results are further supported by a recent study involving drug-resistant epileptic patients with resected insula who showed poorer performance than healthy controls in case of risky loss compared to risky gains (Von Siebenthal et al., 2017).”

      References

      Carmichael ST, Price J. 1996. Connectional networks within the orbital and medial prefrontal cortex of macaque monkeys. J Comp Neurol 371:179–207.

      Cloutman LL, Binney RJ, Drakesmith M, Parker GJM, Lambon Ralph MA. 2012. The variation of function across the human insula mirrors its patterns of structural connectivity: Evidence from in vivo probabilistic tractography. NeuroImage 59:3514–3521. doi:10.1016/j.neuroimage.2011.11.016

      Combrisson E, Allegra M, Basanisi R, Ince RAA, Giordano BL, Bastin J, Brovelli A. 2022. Group-level inference of information-based measures for the analyses of cognitive brain networks from neurophysiological data. NeuroImage 258:119347. doi:10.1016/j.neuroimage.2022.119347

      Ghaziri J, Tucholka A, Girard G, Houde J-C, Boucher O, Gilbert G, Descoteaux M, Lippé S, Rainville P, Nguyen DK. 2017. The Corticocortical Structural Connectivity of the Human Insula. Cereb Cortex 27:1216–1228. doi:10.1093/cercor/bhv308

      Gueguen MCM, Lopez-Persem A, Billeke P, Lachaux J-P, Rheims S, Kahane P, Minotti L, David O, Pessiglione M, Bastin J. 2021. Anatomical dissociation of intracerebral signals for reward and punishment prediction errors in humans. Nat Commun 12:3344. doi:10.1038/s41467-021-23704-w

      Heather Hsu C-C, Rolls ET, Huang C-C, Chong ST, Zac Lo C-Y, Feng J, Lin C-P. 2020. Connections of the Human Orbitofrontal Cortex and Inferior Frontal Gyrus. Cereb Cortex 30:5830–5843. doi:10.1093/cercor/bhaa160

      Lachaux J-P, Fonlupt P, Kahane P, Minotti L, Hoffmann D, Bertrand O, Baciu M. 2007. Relationship between task-related gamma oscillations and BOLD signal: new insights from combined fMRI and intracranial EEG. Hum Brain Mapp 28:1368–1375. doi:10.1002/hbm.20352

      Mukamel R, Gelbard H, Arieli A, Hasson U, Fried I, Malach R. 2004. Coupling Between Neuronal Firing, Field Potentials, and fMRI in Human Auditory Cortex. Cereb Cortex 14:881.

      Niessing J, Ebisch B, Schmidt KE, Niessing M, Singer W, Galuske RA. 2005. Hemodynamic signals correlate tightly with synchronized gamma oscillations. Science 309:948–951.

      Nir Y, Fisch L, Mukamel R, Gelbard-Sagiv H, Arieli A, Fried I, Malach R. 2007. Coupling between neuronal firing rate, gamma LFP, and BOLD fMRI is related to interneuronal correlations. Curr Biol 17:1275–1285.

      Öngür D, Price JL. 2000. The organization of networks within the orbital and medial prefrontal cortex of rats, monkeys and humans. Cereb Cortex 10:206–219.

      Schneider M, Broggini AC, Dann B, Tzanou A, Uran C, Sheshadri S, Scherberger H, Vinck M. 2021. A mechanism for inter-areal coherence through communication based on connectivity and oscillatory power. Neuron 109:4050-4067.e12. doi:10.1016/j.neuron.2021.09.037

      Schreiber T. 2000. Measuring information transfer. Phys Rev Lett 85:461.

      Von Siebenthal Z, Boucher O, Rouleau I, Lassonde M, Lepore F, Nguyen DK. 2017. Decision-making impairments following insular and medial temporal lobe resection for drug-resistant epilepsy. Soc Cogn Affect Neurosci 12:128–137. doi:10.1093/scan/nsw152

      Recommendations for the authors

      Reviewer #1

      (1) Overall, the writing of the manuscript is dense and makes it hard to follow the scientific logic and appreciate the key findings of the manuscript. I believe the manuscript would be accessible to a broader audience if the authors improved the writing and provided greater detail for their scientific questions, choice of analysis, and an explanation of their results in simpler terms.

      We extensively modified the introduction to better describe the rationale and research question.

      (2) In the introduction the authors state "we hypothesized that reward and punishment learning arise from complementary neural interactions between frontal cortex regions". This stated hypothesis arrives rather abruptly after a summary of the literature given that the literature summary does not directly inform their stated hypothesis. Put differently, the authors should explicitly state what the contradictions and/or gaps in the literature are, and what specific combinations of findings guide them to their hypothesis. When the authors state their hypothesis the reader is still left asking: why are the authors focusing on the frontal regions? What do the authors mean by complementary interactions? What specific evidence or contradiction in the literature led them to hypothesize that complementary interactions between frontal regions underlie reward and punishment learning?

      We extensively modified the introduction and provided a clearer description of the brain circuits involved and the rationale for searching redundant and synergistic interactions between areas.

      (3) Related to the above point: when the authors subsequently state "we tested whether redundancy- or synergy dominated interactions allow the emergence of collective brain networks differentially supporting reward and punishment learning", the Introduction (up to the point of this sentence) has not been written to explain the synergy vs. redundancy framework in the literature and how this framework comes into play to inform the authors' hypothesis on reward and punishment learning.

      We extensively modified the introduction and provided a clearer description of redundant and synergistic interactions between areas.

      (4) The explanation of redundancy vs synergy dominated brain networks itself is written densely and hard to follow. Furthermore, how this framework informs the question on the neural substrates of reward versus punishment learning is unclear. The authors should provide more precise statements on how and why redundancy vs. synergy comes into play in reward and punishment learning. Put differently, this redundancy vs. synergy framework is key for understanding the manuscript and the introduction is not written clearly enough to explain the framework and how it informs the authors' hypothesis and research questions on the neural substrates of reward vs. punishment learning.

      Same as above

      (5) While the choice of these four brain regions in context of reward and punishment learning does makes sense, the authors do not outline a clear scientific justification as to why these regions were selected in relation to their question.

      Same as above

      (6) Could the authors explain why they used gamma band power (as opposed to or in addition to the lower frequency bands) to investigate MI. Relatedly, when the authors introduce MI analysis, it would be helpful to briefly explain what this analysis measures and why it is relevant to address the question they are asking.

      Please see our answer to the first public comment. We added a paragraph to the discussion section to justify our choice of focusing on the gamma band only. We added the following sentence to the result section to justify our choice of using mutual information:

      “The MI allowed us to detect both linear and non-linear relationships between the gamma activity and the PE.”

      An extended explanation justifying our choice for the MI was already present in the method section.

      (7) The authors state that "all regions displayed a local "probabilistic" encoding of prediction errors with temporal dynamics peaking around 500 ms after outcome presentation". It would be helpful for the reader if the authors spelled out what they mean by probabilistic in this context as the term can be interpreted in many different ways.

      We agree with the reviewer that the term “probabilistic” can be interpreted in different ways. In the revised manuscript, we replaced “probabilistic” with “mixed”.

      (8) The authors should include a brief description of how they compute RPE and PPE in the beginning of the relevant results section.

      The explanation of how we estimated the PE is already present in the result section: “We estimated trial-wise prediction errors by fitting a Q-learning model to behavioral data. Fitting the model consisted in adjusting the constant parameters to maximize the likelihood of observed choices etc.”
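      As an illustration of what such a model produces, the Python sketch below computes trial-wise prediction errors for a single option with a standard delta-rule (Q-learning) update. The learning rate, the initial value and the function name are assumptions chosen for the example; they are not the fitted parameters or the code of the study, in which the free parameters are adjusted to maximize the likelihood of the observed choices.

```python
def q_learning_prediction_errors(outcomes, alpha=0.3, q_init=0.0):
    """Return trial-wise prediction errors for one chosen option.

    outcomes: obtained outcomes in trial order (e.g. 1 for a gain, 0 for a
              neutral outcome).
    alpha:    hypothetical learning rate (fitted to behavior in practice).
    q_init:   hypothetical initial expected value.
    """
    q = q_init
    errors = []
    for outcome in outcomes:
        delta = outcome - q   # prediction error: obtained minus expected
        errors.append(delta)
        q += alpha * delta    # move the expected value toward the outcome
    return errors


# Example: two gains followed by a neutral outcome, with alpha = 0.5
example_errors = q_learning_prediction_errors([1, 1, 0], alpha=0.5)
```

      In this toy sequence the error shrinks over the two gains and then turns negative on the unexpected neutral outcome, which shows why prediction errors need not stay near zero in late trials when outcomes are probabilistic.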

      (9) It is unclear from the Methods whether the authors have taken any measures to address the likely difference in the number of electrodes across subjects. For example, it is likely that some subjects have 10 electrodes in vmPFC while others may have 20. In group analyses, if the data is simply averaged across all electrodes then each subject contributes a different number of data points to the analysis. Hence, a subject with more electrodes can bias the group average. A starting point would be to state the variation in number of electrodes across subjects per brain region. If this variation is rather small, then simple averaging across electrodes might be justified. If the variation is large then one idea would be to average data across electrodes within subjects prior to taking the group average or use a resampling approach where the minimum number of electrodes per brain area is subsampled.

      We addressed this point in our public answers. As a reminder, the new version of the manuscript contains a figure showing the number of unique patients per region, the PE at the per-participant level, and the local encoding at the single-participant level.

      (10) One thing to consider is whether the reward and punishment in the task is symmetrical in valence. While 1$ increase and 1$ decrease is equivalent in magnitude, the psychological effect of the positive (vs. the negative) outcome may still be asymmetrical and the direction and magnitude of this asymmetry can vary across individuals. For instance, some subjects may be more sensitive to the reward (over punishment) while others are more sensitive to the punishment (over reward). In this scenario, it is possible that the differentiation observed in PPE versus RPE signals may arise from such psychological asymmetry rather than the intrinsic differences in how certain brain regions (and their interactions) may encode for reward vs punishment. Perhaps the authors can comment on this possibility, and/or conduct more in depth behavioral analysis to determine if certain subjects adjust their choice behavior faster in response to reward vs. punishment contexts.

      While it could be possible that individuals display different sensitivities vis-à-vis positive and negative prediction errors (and, indeed, a vast body of human reinforcement learning literature seems to point in this direction; Palminteri & Lebreton, 2022), it is unclear to us how such differences would translate into the recruitment of anatomically distinct areas for reward and punishment prediction errors. It is important to note here that our design partially orthogonalized positive and reward vs. negative and punishment PEs, because the neutral outcome can generate both positive and negative prediction errors, as a function of the learning context (reward seeking and punishment avoidance). Returning to the main question, Lefebvre et al. (2017), for instance, investigated with fMRI the neural correlates of reward prediction errors only and found that inter-individual differences in learning rates for positive and negative prediction errors correlated with differences in the degree of striatal activation and not with the recruitment of different areas. To sum up, while we acknowledge that individuals may display different sensitivity to prediction errors (and reward magnitudes), we believe that such differences should translate into differences in the degree of activation of a given system (the reward system vs. the punishment one) rather than differences in neural system recruitment.

      (11) As summarized in Fig 6, the authors show that information transfer from aINS to dlPFC was PPE specific whereas the information transfer from vmPFC to lOFC was RPE specific. What is unclear is if these findings arise as an inevitable statistical byproduct of the fact that aINS has high PPE-specificity and that vmPFC has high RPE-specificity. In other words, it is possible that the analyses in Fig 3,4 are sensitive to the fact that there is a larger proportion of electrodes with either PPE or RPE sensitivity in aINS and vmPFC respectively - and as such, the II analysis might reflect the dominant local encoding properties above and beyond reflecting the interactions between regions per se. Simply put, could the analysis in Fig 3B turn out in any other way given that there are more PPE specific electrodes in aINS and more RPE specific electrodes in vmPFC? Some options to address this question would be to limit the electrodes included in the analyses (in Fig 3B for example) so that each region has the same number of PPE and RPE specific electrodes included.

      Please see the simulation we added to the revised manuscript (Fig. S10) demonstrating that synergistic interactions can emerge between regions with the same specificity.

      Regarding the possibility that Fig. 3 and 4 are sensitive to the number of bipolar derivations being R/PPE specific, a counter-example is the vmPFC. The vmPFC has a few recordings specific to punishment (Fig. 2) in almost 30% of the subjects (Fig. S4). However, there is no II about the PPE between recordings of the vmPFC (Fig. 3). The same reasoning also holds for the lOFC. Therefore, the proportion of recordings being RPE or PPE-specific is not sufficient to determine the type of interactions.

      (12) Related to the point above, what would the results presented in Fig 3A (and 3B) look like if the authors ran the analyses on RPE specific and PPE specific electrodes only? Is the vmPFC-vmPFC RPE effect in Fig 3A arising simply due to the high prevalence of RPE specific electrodes in vmPFC (as shown in Fig. 2)?

      Please see our answer above.

      Reviewer #2:

      Regarding Figure 2A, the authors argued that their findings "globally reproduced their previously published findings" (from Gueguen et al, 2021). It is worth noting though that in their original analysis, both aINS and lOFC show differential effects (aINS showing greater punishment compared to reward, and the opposite for lOFC) compared to the current analysis. Although I would be inclined to believe that the nonlinear approach used here might explain part of the differences (as the authors discussed), I am very wary of the other argument advanced: "the removal of iEEG sites contaminated with pathological activity". This raised some red flags. Does that mean some of the conclusions observed in Gueguen et al (2021) are only the result of noise contamination, and therefore should be disregarded? The authors might want to add a short supplementary figure using the same approach as in Gueguen (2021) but using the subset of contacts used here to reassure potential readers of the validity of their previous manuscript.

      We appreciate the reviewer's concerns and understand the request for additional information. However, we would like to point out that the figure suggested by the reviewer is already present in the supplementary files of Gueguen et al. 2021 (see Fig. S2). The results of this study should not be disregarded, as the supplementary figure reproduces the results of the main text after excluding sites with pathological activity. Including or excluding sites contaminated with epileptic activity does not have a significant impact on the results, as analyses are performed at each time-stamp and across trials, and epileptic spikes are never aligned in time across trials.

      That being said, there are some methodological differences between the two studies. To extract gamma power, Gueguen et al. filtered and averaged 10 Hz sub-bands, while we used multi-tapers. Additionally, they used a temporal smoothing of 250 ms, while we used less smoothing. However, as explained in the main text, we used information-theoretical approaches to capture the statistical dependencies between gamma power and PE. Despite divergent methodologies, we obtained almost identical results.

      The data and code supporting this manuscript should be made available. If raw data cannot be shared for ethical reasons, single-trial gamma activities should at least be provided. Regarding the code used to process the data, sharing it could increase the appeal (and use) of the methods applied.

      We thank the reviewer for this suggestion. We added a section entitled “Code and data availability” and gave links to the scripts, notebooks and preprocessed data.

    1. Author response:

      The following is the authors’ response to the previous reviews.

      Recommendations for the authors:

      Reviewer #2 (Recommendations for the authors):

      I appreciate the efforts the authors made to clarify and justify their statements and methodology, respectively. I additionally appreciate the efforts they made to provide me with detailed information - including figures - to aid my comprehension. However, there are two things I nevertheless recommend the authors to include in the main manuscript.

      (1) Statement about animal wellbeing: The authors state that they were constrained in their imaging session duration not because of a commonly reported technical limitation, such as photobleaching (which I honestly assumed), but rather the general wellbeing of the animals, who exhibited signs of distress after longer imaging periods. I find this to be a critical issue and perhaps the best argument against performing longer imaging experiments (which would have increased the number of trials, thus potentially boosting the performance of their model). To say that they put animal welfare above all other scientific and technical considerations speaks to a strong ethical adherence to animal welfare policy, and I believe this should be somehow incorporated into the methods.

      We have now included this at the top of page 26:

      “Mice fully recovered from the brief isoflurane anesthesia, showing a clear blinking reflex, whisking and sniffing behaviors and normal body posture and movements, immediately after head fixation. In our experimental conditions, mice were imaged in sessions of up to 25 min since beyond this time we started observing some signs of distress or discomfort. Thus, we avoided longer recording times at the expense of collecting larger trial numbers, in strong adherence of animal welfare and ethics policy. A pilot group of mice were habituated to the head fixed condition in daily 20 min sessions for 3 days, however we did not observe a marked contrast in the behavior of habituated versus unhabituated mice beyond our relatively short 25 min imaging sessions. In consequence imaging sessions never surpassed a maximum of 25 min, after which the mouse was returned to its home cage.”

      (2) Author response image 2: I sincerely thank the authors for providing us reviewers with this figure, which compares the performance of the naïve Bayesian classifier they ultimately use in the study with other commonly implemented models. Also here I falsely assumed that other models, which take correlated activity into account, did not generally perform better than their ultimate model of choice. Although dwelling on it would be distracting (and outside the primary scope of the study), I would encourage the authors to include it as a figure supplement (and simply mention these controls en passant when they justify their choice of the naïve Bayesian classifier).

      This figure was now included in the revised manuscript as supplemental figure 3.

      Page 10 now reads:

      “We performed cross-validated, multi-class classification of the single-trial population responses (decoding, Fig. 2A) using a naive Bayes classifier to evaluate the prediction errors as the absolute difference between the stimulus azimuth and the predicted azimuth (Fig. 2A). We chose this classification algorithm over others due to its generally good performance with limited available data. We visualized the cross-validated prediction error distribution in cumulative plots where the observed prediction errors were compared to the distribution of errors for random azimuth sampling (Fig. 2B). When decoding all simultaneously recorded units, the observed classifier output was not significantly better (shifted towards smaller prediction errors) than the chance level distribution (Fig. 2B). The classifier also failed to decode complete DCIC population responses recorded with neuropixels probes (Fig. 3A). Other classifiers performed similarly (Suppl. Fig. 3A).”
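
      As an illustration only, the decoding scheme described in this paragraph can be sketched as follows. This is a minimal sketch, not the code used in the study: it assumes a Gaussian likelihood per unit, a flat prior over azimuths, leave-one-out cross-validation, and at least two trials per azimuth. The function names (`gaussian_nb_loo_errors`, `chance_errors`), the variance floor and the synthetic data are all hypothetical.

```python
import numpy as np


def gaussian_nb_loo_errors(responses, azimuths, var_floor=1e-3):
    """Leave-one-out naive Bayes decoding of stimulus azimuth.

    responses: (n_trials, n_units) single-trial population responses.
    azimuths:  (n_trials,) stimulus azimuth (degrees) of each trial.
    Returns the absolute decoding error (degrees) of every held-out trial.
    Assumes Gaussian per-unit likelihoods and a flat prior (illustrative).
    """
    responses = np.asarray(responses, dtype=float)
    azimuths = np.asarray(azimuths, dtype=float)
    classes = np.unique(azimuths)
    errors = np.empty(len(azimuths))
    for test in range(len(azimuths)):
        train = np.arange(len(azimuths)) != test
        log_like = np.empty(len(classes))
        for k, c in enumerate(classes):
            x = responses[train & (azimuths == c)]
            mu = x.mean(axis=0)
            var = x.var(axis=0) + var_floor  # floor avoids division by zero
            # Naive (independence) assumption: per-unit log-likelihoods add up.
            log_like[k] = -0.5 * np.sum(
                np.log(2 * np.pi * var) + (responses[test] - mu) ** 2 / var
            )
        predicted = classes[np.argmax(log_like)]
        errors[test] = abs(predicted - azimuths[test])
    return errors


def chance_errors(azimuths, rng, n_draws=1000):
    """Absolute errors obtained when the azimuth is guessed uniformly at random."""
    azimuths = np.asarray(azimuths, dtype=float)
    classes = np.unique(azimuths)
    guesses = rng.choice(classes, size=(n_draws, len(azimuths)))
    return np.abs(guesses - azimuths).ravel()


# Synthetic example: 13 azimuths, 14 trials each, 5 linearly tuned units.
rng = np.random.default_rng(0)
az = np.repeat(np.arange(-90, 91, 15), 14).astype(float)
resp = az[:, None] / 90 * np.linspace(0.5, 1.5, 5) + rng.normal(0, 0.1, (az.size, 5))
observed = gaussian_nb_loo_errors(resp, az)
chance = chance_errors(az, rng)
print("mean observed error:", observed.mean(), "mean chance error:", chance.mean())
```

      The two error samples can then be compared as cumulative distributions, with the observed curve shifted towards smaller errors when decoding is better than chance.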

      The bottom paragraph in page 19 now reads:

      “To characterize how the observed positive noise correlations could affect the representation of stimulus azimuth by DCIC top ranked unit population responses, we compared the decoding performance obtained by classifying the single-trial response patterns from top ranked units in the modeled decorrelated datasets versus the acquired data (with noise correlations). With the intention to characterize this with a conservative approach that would be less likely to find a contribution of noise correlations as it assumes response independence, we relied on the naive Bayes classifier for decoding throughout the study. Using this classifier, we observed that the modeled decorrelated datasets produced stimulus azimuth prediction error distributions that were significantly shifted towards higher decoding errors (Fig. 6B, C) and, in our imaging datasets, were not significantly different from chance level (Fig. 6B). Altogether, these results suggest that the detected noise correlations in our simultaneously acquired datasets can help reduce the error of the IC population code for sound azimuth. We observed a similar, but not significant tendency with another classifier that does not assume response independence (KNN classifier), though overall producing larger decoding errors than the Bayes classifier (Suppl. Fig. 3B).”

      Reviewer #3 (Recommendations for the authors):

      I am generally happy with the response to the reviews.

      I find the Author response image 3 quite interesting. The neuropixel data looks somewhat like I expected (especially for mouse #3 and maybe mouse #4). I find the distribution of weights across units in the imaging dataset compared to in the pixel dataset intriguing (though it probably is just the dimensionality of the data being so much higher).

      I'm not too familiar with facial movements but is it the case that the DCIC would be more modulated by ipsilateral movement compared to contralateral movements? Are face movements in mice conjugate or do both sides of the face move more or less independently? If not it may be interesting in future work to record bilaterally and see if that provides more information about DCIC responses.

      We sincerely thank the editors and reviewers for their careful appraisal, commendation of our effort and helpful constructive feedback which greatly improved the presentation of our study. Below in green font is a point by point reply to the comments provided by the reviewers.

      Public Reviews:

      Reviewer #1 (Public Review):

      Summary: In this study, the authors address whether the dorsal nucleus of the inferior colliculus (DCIC) in mice encodes sound source location within the front horizontal plane (i.e., azimuth). They do this using volumetric two-photon Ca2+ imaging and high-density silicon probes (Neuropixels) to collect single-unit data. Such recordings are beneficial because they allow large populations of simultaneous neural data to be collected. Their main results and the claims about those results are the following:

      (1) DCIC single-unit responses have high trial-to-trial variability (i.e., neural noise);

      (2) approximately 32% to 40% of DCIC single units have responses that are sensitive to sound source azimuth;

      (3) single-trial population responses (i.e., the joint response across all sampled single units in an animal) encode sound source azimuth "effectively" (as stated in title) in that localization decoding error matches average mouse discrimination thresholds;

      (4) DCIC can encode sound source azimuth in a similar format to that in the central nucleus of the inferior colliculus (as stated in Abstract);

      (5) evidence of noise correlation between pairs of neurons exists;

      and (6) noise correlations between responses of neurons help reduce population decoding error.

      While simultaneous recordings are not necessary to demonstrate results #1, #2, and #4, they are necessary to demonstrate results #3, #5, and #6.

      Strengths:

      - Important research question to all researchers interested in sensory coding in the nervous system.

      - State-of-the-art data collection: volumetric two-photon Ca2+ imaging and extracellular recording using high-density probes. Large neuronal data sets.

      - Confirmation of imaging results (lower temporal resolution) with more traditional microelectrode results (higher temporal resolution).

      - Clear and appropriate explanation of surgical and electrophysiological methods. I cannot comment on the appropriateness of the imaging methods.

      Strength of evidence for claims of the study:

      (1) DCIC single-unit responses have high trial-to-trial variability - The authors' data clearly shows this.

      (2) Approximately 32% to 40% of DCIC single units have responses that are sensitive to sound source azimuth - The sensitivity of each neuron's response to sound source azimuth was tested with a Kruskal-Wallis test, which is appropriate since response distributions were not normal. Using this statistical test, only 8% of neurons (median for imaging data) were found to be sensitive to azimuth, and the authors noted this was not significantly different than the false positive rate. The Kruskal-Wallis test was not performed on electrophysiological data. The authors suggested that low numbers of azimuth-sensitive units resulting from the statistical analysis may be due to the combination of high neural noise and relatively low number of trials, which would reduce statistical power of the test. This may be true, but if single-unit responses were moderately or strongly sensitive to azimuth, one would expect them to pass the test even with relatively low statistical power. At best, if their statistical test missed some azimuth-sensitive units, they were likely only weakly sensitive to azimuth. The authors went on to perform a second test of azimuth sensitivity (a chi-squared test) and found 32% (imaging) and 40% (e-phys) of single units to have statistically significant sensitivity. This feels a bit like fishing for a lower p-value. The Kruskal-Wallis test should have been left as the only analysis. Moreover, the use of a chi-squared test is questionable because it is meant to be used between two categorical variables, and neural response had to be binned before applying the test.

      The determination of what is a physiologically relevant “moderate or strong azimuth sensitivity” is not trivial, particularly when comparing tuning across different relays of the auditory pathway like the CNIC, auditory cortex, or in our case DCIC, where physiologically relevant azimuth sensitivities might be different. This is likely the reason why azimuth sensitivity has been defined in diverse ways across the bibliography (see Groh, Kelly & Underhill, 2003 for an early discussion of this issue). These diverse approaches include reaching a certain percentage of maximal response modulation, like used by Day et al. (2012, 2015, 2016) in CNIC, and ANOVA tests, like used by Panniello et al. (2018) and Groh, Kelly & Underhill (2003) in auditory cortex and IC respectively. Moreover, the influence of response variability and biases in response distribution estimation due to limited sampling has not been usually accounted for in the determination of azimuth sensitivity.

      As Reviewer #1 points out, in our study we used an appropriate ANOVA test (Kruskal-Wallis) as a starting point to study response sensitivity to stimulus azimuth at DCIC. Please note that the alpha = 0.05 used for this test is not based on experimental evidence about physiologically relevant azimuth sensitivity but instead is an arbitrary p-value threshold. Using this test on the electrophysiological data, we found that ~21% of the simultaneously recorded single units reached significance (n = 4 mice). Nevertheless, this percentage, in our small sample (n = 4), was not significantly different from our false positive detection rate (p = 0.0625, Mann-Whitney, see Author response image 1). Consequently, for both our imaging (Fig. 3C) and electrophysiological data, we could not ascertain whether the neurons reaching significance in these ANOVA tests were indeed meaningfully sensitive to azimuth or whether this was due to chance.

      Author response image 1.

      Percentage of the neuropixels recorded DCIC single units across mice that showed significant median response tuning, compared to false positive detection rate (α = 0.05, chance level).

      We reasoned that the observed markedly variable responses from DCIC units, which frequently failed to respond in many trials (Fig. 3D, 4A), in combination with the limited number of trial repetitions we could collect, results in under-sampled response distribution estimations. This under-sampling can bias the determination of stochastic dominance across azimuth response samples in Kruskal-Wallis tests. We would like to highlight that we decided not to implement resampling strategies to artificially increase the azimuth response sample sizes with “virtual trials”, in order to avoid “fishing for a smaller p-value”, when our collected samples might not accurately reflect the actual response population variability.
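
      For readers who want to see the shape of this analysis, the per-unit Kruskal-Wallis screening discussed above can be sketched as follows. This is a minimal sketch under stated assumptions: responses are stored as a trials-by-units array, `scipy.stats.kruskal` is used for the test, and the label-shuffling estimate of the false positive rate is an illustrative check that is not necessarily how the chance level was obtained in the study. The function names and the synthetic data are hypothetical.

```python
import numpy as np
from scipy.stats import kruskal


def fraction_azimuth_sensitive(responses, azimuths, alpha=0.05):
    """Fraction of units whose responses differ across azimuths (Kruskal-Wallis).

    responses: (n_trials, n_units); azimuths: (n_trials,).
    """
    responses = np.asarray(responses, dtype=float)
    azimuths = np.asarray(azimuths)
    classes = np.unique(azimuths)
    n_significant = 0
    for unit in responses.T:
        groups = [unit[azimuths == c] for c in classes]
        _, p_value = kruskal(*groups)
        n_significant += int(p_value < alpha)
    return n_significant / responses.shape[1]


def shuffled_false_positive_rate(responses, azimuths, rng, n_shuffles=20, alpha=0.05):
    """Empirical false positive rate: rerun the test with shuffled azimuth labels."""
    rates = [
        fraction_azimuth_sensitive(responses, rng.permutation(azimuths), alpha)
        for _ in range(n_shuffles)
    ]
    return float(np.mean(rates))


# Synthetic example: 40 units, of which the first 10 are tuned to azimuth.
rng = np.random.default_rng(0)
az = np.repeat(np.arange(-90, 91, 15), 14)
resp = rng.normal(0, 1, (az.size, 40))
resp[:, :10] += az[:, None] / 30
print("fraction significant:", fraction_azimuth_sensitive(resp, az))
print("shuffled false positive rate:", shuffled_false_positive_rate(resp, az, rng))
```

      With few trials per azimuth and sparse responses, the observed fraction can sit close to the shuffled rate even when some units are weakly tuned, which is the situation described in the reply above.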

      As an alternative to hypothesis testing based on ranking and determining stochastic dominance of one or more azimuth response samples (Kruskal-Wallis test), we evaluated the overall statistical dependency to stimulus azimuth of the collected responses. To do this we implemented the Chi-square test by binning neuronal responses into categories. Binning responses into categories can reduce the influence of response variability to some extent, which constitutes an advantage of the Chi-square approach, but we note the important consideration that these response categories are arbitrary.
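
      The binning-plus-Chi-square idea can be illustrated with a short sketch. The choices below are assumptions made for illustration only and are not taken from the manuscript: responses are split into three categories at their tertiles, and `scipy.stats.chi2_contingency` tests the resulting azimuth-by-category contingency table. The function name `chi2_azimuth_dependence` is hypothetical.

```python
import numpy as np
from scipy.stats import chi2_contingency


def chi2_azimuth_dependence(unit_response, azimuths, n_bins=3):
    """Chi-square test of dependence between binned responses and azimuth.

    unit_response: (n_trials,) responses of one unit; azimuths: (n_trials,).
    Returns (chi2 statistic, p-value, degrees of freedom).
    """
    unit_response = np.asarray(unit_response, dtype=float)
    azimuths = np.asarray(azimuths)
    # Quantile bin edges: an arbitrary, illustrative choice of categories.
    edges = np.quantile(unit_response, np.linspace(0, 1, n_bins + 1)[1:-1])
    binned = np.digitize(unit_response, edges)  # category index 0..n_bins-1
    classes = np.unique(azimuths)
    table = np.zeros((len(classes), n_bins), dtype=int)
    for i, c in enumerate(classes):
        table[i] = np.bincount(binned[azimuths == c], minlength=n_bins)
    # Drop empty categories, which would give zero expected counts.
    table = table[:, table.sum(axis=0) > 0]
    chi2, p_value, dof, _ = chi2_contingency(table)
    return float(chi2), float(p_value), int(dof)


# Synthetic example: one tuned and one untuned unit.
rng = np.random.default_rng(0)
az = np.repeat(np.arange(-90, 91, 15), 14)
tuned = az / 30 + rng.normal(0, 1, az.size)
untuned = rng.normal(0, 1, az.size)
print("tuned unit:", chi2_azimuth_dependence(tuned, az))
print("untuned unit:", chi2_azimuth_dependence(untuned, az))
```

      Note that with roughly 14 trials per azimuth the expected counts per cell are small, so the Chi-square approximation is rough, and the result depends on where the category boundaries are placed.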

      Altogether, we acknowledge that our Chi-square approach to define azimuth sensitivity is not free of limitations and despite enabling the interrogation of azimuth sensitivity at DCIC, its interpretability might not extend to other brain regions like CNIC or auditory cortex. Nevertheless we hope the aforementioned arguments justify why the Kruskal-Wallis test simply could not “have been left as the only analysis”.

      (3) Single-trial population responses encode sound source azimuth "effectively" in that localization decoding error matches average mouse discrimination thresholds - If only one neuron in a population had responses that were sensitive to azimuth, we would expect that decoding azimuth from observation of that one neuron's response would perform better than chance. By observing the responses of more than one neuron (if more than one were sensitive to azimuth), we would expect performance to increase. The authors found that decoding from the whole population response was no better than chance. They argue (reasonably) that this is because of overfitting of the decoder model (too few trials used to fit too many parameters) and provide evidence from decoding combined with principal components analysis which suggests that overfitting is occurring. What is troubling is the performance of the decoder when using only a handful of "top-ranked" neurons (in terms of azimuth sensitivity) (Fig. 4F and G). Decoder performance seems to increase when going from one to two neurons, then decreases when going from two to three neurons, and doesn't get much better for more neurons than for one neuron alone. It seems likely there is more information about azimuth in the population response, but decoder performance is not able to capture it because spike count distributions in the decoder model are not being accurately estimated due to too few stimulus trials (14, on average). In other words, it seems likely that decoder performance is underestimating the ability of the DCIC population to encode sound source azimuth.

      To get a sense of how effective a neural population is at coding a particular stimulus parameter, it is useful to compare population decoder performance to psychophysical performance. Unfortunately, mouse behavioral localization data do not exist. Therefore, the authors compare decoder error to mouse left-right discrimination thresholds published previously by a different lab. However, this comparison is inappropriate because the decoder and the mice were performing different perceptual tasks. The decoder is classifying sound sources to 1 of 13 locations from left to right, whereas the mice were discriminating between left or right sources centered around zero degrees. The errors in these two tasks represent different things. The two data sets may potentially be more accurately compared by extracting information from the confusion matrices of population decoder performance. For example, when the stimulus was at -30 deg, how often did the decoder classify the stimulus to a lefthand azimuth? Likewise, when the stimulus was +30 deg, how often did the decoder classify the stimulus to a righthand azimuth?

      The azimuth discrimination error reported by Lauer et al. (2011) comes from engaged and highly trained mice, which is a very different context to our experimental setting with untrained mice passively listening to stimuli from 13 random azimuths. Therefore we did not perform analyses or interpretations of our results based on the behavioral task from Lauer et al. (2011) and only made the qualitative observation that the errors match for discussion.

      We believe it is further important to clarify that Lauer et al. (2011) tested the ability of mice to discriminate between a positively conditioned stimulus (reference speaker at 0º center azimuth associated to a liquid reward) and a negatively conditioned stimulus (coming from one of five comparison speakers positioned at 20º, 30º, 50º, 70º and 90º azimuth, associated to an electrified lickport) in a conditioned avoidance task. In this task, mice are not precisely “discriminating between left or right sources centered around zero degrees”, making further analyses to compare the experimental design of Lauer et al. (2011) and ours even more challenging for valid interpretation.
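
      For reference, the hemifield read-out that the reviewer proposes (how often a stimulus at, say, -30º is decoded to a left-hand azimuth) could be computed from true and decoded azimuths as sketched below. This is an illustrative sketch only and was not part of the analyses in the study. It assumes negative azimuths are left and positive azimuths are right, excludes midline (0º) stimuli, and counts a lateral stimulus decoded to 0º as an error. The function names are hypothetical.

```python
import numpy as np


def hemifield_accuracy(true_az, decoded_az):
    """Fraction of lateral-stimulus trials decoded to the correct side."""
    true_az = np.asarray(true_az, dtype=float)
    decoded_az = np.asarray(decoded_az, dtype=float)
    lateral = true_az != 0  # midline stimuli have no correct side
    same_side = np.sign(true_az[lateral]) == np.sign(decoded_az[lateral])
    return float(same_side.mean())


def hemifield_accuracy_by_azimuth(true_az, decoded_az):
    """Same measure, reported separately for each lateral stimulus azimuth."""
    true_az = np.asarray(true_az, dtype=float)
    decoded_az = np.asarray(decoded_az, dtype=float)
    result = {}
    for a in np.unique(true_az[true_az != 0]):
        decoded = decoded_az[true_az == a]
        result[float(a)] = float(np.mean(np.sign(decoded) == np.sign(a)))
    return result


# Toy example with five trials.
true_az = [-30, -30, 30, 30, 0]
decoded_az = [-15, 15, 45, 0, 30]
print(hemifield_accuracy(true_az, decoded_az))
print(hemifield_accuracy_by_azimuth(true_az, decoded_az))
```

      As the reply above stresses, such a measure would still not be directly comparable to a conditioned avoidance task performed by trained, engaged mice.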

      (4) DCIC can encode sound source azimuth in a similar format to that in the central nucleus of the inferior colliculus - It is unclear what exactly the authors mean by this statement in the Abstract. There are major differences in the encoding of azimuth between the two neighboring brain areas: a large majority of neurons in the CNIC are sensitive to azimuth (and strongly so), whereas the present study shows a minority of azimuth-sensitive neurons in the DCIC. Furthermore, CNIC neurons fire reliably to sound stimuli (low neural noise), whereas the present study shows that DCIC neurons fire more erratically (high neural noise).

      Since sound source azimuth is reported to be encoded by population activity patterns at CNIC (Day and Delgutte, 2013), we refer to a population activity pattern code as the “similar format” in which this information is encoded at DCIC. Please note that this is a qualitative comparison and we do not claim this is the “same format”, due to the differences the reviewer precisely describes in the encoding of azimuth at CNIC where a much larger majority of neurons show stronger azimuth sensitivity and response reliability with respect to our observations at DCIC. By this qualitative similarity of encoding format we specifically mean the similar occurrence of activity patterns from azimuth sensitive subpopulations of neurons in both CNIC and DCIC, which carry sufficient information about the stimulus azimuth for a sufficiently accurate prediction with regard to the behavioral discrimination ability.

      (5) Evidence of noise correlation between pairs of neurons exists - The authors' data and analyses seem appropriate and sufficient to justify this claim.

      (6) Noise correlations between responses of neurons help reduce population decoding error - The authors show convincing analysis that performance of their decoder increased when simultaneously measured responses were tested (which include noise correlation) than when scrambled-trial responses were tested (eliminating noise correlation). This makes it seem likely that noise correlation in the responses improved decoder performance. The authors mention that the naïve Bayesian classifier was used as their decoder for computational efficiency, presumably because it assumes no noise correlation and, therefore, assumes responses of individual neurons are independent of each other across trials to the same stimulus. The use of decoder that assumes independence seems key here in testing the hypothesis that noise correlation contains information about sound source azimuth. The logic of using this decoder could be more clearly spelled out to the reader. For example, if the null hypothesis is that noise correlations do not carry azimuth information, then a decoder that assumes independence should perform the same whether population responses are simultaneous or scrambled. The authors' analysis showing a difference in performance between these two cases provides evidence against this null hypothesis.

      We sincerely thank the reviewer for this careful and detailed consideration of our analysis approach. Following the reviewer’s constructive suggestion, we justified the decoder choice in the results section at the last paragraph of page 18:

      “To characterize how the observed positive noise correlations could affect the representation of stimulus azimuth by DCIC top ranked unit population responses, we compared the decoding performance obtained by classifying the single-trial response patterns from top ranked units in the modeled decorrelated datasets versus the acquired data (with noise correlations). With the intention to characterize this with a conservative approach that would be less likely to find a contribution of noise correlations as it assumes response independence, we relied on the naive Bayes classifier for decoding throughout the study.

      Using this classifier, we observed that the modeled decorrelated datasets produced stimulus azimuth prediction error distributions that were significantly shifted towards higher decoding errors (Fig. 5B, C) and, in our imaging datasets, were not significantly different from chance level (Fig. 5B). Altogether, these results suggest that the detected noise correlations in our simultaneously acquired datasets can help reduce the error of the IC population code for sound azimuth.”
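
      The logic of comparing simultaneously acquired and decorrelated responses can be illustrated with a trial-shuffling sketch. This is a minimal sketch under an explicit assumption: decorrelation is done here by independently permuting each unit's trials within every azimuth, which preserves each unit's tuning and variability while removing trial-by-trial covariation. The "modeled decorrelated datasets" in the manuscript may have been constructed differently. All function names and the synthetic data are hypothetical.

```python
import numpy as np


def shuffle_trials_within_stimulus(responses, azimuths, rng):
    """Permute each unit's trials independently within every azimuth."""
    responses = np.asarray(responses, dtype=float)
    azimuths = np.asarray(azimuths)
    shuffled = responses.copy()
    for c in np.unique(azimuths):
        idx = np.flatnonzero(azimuths == c)
        for u in range(responses.shape[1]):
            shuffled[idx, u] = responses[rng.permutation(idx), u]
    return shuffled


def mean_noise_correlation(responses, azimuths):
    """Mean pairwise Pearson correlation of responses after removing tuning."""
    responses = np.asarray(responses, dtype=float)
    azimuths = np.asarray(azimuths)
    residuals = responses.copy()
    for c in np.unique(azimuths):
        idx = azimuths == c
        residuals[idx] -= responses[idx].mean(axis=0)  # subtract per-azimuth mean
    corr = np.corrcoef(residuals, rowvar=False)
    off_diagonal = corr[~np.eye(corr.shape[0], dtype=bool)]
    return float(off_diagonal.mean())


# Synthetic example: 5 tuned units sharing a common trial-by-trial fluctuation.
rng = np.random.default_rng(0)
az = np.repeat(np.arange(-90, 91, 15), 14)
shared = rng.normal(0, 1.0, (az.size, 1))
private = rng.normal(0, 0.5, (az.size, 5))
resp = az[:, None] / 90 + shared + private
decorrelated = shuffle_trials_within_stimulus(resp, az, rng)
print("noise correlation, acquired:", mean_noise_correlation(resp, az))
print("noise correlation, shuffled:", mean_noise_correlation(decorrelated, az))
```

      Decoding both versions with the same independence-assuming classifier then isolates the contribution of noise correlations: if they carried no azimuth information, performance on the acquired and shuffled responses should match.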

      Minor weakness:

      - Most studies of neural encoding of sound source azimuth are done in a noise-free environment, but the experimental setup in the present study had substantial background noise. This complicates comparison of the azimuth tuning results in this study to those of other studies. One is left wondering if azimuth sensitivity would have been greater in the absence of background noise, particularly for the imaging data where the signal was only about 12 dB above the noise. The description of the noise level and signal + noise level in the Methods should be made clearer. Mice hear from about 2.5 - 80 kHz, so it is important to know the noise level within this band as well as specifically within the band overlapping with the signal.

      We agree with the reviewer that this information is useful. In our study, the background R.M.S. SPL during imaging across the mouse hearing range (2.5-80kHz) was 44.53 dB and for neuropixels recordings 34.68 dB. We have added this information to the methods section of the revised manuscript.

      Reviewer #2 (Public Review):

      In the present study, Boffi et al. investigate the manner in which the dorsal cortex of the inferior colliculus (DCIC), an auditory midbrain area, encodes sound location azimuth in awake, passively listening mice. By employing volumetric calcium imaging (scanned temporal focusing or s-TeFo), complemented with high-density electrode electrophysiological recordings (neuropixels probes), they show that sound-evoked responses are exquisitely noisy, with only a small portion of neurons (units) exhibiting spatial sensitivity. Nevertheless, a naïve Bayesian classifier was able to predict the presented azimuth based on the responses from small populations of these spatially sensitive units. A portion of the spatial information was provided by correlated trial-to-trial response variability between individual units (noise correlations). The study presents a novel characterization of spatial auditory coding in a non-canonical structure, representing a noteworthy contribution specifically to the auditory field and generally to systems neuroscience, due to its implementation of state-of-the-art techniques in an experimentally challenging brain region. However, nuances in the calcium imaging dataset and the naïve Bayesian classifier warrant caution when interpreting some of the results.

      Strengths:

      The primary strength of the study lies in its methodological achievements, which allowed the authors to collect a comprehensive and novel dataset. While the DCIC is a dorsal structure, it extends up to a millimetre in depth, making it optically challenging to access in its entirety. It is also more highly myelinated and vascularised compared to e.g., the cerebral cortex, compounding the problem. The authors successfully overcame these challenges and present an impressive volumetric calcium imaging dataset. Furthermore, they corroborated this dataset with electrophysiological recordings, which produced overlapping results. This methodological combination ameliorates the natural concerns that arise from inferring neuronal activity from calcium signals alone, which are in essence an indirect measurement thereof.

      Another strength of the study is its interdisciplinary relevance. For the auditory field, it represents a significant contribution to the question of how auditory space is represented in the mammalian brain. "Space" per se is not mapped onto the basilar membrane of the cochlea and must be computed entirely within the brain. For azimuth, this requires the comparison of minuscule differences in the timing and intensity of sounds arriving at each ear. It is now generally thought that azimuth is initially encoded in two, opposing hemispheric channels, but the extent to which this initial arrangement is maintained throughout the auditory system remains an open question. The authors observe only a slight contralateral bias in their data, suggesting that sound source azimuth in the DCIC is encoded in a more nuanced manner compared to earlier processing stages of the auditory hindbrain. This is interesting, because it is also known to be an auditory structure to receive more descending inputs from the cortex.

      Systems neuroscience continues to strive for the perfection of imaging novel, less accessible brain regions. Volumetric calcium imaging is a promising emerging technique, allowing the simultaneous measurement of large populations of neurons in three dimensions. But this necessitates corroboration with other methods, such as electrophysiological recordings, which the authors achieve. The dataset moreover highlights the distinctive characteristics of neuronal auditory representations in the brain. Its signals can be exceptionally sparse and noisy, which provide an additional layer of complexity in the processing and analysis of such datasets. This will be undoubtedly useful for future studies of other less accessible structures with sparse responsiveness.

      Weaknesses:

      Although the primary finding that small populations of neurons carry enough spatial information for a naïve Bayesian classifier to reasonably decode the presented stimulus is not called into question, certain idiosyncrasies, in particular the calcium imaging dataset and model, complicate specific interpretations of the model output, and the readership is urged to interpret these aspects of the study's conclusions with caution.

      I remain in favour of volumetric calcium imaging as a suitable technique for the study, but the presently constrained spatial resolution is insufficient to unequivocally identify regions of interest as cell bodies (and are instead referred to as "units" akin to those of electrophysiological recordings). It remains possible that the imaging set is inadvertently influenced by non-somatic structures (including neuropil), which could report neuronal activity differently than cell bodies. Due to the lack of a comprehensive ground-truth comparison in this regard (which to my knowledge is impossible to achieve with current technology), it is difficult to imagine how many informative such units might have been missed because their signals were influenced by spurious, non-somatic signals, which could have subsequently misled the models. The authors reference the original Nature Methods article (Prevedel et al., 2016) throughout the manuscript, presumably in order to avoid having to repeat previously published experimental metrics. But the DCIC is neither the cortex nor hippocampus (for which the method was originally developed) and may not have the same light scattering properties (not to mention neuronal noise levels). Although the corroborative electrophysiology data largely alleviates these concerns for this particular study, the readership should be cognisant of such caveats, in particular those who are interested in implementing the technique for their own research.

      A related technical limitation of the calcium imaging dataset is the relatively low number of trials (14) given the inherently high level of noise (both neuronal and imaging). Volumetric calcium imaging, while offering a uniquely expansive field of view, requires relatively high average excitation laser power (in this case nearly 200 mW), a level of exposure the authors may have wanted to minimise by maintaining a low number of repetitions, but I yield to them to explain.

      We assumed that the levels of heating by excitation light measured in the neocortex by Prevedel et al. (2016) were also representative for the DCIC. Nevertheless, we recognize this approximation might not be very accurate, due to differences in tissue architecture and vascularization between these two brain areas, to name just a few factors. The limiting factor preventing us from collecting more trials in our imaging sessions was that we observed signs of discomfort or slight distress in some mice after ~30 min of imaging in our custom setup, which we established as a humane end point to prevent distress. In consequence, imaging sessions were kept to 25 min in duration, limiting the number of trials collected. However, we cannot rule out that the imaging sessions could be prolonged without these signs of discomfort after more extensive habituation prior to experiments, or that an influence of our custom setup, such as potential heating of the brain by the illumination light, was the cause of the observed distress. Nevertheless, we note that previous work has shown that ~200 mW average power is a safe regime for imaging in the cortex by keeping brain heating minimal (Prevedel et al., 2016), without producing the lasting damage observed by immunohistochemistry against apoptosis markers above 250 mW (Podgorski and Ranganathan 2016, https://doi.org/10.1152/jn.00275.2016).

      Calcium imaging is also inherently slow, requiring relatively long inter-stimulus intervals (in this case 5 s). This unfortunately renders any model designed to predict a stimulus (in this case sound azimuth) from particularly noisy population neuronal data like these as highly prone to overfitting, to which the authors correctly admit after a model trained on the entire raw dataset failed to perform significantly above chance level. This prompted them to feed the model only with data from neurons with the highest spatial sensitivity. This ultimately produced reasonable performance (and was implemented throughout the rest of the study), but it remains possible that if the model was fed with more repetitions of imaging data, its performance would have been more stable across the number of units used to train it. (All models trained with imaging data eventually failed to converge.) However, I also see these limitations as an opportunity to improve the technology further, which I reiterate will be generally important for volume imaging of other sparse or noisy calcium signals in the brain.

      Transitioning to the naïve Bayesian classifier itself, I first openly ask the authors to justify their choice of this specific model. There are countless types of classifiers for these data, each with their own pros and cons. Did they actually try other models (such as support vector machines), which ultimately failed? If so, these negative results (even if mentioned en passant) would be extremely valuable to the community, in my view. I ask this specifically because different methods assume correspondingly different statistical properties of the input data, and to my knowledge naïve Bayesian classifiers assume that predictors (neuronal responses) are independent within a class (azimuth). As the authors show that noise correlations are informative in predicting azimuth, I wonder why they chose a model that doesn't take advantage of these statistical regularities. It could be because of technical considerations (they mention computing efficiency), but I am left generally uncertain about the specific logic that was used to guide the authors through their analytical journey.

      One of the main reasons we chose the naïve Bayesian classifier is indeed because it assumes that the responses of the simultaneously recorded neurons are independent and therefore it does not assume a contribution of noise correlations to the estimation of the posterior probability of each azimuth. This model would represent the null hypothesis that noise correlations do not contribute to the encoding of stimulus azimuth, which would be verified by an equal decoding outcome from correlated or decorrelated datasets. Since we observed that this is not the case, the model supports the alternative hypothesis that noise correlations do indeed influence stimulus azimuth encoding. We wanted to test these hypotheses with the most conservative approach possible that would be least likely to find a contribution of noise correlations. Another relevant reason that justifies our choice of the naive Bayesian classifier is its robustness against the limited number of trials we could collect, in comparison to other more “data hungry” classifiers like SVM, KNN, or artificial neural networks. We did perform preliminary tests with alternative classifiers but the obtained decoding errors were similar when decoding the whole population activity (Supplemental figure 3A). Dimensionality reduction following the approach described in the manuscript showed a tendency towards smaller decoding errors observed with an alternative classifier like KNN, but these errors were still larger than the ones observed with the naive Bayesian classifier (median error 45°). Nevertheless, we also observe a similar tendency for slightly larger decoding errors in the absence of noise correlations (decorrelated, Supplemental figure 3B). Sentences detailing the logic of classifier choice are now included in the results section at page 10 and at the last paragraph of page 18 (see responses to Reviewer 1).
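
      The logic of this test (decode with a classifier that assumes independent units, then compare correlated against trial-shuffled data) can be sketched in a few lines. This is only an illustrative sketch under stated assumptions, not the analysis code used in the study: the Gaussian likelihood, the simulated responses, the set of azimuths, and all function names are hypothetical; only the 14 trials per azimuth are taken from the text above.

```python
import numpy as np

def shuffle_within_class(X, y, rng):
    """Permute trials independently for each unit within each azimuth class.

    Each unit keeps its own set of responses per azimuth (the signal), but the
    trial-by-trial pairing across units (the noise correlation) is broken.
    """
    Xs = X.copy()
    for label in np.unique(y):
        idx = np.flatnonzero(y == label)
        for unit in range(X.shape[1]):
            Xs[idx, unit] = X[rng.permutation(idx), unit]
    return Xs

def gaussian_nb_predict(X_train, y_train, X_test, var_floor=1e-6):
    """Gaussian naive Bayes with equal priors (assumed likelihood model)."""
    labels = np.unique(y_train)
    log_post = np.empty((X_test.shape[0], labels.size))
    for k, label in enumerate(labels):
        Xk = X_train[y_train == label]
        mu = Xk.mean(axis=0)
        var = Xk.var(axis=0) + var_floor
        # Summing per-unit log-likelihoods is the independence assumption.
        log_post[:, k] = -0.5 * np.sum(
            np.log(2 * np.pi * var) + (X_test - mu) ** 2 / var, axis=1
        )
    return labels[np.argmax(log_post, axis=1)]

def loo_median_error(X, y):
    """Leave-one-out median absolute decoding error, in degrees."""
    errors = np.empty(len(y))
    for i in range(len(y)):
        train = np.arange(len(y)) != i
        pred = gaussian_nb_predict(X[train], y[train], X[i:i + 1])[0]
        errors[i] = abs(pred - y[i])
    return np.median(errors)

# Hypothetical simulated population: linear tuning plus shared and private noise.
rng = np.random.default_rng(0)
azimuths = np.arange(-90, 91, 30)          # assumed stimulus set
n_trials, n_units = 14, 6                  # 14 trials per azimuth, as in the text
y = np.repeat(azimuths, n_trials)
slopes = rng.uniform(0.5, 1.5, n_units)
shared = rng.normal(size=(y.size, 1))      # common trial-to-trial fluctuation
private = rng.normal(size=(y.size, n_units))
X = np.outer(y / 90.0, slopes) + 0.7 * shared + 0.7 * private

X_shuffled = shuffle_within_class(X, y, rng)
err_correlated = loo_median_error(X, y)
err_decorrelated = loo_median_error(X_shuffled, y)
print(err_correlated, err_decorrelated)
```

      Under the null hypothesis described above, the two printed errors would be expected to be similar; whether and in which direction they differ depends entirely on the simulated tuning and noise structure chosen here.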

      That aside, there remain other peculiarities in model performance that warrant further investigation. For example, what spurious features (or lack of informative features) in these additional units prevented the models of imaging data from converging?

      Considering the amount of variability observed throughout the neuronal responses both in imaging and neuropixels datasets, it is easy to suspect that the information about stimulus azimuth carried in different amounts by individual DCIC neurons can be mixed up with information about other factors (Stringer et al., 2019). In an attempt to study the origin of these features that could confound stimulus azimuth decoding we explored their relation to face movement (Supplemental Figure 2), finding a correlation to snout movements, in line with previous work by Stringer et al. (2019).

      In an orthogonal question, did the most spatially sensitive units share any detectable tuning features? A different model trained with electrophysiology data in contrast did not collapse in the range of top-ranked units plotted. Did this model collapse at some point after adding enough units, and how well did that correlate with the model for the imaging data?

      Our electrophysiology datasets were much smaller in size (number of simultaneously recorded neurons) compared to our volumetric calcium imaging datasets, resulting in a much smaller total number of top ranked units detected per dataset. This precluded the determination of a collapse of decoder performance due to overfitting beyond the range plotted in Fig 4G.

      How well did the form (and diversity) of the spatial tuning functions as recorded with electrophysiology resemble their calcium imaging counterparts? These fundamental questions could be addressed with more basic, but transparent analyses of the data (e.g., the diversity of spatial tuning functions of their recorded units across the population). Even if the model extracts features that are not obvious to the human eye in traditional visualisations, I would still find this interesting.

      The diversity of the azimuth tuning curves recorded with calcium imaging (Fig. 3B) was qualitatively larger than the ones recorded with electrophysiology (Fig. 4B), potentially due to the larger sampling obtained with volumetric imaging. We did not perform a detailed comparison of the form and a more quantitative comparison of the diversity of these functions because the signals compared are quite different, as the calcium indicator signal is subject to nonlinearities due to Ca2+ binding cooperativity and low-pass filtering due to binding kinetics. We feared this could lead to misleading interpretations about the similarities or differences between the azimuth tuning functions in imaged and electrophysiology datasets. Our model uses statistical response dependency to stimulus azimuth, which does not rely on features from a descriptive statistic like mean response tuning. In this context, visualizing the trial-to-trial responses as a function of azimuth shows “features that are not obvious to the human eye in traditional visualizations” (Fig. 3D, left inset).

      Finally, the readership is encouraged to interpret certain statements by the authors in the current version conservatively. How the brain ultimately extracts spatial neuronal data for perception is anyone's guess, but it is important to remember that this study only shows that a naïve Bayesian classifier could decode this information, and it remains entirely unclear whether the brain does this as well. For example, the model is able to achieve a prediction error that corresponds to the psychophysical threshold in mice performing a discrimination task (~30°). Although this is an interesting coincidental observation, it does not mean that the two metrics are necessarily related. The authors correctly do not explicitly claim this, but the manner in which the prose flows may lead a non-expert into drawing that conclusion.

      To avoid misleading the non-expert readers, we have clarified in the manuscript that the observed correspondence between decoding error and psychophysical threshold is explicitly coincidental.

      Page 13, end of middle paragraph:

      “If we consider the median of the prediction error distribution as an overall measure of decoding performance, the single-trial response patterns from subsamples of at least the 7 top ranked units produced median decoding errors that coincidentally matched the reported azimuth discrimination ability of mice (Fig 4G, minimum audible angle = 31º) (Lauer et al., 2011).”

      Page 14, bottom paragraph:

      “Decoding analysis (Fig. 4F) of the population response patterns from azimuth dependent top ranked units simultaneously recorded with neuropixels probes showed that the 4 top ranked units are the smallest subsample necessary to produce a significant decoding performance that coincidentally matches the discrimination ability of mice (31° (Lauer et al., 2011)) (Fig. 5F, G).”

      We also added to the Discussion sentences clarifying that a relationship between these two variables remains to be determined, and that it also remains to be determined if the DCIC indeed performs a Bayesian decoding computation for sound localization.

      Page 20, bottom:

      “… Concretely, we show that sound location coding does indeed occur at DCIC on the single trial basis, and that this follows a comparable mechanism to the characterized population code at CNIC (Day and Delgutte, 2013). However, it remains to be determined if indeed the DCIC network is physiologically capable of Bayesian decoding computations. Interestingly, the small number of DCIC top ranked units necessary to effectively decode stimulus azimuth suggests that sound azimuth information is redundantly distributed across DCIC top ranked units, which points out that mechanisms beyond coding efficiency could be relevant for this population code.

      While the decoding error observed from our DCIC datasets obtained in passively listening, untrained mice coincidentally matches the discrimination ability of highly trained, motivated mice (Lauer et al., 2011), a relationship between decoding error and psychophysical performance remains to be determined. Interestingly, a primary sensory representation should theoretically be even more precise than the behavioral performance as reported in the visual system (Stringer et al., 2021).”

      Moreover, the concept of redundancy (of spatial information carried by units throughout the DCIC) is difficult for me to disentangle. One interpretation of this formulation could be that there are non-overlapping populations of neurons distributed across the DCIC that each could predict azimuth independently of each other, which is unlikely what the authors meant. If the authors meant generally that multiple neurons in the DCIC carry sufficient spatial information, then a single neuron would have been able to predict sound source azimuth, which was not the case. I have the feeling that they actually mean "complementary", but I leave it to the authors to clarify my confusion, should they wish.

      We observed that the response patterns from relatively small fractions of the azimuth sensitive DCIC units (4-7 top ranked units) are sufficient to generate an effective code for sound azimuth, while 32-40% of all simultaneously recorded DCIC units are azimuth sensitive. In light of this observation, we interpreted that the azimuth information carried by the population should be redundantly distributed across the complete subpopulation of azimuth sensitive DCIC units.

      In summary, the present study represents a significant body of work that contributes substantially to the field of spatial auditory coding and systems neuroscience. However, limitations of the imaging dataset and model as applied in the study muddle concrete conclusions about how the DCIC precisely encodes sound source azimuth, and even more so about sound localisation in a behaving animal. Nevertheless, it presents a novel and unique dataset, which, regardless of secondary interpretation, corroborates the general notion that auditory space is encoded in an extraordinarily complex manner in the mammalian brain.

      Reviewer #3 (Public Review):

      Summary: Boffi and colleagues sought to quantify the single-trial, azimuthal information in the dorsal cortex of the inferior colliculus (DCIC), a relatively understudied subnucleus of the auditory midbrain. They used two complementary recording methods while mice passively listened to sounds at different locations: a large volume but slow sampling calcium-imaging method, and a smaller volume but temporally precise electrophysiology method. They found that neurons in the DCIC were variable in their activity, unreliably responding to sound presentation and responding during inter-sound intervals. Boffi and colleagues used a naïve Bayesian decoder to determine if the DCIC population encoded sound location on a single trial. The decoder failed to classify sound location better than chance when using the raw single-trial population response but performed significantly better than chance when using intermediate principal components of the population response. In line with this, when the most azimuth dependent neurons were used to decode azimuthal position, the decoder performed equivalently to the azimuthal localization abilities of mice. The top azimuthal units were not clustered in the DCIC, possessed a contralateral bias in response, and were correlated in their variability (e.g., positive noise correlations). Interestingly, when these noise correlations were perturbed by inter-trial shuffling decoding performance decreased. Although Boffi and colleagues display that azimuthal information can be extracted from DCIC responses, it remains unclear to what degree this information is used and what role noise correlations play in azimuthal encoding.

      Strengths: The authors should be commended for collection of this dataset. When done in isolation (which is typical), calcium imaging and linear array recordings have intrinsic weaknesses. However, those weaknesses are alleviated when done in conjunction with one another - especially when the data largely recapitulates the findings of the other recording methodology. In addition to the video of the head during the calcium imaging, this data set is extremely rich and will be of use to those interested in the information available in the DCIC, an understudied but likely important subnucleus in the auditory midbrain.

      The DCIC neural responses are complex; the units unreliably respond to sound onset, and at the very least respond to some unknown input or internal state (e.g., large inter-sound interval responses). The authors do a decent job in wrangling these complex responses: using interpretable decoders to extract information available from population responses.

      Weaknesses:

      The authors observe that neurons with the most azimuthal sensitivity within the DCIC are positively correlated, but they use a naïve Bayesian decoder which assumes independence between units. Although this is a bit strange given their observation that some of the recorded units are correlated, it is unlikely to be a critical flaw. At one point the authors reduce the dimensionality of their data through PCA and use the loadings onto these components in their decoder. PCA incorporates the correlational structure when finding the principal components and constrains these components to be orthogonal and uncorrelated. This should alleviate some of the concern regarding the use of the naïve Bayesian decoder because the projections onto the different components are independent. Nevertheless, the decoding results are a bit strange, likely because there is not much linearly decodable azimuth information in the DCIC responses. Raw population responses failed to provide sufficient information concerning azimuth for the decoder to perform better than chance. Additionally, it only performed better than chance when certain principal components or top ranked units contributed to the decoder but not as more components or units were added. So, although there does appear to be some azimuthal information in the recorded DCIC populations - it is somewhat difficult to extract and likely not an 'effective' encoding of sound localization as their title suggests.
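
      The point about PCA can be illustrated with a small numerical sketch. It is a hypothetical example on simulated data (the population size, noise model, and variable names are all assumptions, not taken from the study): correlated unit responses are projected onto their principal components, and the covariance between the resulting projections is checked. Note that this demonstrates decorrelation only; uncorrelated projections are statistically independent only under additional assumptions such as joint Gaussianity.

```python
import numpy as np

# Hypothetical population with shared trial-to-trial fluctuations,
# so that units are positively correlated with one another.
rng = np.random.default_rng(1)
n_trials, n_units = 200, 5
shared = rng.normal(size=(n_trials, 1))
X = 0.8 * shared + 0.6 * rng.normal(size=(n_trials, n_units))

# PCA via singular value decomposition of the mean-centred responses.
Xc = X - X.mean(axis=0)
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
scores = Xc @ Vt.T                      # projections onto the components

# Covariance between units (non-zero off-diagonal entries) versus
# covariance between component projections (off-diagonal entries ~ 0).
cov_units = np.cov(Xc, rowvar=False)
cov_scores = np.cov(scores, rowvar=False)
off_diag = cov_scores - np.diag(np.diag(cov_scores))
print(np.abs(off_diag).max())
```

      The printed value should be at the level of floating-point error, whereas the off-diagonal entries of `cov_units` remain clearly positive in this simulation.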

      As described in the responses to reviewers 1 and 2, we chose the naïve Bayes classifier as a decoder to determine the influence of noise correlations through the most conservative approach possible, as this classifier would be least likely to find a contribution of correlated noise. Also, we chose this decoder due to its robustness against limited numbers of trials collected, in comparison to “data hungry” non linear classifiers like KNN or artificial neuronal nets. Lastly, we observed that small populations of noisy, unreliable (do not respond in every trial) DCIC neurons can encode stimulus azimuth in passively listening mice matching the discrimination error of trained mice. Therefore, while this encoding is definitely not efficient, it can still be considered effective.

      Although this is quite a worthwhile dataset, the authors present relatively little about the characteristics of the units they've recorded. This may be due to the high variance in responses seen in their population. Nevertheless, the authors note that units do not respond on every trial but do not report what percentage of trials fail to evoke a response. Is it that neurons are noisy because they do not respond on every trial or is it also that when they do respond they have variable response distributions? It would be nice to gain some insight into the heterogeneity of the responses.

      The limited number of azimuth trial repetitions that we could collect precluded us from making any quantification of the unreliability (failures to respond) and variability in the response distributions from the units we recorded, as we feared they could be misleading. In qualitative terms, “due to the high variance in responses seen” in the recordings and the limited trial sampling, it is hard to make any generalization. In consequence we referred to the observed response variance altogether as neuronal noise. Considering these points, our datasets are publicly available for exploration of the response characteristics.

      Additionally, is there any clustering at all in response profiles or is each neuron they recorded in the DCIC unique?

      We attempted to qualitatively visualize response clustering using dimensionality reduction, observing different degrees of clustering or lack thereof across the azimuth classes in the datasets collected from different mice. It is likely that the limited number of azimuth trials we could collect and the high response variance contribute to an inconsistent response clustering across datasets.

      They also only report the noise correlations for their top ranked units, but it is possible that the noise correlations in the rest of the population are different.

      For this study, since our aim was to interrogate the influence of noise correlations on stimulus azimuth encoding by DCIC populations, we focused on the noise correlations from the top ranked unit subpopulation, which likely carries the bulk of the sound location information. Noise correlations can be defined as correlation in the trial-to-trial response variation of neurons. In this respect, it is hard to ascertain whether the units in the rest of the population, which are not among the top ranked units, are really responding and showing response variation that would allow this correlation to be evaluated, or are simply not responding at all and show unrelated activity altogether. This makes observations about noise correlations from “the rest of the population” potentially hard to interpret.

      It would also be worth digging into the noise correlations more - are units positively correlated because they respond together (e.g., if unit x responds on trial 1 so does unit y) or are they also modulated around their mean rates on similar trials (e.g., unit x and y respond and both are responding more than their mean response rate)? A large portion of trials with no response can occlude noise correlations. More transparency around the response properties of these populations would be welcome.

      Due to the limited number of azimuth trial repetitions collected, to evaluate noise correlations we used the nonparametric Kendall tau correlation coefficient, which is a measure of pairwise rank correlation or ordinal association in the responses to each azimuth. Positive rank correlation would represent neurons more likely responding together. Evaluating response modulation “around their mean rates on similar trials” would require assumptions about the response distributions, which we avoided due to the potential biases associated with limited sample sizes.
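
      A rank-based noise correlation of this kind can be sketched as follows. This is a minimal illustration, not the authors' implementation: the tau-b variant (which corrects for ties), the averaging of per-azimuth coefficients, the simulated data, and the function names are all assumptions introduced here.

```python
import numpy as np

def kendall_tau_b(a, b):
    """Kendall tau-b rank correlation between two response vectors."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    i, j = np.triu_indices(a.size, k=1)     # all trial pairs
    sign_a = np.sign(a[i] - a[j])
    sign_b = np.sign(b[i] - b[j])
    concordant_minus_discordant = np.sum(sign_a * sign_b)
    # Pairs tied in a (or in b) are excluded from the normalisation.
    denom = np.sqrt(np.sum(sign_a != 0) * np.sum(sign_b != 0))
    return concordant_minus_discordant / denom if denom > 0 else np.nan

def noise_correlation(X, y, unit_u, unit_v):
    """Average the per-azimuth tau between two units across azimuths."""
    taus = [
        kendall_tau_b(X[y == label, unit_u], X[y == label, unit_v])
        for label in np.unique(y)
    ]
    return np.nanmean(taus)

# Hypothetical data: 7 azimuths x 14 trials, 4 units sharing a noise source.
rng = np.random.default_rng(2)
y = np.repeat(np.arange(-90, 91, 30), 14)
shared = rng.normal(size=(y.size, 1))
X = (y[:, None] / 90.0) + 0.7 * shared + 0.7 * rng.normal(size=(y.size, 4))

nc = noise_correlation(X, y, 0, 1)
print(nc)
```

      Because the coefficient is computed within each azimuth, differences in mean response between azimuths (the signal) do not contribute to it; only the ordering of responses across repeated trials of the same stimulus does.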

      It is largely unclear what the DCIC is encoding. Although the authors are interested in azimuth, sound location seems to be only a small part of DCIC responses. The authors report responses during inter-sound interval and unreliable sound-evoked responses. Although they have video of the head during recording, we only see a correlation to snout and ear movements (which are peculiar since in the example shown it seems the head movements predict the sound presentation). Additional correlates could be eye movements or pupil size. Eye movements are of particular interest due to their known interaction with IC responses - especially if the DCIC encodes sound location in relation to eye position instead of head position (though much of the eye-position-IC work was done in primates and not rodents). Alternatively, much of the population may only encode sound location if an animal is engaged in a localization task. Ideally, the authors could perform more substantive analyses to determine if this population is truly noisy or if the DCIC is integrating un-analyzed signals.

      We unsuccessfully attempted eye tracking and pupillometry in our videos. We suspect that the reason is a generally overly dilated pupil caused by the low visible-light illumination conditions we used, which were necessary to protect the PMT of our custom scope.

      It is likely that DCIC population activity is integrating un-analyzed signals, like the signal associated with spontaneous behaviors including face movements (Stringer et al., 2019), which we observed at the level of spontaneous snout movements. However, investigating if and how these signals are integrated into stimulus azimuth coding requires extensive behavioral testing and experimentation, which is out of the scope of this study. For the purpose of our study, we referred to trial-to-trial response variation as neuronal noise. We note that this definition of neuronal noise can, and likely does, include an influence from un-analyzed signals like the ones from spontaneous behaviors.

      Although this critique is ubiquitous among decoding papers in the absence of behavioral or causal perturbations, it is unclear what - if any - role the decoded information may play in neuronal computations. The interpretation of the decoder means that there is some extractable information concerning sound azimuth - but not if it is functional. This information may just be epiphenomenal, leaking in from inputs, and not used in computation or relayed to downstream structures. This should be kept in mind when the authors suggest their findings implicate the DCIC functionally in sound localization.

      Our study builds upon previous reports by other independent groups relying on “causal and behavioral perturbations” and implicating DCIC in sound location learning induced experience dependent plasticity (Bajo et al., 2019, 2010; Bajo and King, 2012), which altogether argues in favor of DCIC functionality in sound localization.

      Nevertheless, we clarified in the discussion of the revised manuscript that a relationship between the observed decoding error and the psychophysical performance, or the ability of the DCIC network to perform Bayesian decoding computations, both remain to be determined (please see responses to Reviewer #2).

      It is unclear why positive noise correlations amongst similarly tuned neurons would improve decoding. A toy model exploring how positive noise correlations in conjunction with unreliable units that inconsistently respond may anchor these findings in an interpretable way. It seems plausible that inconsistent responses would benefit from strong noise correlations, simply by units responding together. This would predict that shuffling would impair performance because you would then be sampling from trials in which some units respond, and trials in which some units do not respond - and may predict a bimodal performance distribution in which some trials decode well (when the units respond) and poor performance (when the units do not respond).

      In samples with more than 2 dimensions, the relationship between signal and noise correlations is more complex than in two-dimensional samples (Montijn et al., 2016), which makes constructing interpretable and simple toy models of this challenging. Montijn et al. (2016) provide a detailed characterization and model describing how the accuracy of a multidimensional population code can improve when including “positive noise correlations amongst similarly tuned neurons”. Unfortunately, we could not successfully test their model based on Mahalanobis distances, as we could not verify that the recorded DCIC population responses followed a multivariate Gaussian distribution, due to the limited azimuth trial repetitions we could sample.

      Significance: Boffi and colleagues set out to parse the azimuthal information available in the DCIC on a single trial. They largely accomplish this goal and are able to extract this information when allowing the units that contain more information about sound location to contribute to their decoding (e.g., through PCA or decoding on top unit activity specifically). The dataset will be of value to those interested in the DCIC and also to anyone interested in the role of noise correlations in population coding. Although this work is a first step into parsing the information available in the DCIC, it remains difficult to interpret if/how this azimuthal information is used in localization behaviors of engaged mice.

    1. Author response:

      The following is the authors’ response to the previous reviews

      Reviewer #1 (Public review):

      Summary:

      Ma & Yang et al. report a new investigation aimed at elucidating one of the key nutrients S. Typhimurium (STM) utilizes within the nutrient-poor intracellular niche of the macrophage, focusing on the amino acid beta-alanine. From these data, the authors report that beta-alanine plays important roles in mediating STM infection and virulence. The authors employ a multidisciplinary approach that includes some mouse studies, and ultimately propose a mechanism by which panD, involved in β-Ala synthesis, mediates regulation of zinc homeostasis in Salmonella.

      Strengths and weaknesses:

      The results and model are adequately supported by the authors' data. Further work will need to be performed to learn whether the Zn2+ functions as proposed in their mechanism. By performing a small set of confirmatory experiments in S. Typhi, the authors provide some evidence of relevance to human infections.

      Impact:

      This work adds to the body of literature on the metabolic flexibility of Salmonella during infection that enables pathogenesis.

      Reviewer #1 (Recommendations for the authors):

      No further suggestions. The authors have adequately addressed my prior concerns through new data and revisions to the text.

      Thank you for considering this work. We appreciate your efforts in aiding us to improve our manuscript.

      Reviewer #3 (Public review):

      Summary:

      Salmonella is interesting due to its life within a compact compartment, known in the Salmonella field as the SCV (Salmonella-containing vacuole). The SCV is a tight-fitting vacuole in which the acquisition of nutrients is a key factor for Salmonella. Among many nutrients, the authors focussed on beta-alanine. It is also known from many other studies that Salmonella requires beta-alanine. The authors performed in vitro RAW macrophage infection assays and in vivo mouse infection assays to examine the life of Salmonella in the presence of beta-alanine. They concluded that beta-alanine modulates the expression of many genes, including zinc transporter genes, which are required for pathogenesis.

      Strengths:

      The authors made a couple of knockouts in Salmonella and performed transcriptomics to understand the global gene expression pattern.

      Weaknesses:

      (1) Transport of Beta-alanine to SCV is not yet elucidated. Is it possible to determine whether the Zn transporter is involved in B-alanine transport?

      Thank you for the comment. Following your suggestion, we investigated the growth of Salmonella WT and the ∆znuA mutant cultured in N-minimal and M9 minimal medium, with β-alanine as the sole carbon source. We observed no significant difference in growth kinetics between the ∆znuA mutant and WT strain under either culture condition (please refer to Author response image 1). The results indicate that ZnuA is not involved in β-alanine transport in Salmonella.

      Author response image 1.

      (2) Beta-alanine can also be shuttled to form carnosine along with histidine. If beta-alanine is channelled to make more carnosine, then the virulence phenotypes may be very different.

      Our study reveals that β-alanine availability, whether obtained from the host or synthesized de novo via the panD-dependent pathway, is important for Salmonella pathogenesis. We have shown that β-alanine influences Salmonella intracellular replication and in vivo virulence partly by enhancing the expression of the zinc transporter genes.

      Although β-alanine can also be shuttled to form carnosine along with histidine in animals, the Salmonella genome lacks canonical carnosine synthase (CARNS) orthologs that catalyze the condensation of β-alanine and histidine into carnosine. Therefore, we believe that the carnosine biosynthetic pathway does not influence the virulence phenotypes of Salmonella.

      (3) Some amino acid transporters can be knocked out to see if beta-alanine uptake is perturbed. For example, ArgT transports arginine, and its mutation might perturb the uptake of beta-alanine. What is the beta-alanine concentration in the SCV? SCVs can be purified at different time points, and the beta-alanine concentration can be measured.

      Thank you for the comment. As suggested, we have investigated the role of other amino acid transporters in the uptake of β-alanine. In E. coli, GabP transports γ-aminobutyric acid (GABA), a structural analogue of β-alanine, and may also transport β-alanine (J Bacteriol. 2021, 203(4):e00642-20). Nevertheless, the Salmonella gabP mutant displayed no growth defect in minimal medium with β-alanine as the sole carbon source (Figure 1-figure supplement 7, Figure 1-figure supplement 8), indicating that GabP is not involved in β-alanine uptake in Salmonella. Strikingly, the ΔargT mutant, which is defective in arginine uptake, showed markedly decreased growth in minimal medium with β-alanine as the sole carbon source (Figure 1F), suggesting that ArgT also transports β-alanine in Salmonella. We have added these results to the revised manuscript (lines 167-179).

      It has been reported that ArgT is essential for Salmonella replication within macrophages and full virulence in vivo (PloS one. 2010, 5(12):e15466). Given that ArgT is involved in both arginine and β-alanine uptake (as verified in this study), whether the attenuated virulence of the ∆argT mutant is due to a deficiency in β-alanine or arginine requires further investigation. We have also included a discussion on this issue (lines 409-415).

      In this work, to avoid delays and alterations in metabolite concentrations during the isolation of bacterial contents from macrophages, we directly assessed the combined metabolite concentrations within infected cells and Salmonella. It has been previously verified that these metabolites are primarily of host origin (Nat Commun. 2021, 12(1):879). We noted a decrease in β-alanine levels in macrophages infected with Salmonella. Isolating SCVs is an intricate process that involves dissociation and sonication (Nat Commun. 2018, 9(1):2091), and these steps may alter metabolite concentrations during the separation procedure. Therefore, we did not measure the β-alanine concentration in the SCV.

      Reviewer #3 (Recommendations for the authors):

      The Authors have done meticulous experiments to address the questions asked by the reviewers. My one question of beta-alanine transport inside the SCV remains undone, though the authors have tried.

      Was Zinc transporter mutant checked? It is possible that the Zn transporter can take up Beta-alanine.

      Thank you for the comment. Following your suggestion, we investigated the growth of Salmonella WT and the ∆znuA mutant cultured in N-minimal and M9 minimal medium, with β-alanine as the sole carbon source. We observed no significant difference in growth kinetics between the ∆znuA mutant and WT strain under either culture condition (please refer to Author response image 1). The results indicate that ZnuA is not involved in β-alanine transport in Salmonella.

      Additionally, we have investigated the role of other amino acid transporters in the uptake of β-alanine and have ultimately identified that ArgT, the arginine transporter, is involved in the uptake of β-alanine in Salmonella (please refer to our previous response).

    1. Author Response:

      The following is the authors’ response to the original reviews.

      eLife assessment

      This study presents potentially useful findings describing how activity in the corticotropin-releasing hormone neurons in the paraventricular nucleus of the hypothalamus modulates sevoflurane anesthesia, as well as a phenomenon the authors term a "general anesthetic stress response". The technical approaches are solid and the data presented are largely clear. However, the primary conclusion, that the PVHCRH neurons are a mechanism of sevoflurane anesthesia, is inadequately supported.

      We thank the editors and reviewers for their thorough assessment and constructive feedback. We have provided clarifications and updated the manuscript to better interpret our results; please see below. We have revised the primary conclusion to state that PVH CRH neurons potently modulate states of anaesthesia in sevoflurane general anesthesia, being a part of the anaesthesia regulatory network of sevoflurane.

      Combined Public Review:

      This study describes a group of CRH-releasing neurons, located in the paraventricular nucleus of the hypothalamus, which, in mice, affects both the state of sevoflurane anesthesia and a grooming behavior observed after it. PVH-CRH neurons showed elevated calcium activity during the post-anesthesia period. Optogenetic activation of these PVH-CRH neurons during sevoflurane anesthesia shifts the EEG from burst-suppression to a seemingly activated state (an apparent arousal effect), although without a behavioral correlate. Chemogenetic activation of the PVH-CRH neurons delays sevoflurane-induced loss of righting reflex (another apparent arousal effect). On the other hand, chemogenetic inhibition of PVH-CRH neurons delays recovery of the righting reflex and decreases sevoflurane-induced stress (an apparent decrease in the arousal effect). The authors conclude that PVH-CRH neurons are a common substrate for sevoflurane-induced anesthesia and stress. The PVH-CRH neurons are related to behavioral stress responses, and the authors claim that these findings provide direct evidence for a relationship between sevoflurane anesthesia and sevoflurane-mediated stress that might exist even when there is no surgical trauma, such as an incision. In its current form, the article does not achieve its intended goal.

      Thank you for the detailed review. We have carefully considered your comments and have revised the manuscript to provide a clearer interpretation of our findings. Our findings indicate that PVH CRH neurons integrate the anesthetic effect and post-anesthesia stress response of sevoflurane GA, providing new evidence for understanding the neuronal regulation of sevoflurane GA and identifying a potential brain target for further investigation into modulating the post-anesthesia stress response. However, we did not propose that there was a direct relationship between sevoflurane anesthesia and sevoflurane-mediated stress in the absence of incision. Rather, our results point to an important but previously overlooked potential cause of the post-anesthesia stress response.

      Strengths:

      The manuscript uses targeted manipulation of the PVH-CRH neurons, and is technically sound. Also, the number of experiments is substantial.

      Thank you.

      Weaknesses:

      The most significant weaknesses are a) the lack of consideration and measurement of GABAergic mechanisms of sevoflurane anesthesia, b) the failure to use another anesthetic as a control, c) a failure to document a compelling post-anesthesia stress response to sevoflurane in humans, d) limitations in the novelty of the findings. These weaknesses are related to the primary concerns described below:

      Concerns about the primary conclusion, that PVH-CRH neurons mediate "the anesthetic effects and post-anesthesia stress response of sevoflurane GA".

      Thanks for the advice. Our responses are as below:

      1) Just because the activity of a given neural cell type or neural circuit alters an anesthetic's response, this does not mean that those neurons play a role in how the anesthetic creates its anesthetic state. For example, sevoflurane is commonly used in children. Its primary mechanism of action is through enhancement of GABA-mediated inhibition. Children with ADHD on Ritalin (a dopamine reuptake inhibitor) who take it on the day of surgery can often require increased doses of sevoflurane to achieve the appropriate anesthetic state. The mesocortical pathway through which Ritalin acts is not part of the mechanism of action of sevoflurane. Through this pathway, Ritalin is simply increasing cortical excitability making it more challenging for the inhibitory effects of sevoflurane at GABAergic synapses to be effective. Similarly, here, altering the activity of the PVHCRH neurons and seeing a change in anesthetic response to sevoflurane does not mean that these neurons play a role in the fundamental mechanism of this anesthetic's action. With the current data set, the primary conclusions should be tempered.

      Thank you for your comments. Our results show that PVH CRH neurons modulate the state of consciousness as well as the stress response in sevoflurane GA, but they are insufficient to demonstrate that these neurons play a role in the underlying mechanism of sevoflurane anesthesia. We have revised our conclusions accordingly. The primary conclusion now states that PVH CRH neurons potently modulate states of anaesthesia in sevoflurane GA, being a part of the anaesthesia regulatory network of sevoflurane.

      2) It is important to compare the effects of sevoflurane with at least one other inhaled ether anesthetic. Isoflurane, desflurane, and enflurane are ether anesthetics that are very similar to each other, as well as being similar to sevoflurane. It is important to distinguish whether the effects of sevoflurane pertain to other anesthetics, or, alternatively, relate to unique idiosyncratic properties of this gas that may not be a part of its anesthetic properties.

      For example, one study cited by the authors (Marana et al., 2013) concludes that there is weak evidence for differences in stress-related hormones between sevoflurane and desflurane, with lower levels of cortisol and ACTH observed during the desflurane intraoperative period. It is not clear that this difference in some stress-related hormones is modeled by post-sevoflurane excess grooming in the mice, but using desflurane as a control could help determine this.

      Thank you for your suggestions. We completely agree on the importance of determining whether the effects of sevoflurane apply to other anesthetics or arise from unique idiosyncratic attributes separate from its anesthetic properties. However, it is challenging to definitively conclude whether the effects of sevoflurane observed in our study extend to other inhaled anesthetics, even with desflurane as a control. While sevoflurane shares many common anesthetic properties with other inhalation agents, it also exhibits distinct characteristics and potential idiosyncrasies that set it apart from its counterparts. Regarding studies related to desflurane's impact on hormone levels or stress-like behaviors, one study involving 20 women scheduled for elective total abdominal hysterectomy demonstrated that there was no significant correlation between the intra-operative depth of anesthesia achieved with desflurane and the extent of the endocrine-metabolic stress response (as indicated by the concentrations of plasma cortisol, glucose, and lactate) [1]. In addition, a study conducted in mice suggested that sensorimotor function, anxiety, and depression did not change significantly 7 days after anesthesia administered with 8.0% desflurane for 6 h [2]. Furthermore, a study involving 50 Caucasian women undergoing laparoscopic surgery for benign ovarian cysts demonstrated that in low-stress surgery, desflurane, when compared to sevoflurane, exhibited superior control over the intraoperative cortisol and ACTH response [3]. Based on these findings, we propose that the effect we observed in this study is likely attributable to the unique idiosyncratic properties of sevoflurane. We will conduct additional experiments to investigate this proposal with other commonly used anaesthetics in our future studies.

      Concerns about the clinical relevance of the experiments

      In anesthesiology practice, perioperative stress observed in patients is more commonly related to the trauma of the surgical intervention, with inadequate levels of antinociception or unconsciousness intraoperatively and/or poor post-operative pain control. The authors seem to be suggesting that the anesthetic itself is causing stress, but there is no evidence of this from human patients cited. We were not aware that this is a documented clinical phenomenon. It is important to know whether sevoflurane effectively produces behavioral stress in the recovery room in patients that could be related to the putative stress response (excess grooming) observed in mice. For example, in surgeries or procedures that required only a brief period of unconsciousness that could be achieved by administering sevoflurane alone (comparable to the 30 min administered to the mice), is there clinical evidence of post-operative stress?

      Thank you for your question. There is currently no direct evidence available. Studies on sevoflurane in humans primarily focus on its use during surgical interventions, making it difficult to find studies that administer sevoflurane alone, as was done in our study with mice. Generally, a short anesthesia time refers to procedures that last less than one hour, while a long anesthesia time refers to procedures lasting several hours or more [4]. A study published in eLife investigated the patterns of reemerging consciousness and cognitive function in 30 healthy adults who underwent GA for three hours [5]. Its findings suggest that the cognitive dysfunction observed immediately and persistently after GA in healthy animals may not necessarily apply to humans, and that postoperative neurocognitive disorders could be influenced by factors other than GA, such as surgery or patient comorbidity. Therefore, further studies are needed to verify post-operative stress after short, sevoflurane-only anesthesia.

      Indeed, stress after surgery can result from multiple factors aside from anesthesia, including pain, anxiety, and inflammation. What we want to illustrate in this study is that anesthesia could be one of these factors, and one that was overlooked in previous studies. In our current study, we did not propose that there was a direct relationship between sevoflurane anesthesia and sevoflurane-mediated stress without incision. We observed stress-related behavioural changes after exposure to sevoflurane GA in a mouse model, indicating that sevoflurane-mediated stress might exist without surgical trauma. Importantly, whether anesthetic administration alone causes post-operative stress is worth studying in different species, especially humans.

      Patients who receive sevoflurane as the primary anesthetic do not wake up more stressed than if they had had one of the other GABAergic anesthetics. If there were signs of stress upon emergence (increased heart rate, blood pressure, thrashing movements) from general anesthesia, the anesthesiologist would treat this right away. The most likely cause of post-operative stress behaviors in humans is probably inadequate anti-nociception during the procedure, which translates into inadequate post-op analgesia and likely delirium. It is the case that children receiving sevoflurane do have a higher likelihood of post-operative delirium. Perhaps the authors' studies address a mechanism for delirium associated with sevoflurane, but this is not considered. Delirium seems likely to be the closest clinical phenomenon to what was studied.

      We agree with your idea. We aim to establish a connection between post-operative delirium in humans and stress-like behaviors observed in mice following sevoflurane anesthesia. Specifically, we have observed that the increased grooming behavior exhibited by mice after sevoflurane anesthesia resembles the fuzzy state of consciousness experienced during post-operative delirium [6]. In our discussion, we also emphasized the occurrence of sevoflurane-induced emergence agitation, a common phenomenon reported in clinical studies with an incidence of up to 80%. This state is characterized by hyperactivity, confusion, delirium, and emotional agitation [7,8]. Meanwhile, in the open field test (OFT) and elevated plus maze (EPM) test, we observed that mice exposed to sevoflurane inhalation displayed reduced movement distances (Figure 7G and I). These findings suggest a decline in behavioral activity similar to what is observed in cases of delirium.

      Concerns about the novelty of the findings

      CRH is associated with arousal in numerous studies. In fact, the authors' own work, published in eLife in 2021, showed that stimulating the hypothalamic CRH cells leads to arousal and their inhibition promotes hypersomnia. In both papers, the authors use fos expression in CRH cells during a specific event to implicate the cells, then manipulate them and measure EEG responses. In the previous work, the cells were active during wakefulness; here- they were active in the awake state that follows anesthesia (Figure 1). Thus, the findings in the current work are incremental.

      Thank you for acknowledging our previous work focusing on the changes in the sleep-wake state of mice when PVH CRH neurons are manipulated. In this study, our primary objective was to identify the neuronal mechanisms mediating the anesthetic effects and post-anesthetic stress response of sevoflurane GA. While our previous study showed that activation of PVH CRH neurons leads to arousal, the current study provides evidence that PVH CRH neurons may play a role in the regulation of conscious states in GA. Our current findings show that PVH CRH neurons modulate the state of consciousness as well as the stress response in sevoflurane GA, and that the modulation of PVH CRH neurons bidirectionally alters the induction and recovery of sevoflurane GA. This identifies a new brain region involved in sevoflurane GA that goes beyond the arousal-related regions.

      The activation of CRH cells in PVN has already been shown to result in grooming by Jaideep Bains (cited as reference 58). Thus, the involvement of these cells in this behavior is expected. The authors perform elaborate manipulations of CRH cells and numerous analyses of grooming and related behaviors. For example, they compare grooming and paw licking after anesthesia with those after other stressors such as forced swim, spraying mice with water, physical attack, and restraint. However, the relevance of these behaviors to humans and generalization to other types of anesthetics is not clear.

      The hyperactivity of PVH CRH neurons and the associated behavior (e.g., excessive self-grooming) in mice may partially mirror the agitation observed during emergence from sevoflurane GA in patients, as well as its underlying mechanisms. As mentioned in the Discussion section (page 16, lines 371-374), sevoflurane-induced emergence agitation represents a prevalent manifestation of the post-anesthesia stress response. It is frequently observed, with an incidence of up to 80% in clinical reports, and is characterized by hyperactivity, confusion, delirium, and emotional agitation [7,8]. Our aim in this study is to distinguish the excessive stress responses of patients to sevoflurane GA from stress triggered by other factors. Other stimuli, such as forced swimming, can be considered sources of both physical and emotional stress, which are associated with depression and anxiety in humans.

      Regarding generalization to other types of anesthetics, we propose that the stress-related behavioral effects observed in this study might also occur with certain other anesthetics. For example, one study showed that intravenous ketamine infusion (10 mg/kg, 2 hours) elevated plasma corticosterone and progesterone levels in rats, reducing locomotor activity (sedation) [9]. Intravenous anesthesia with propofol combined with sevoflurane caused greater postoperative stress than propofol alone [10]. However, desflurane, a common inhaled ether anesthetic, when compared to sevoflurane, was associated with better control of the intraoperative cortisol and ACTH response in low-stress surgeries [8]. Thus, the behaviors observed after exposure to sevoflurane GA may be related to the post-anesthesia stress response in humans, which might also occur with certain other anesthetics.

      Recommendations for the authors:

      Reviewer 1

      1) The CRH-Cre mouse line should be validated. There are several lines of these mice, and their fidelity varies.

      The CRH-Cre mouse line we used in this study is from The Jackson Laboratory (https://www.jax.org/strain/012704) with the name B6(Cg)-Crhtm1(cre)Zjh/J (Strain #: 012704). These CRH-ires-CRE knock-in mice have Cre recombinase expression directed to CRH positive neurons by the endogenous promoter/enhancer elements of the corticotropin releasing hormone locus (Crh). We have done standard PCR to validate the mouse line following genotyping protocols provided by the Jackson Laboratory. The protocol primers were: 10574 (SEQUENCE 5' → 3': CTT ACA CAT TTC GTC CTA GCC); 10575 (SEQUENCE 5' → 3': CAC GAC CAG GCT GCG GCT AAC); 10576 (SEQUENCE 5' → 3': CAA TGT ATC TTA TCA TGT CTG GAT CC). The 468-bp CRH-specific PCR product was amplified in mutant (CRH-Cre+/+) mice; in heterozygote (CRH-Cre+/-) mice, both the 468-bp and the 676-bp PCR products were detected; in wild type (WT) mice, only the 676-bp WT allele-specific PCR product was amplified. An example of PCR results is presented below. The heterozygote and mutant mice were included in our study.

      Author response image 1.
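
      The band-size rule described in the genotyping response above can be sketched in Python as follows. This is only an illustrative sketch and not part of the authors' protocol: the function names and the 20-bp matching tolerance are assumptions, while the 468-bp and 676-bp product sizes and the inclusion of heterozygote and mutant mice come from the response text.

```python
# Product sizes (bp) as stated in the response; everything else is hypothetical.
MUTANT_BP = 468  # CRH-Cre-specific PCR product
WT_BP = 676      # wild-type allele-specific PCR product


def call_genotype(band_sizes_bp, tolerance_bp=20):
    """Classify one gel lane from its observed band sizes.

    tolerance_bp is an assumed allowance for reading band sizes off a gel.
    """
    has_mutant = any(abs(b - MUTANT_BP) <= tolerance_bp for b in band_sizes_bp)
    has_wt = any(abs(b - WT_BP) <= tolerance_bp for b in band_sizes_bp)
    if has_mutant and has_wt:
        return "CRH-Cre+/-"  # heterozygote: both products detected
    if has_mutant:
        return "CRH-Cre+/+"  # mutant: only the 468-bp product
    if has_wt:
        return "WT"          # wild type: only the 676-bp product
    return "undetermined"    # no expected band, e.g. a failed reaction


def included_in_study(genotype):
    """Heterozygote and mutant mice were included, per the response."""
    return genotype in {"CRH-Cre+/+", "CRH-Cre+/-"}
```

      For instance, a lane showing bands near 468 bp and 676 bp would be called as a heterozygote and included, whereas a lane with only the 676-bp band would be called WT and excluded.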

      2) It would be very helpful to validate the CRH antibody. Using any antiserum at 1:800 suggests that it may not be potent or highly specific.

      As requested, we used the same CRH antibody at a concentration of 1:800, following the methods described in the Method section. The results are displayed below.

      Author response image 2.

      3) In Figure 1C, the control sections are out of focus and the cells are blurry, reducing confidence in the analyses (locus ceruleus cells appear confluent in the control?)

      We apologize for the confusing figure. We have revised the control section of Figure 1C:

      Author response image 3.

      Reviewer 2

      1) In the Abstract, to say that "General anesthetics benefit patients undergoing surgeries without consciousness. ..." is a gross understatement of the essential role that general anesthesia plays today to make surgery not only tolerable but humane. This opening sentence should be rewritten. General anesthesia is a fundamental process required to undertake safely and humanely a high fraction of surgeries and invasive diagnostic procedures.

      As requested, we have rewritten this opening sentence as follows:

      GA is a fundamental process required to undertake surgeries and invasive diagnostic procedures safely and humanely. However, the undesired stress response associated with GA can lead to delayed recovery and even increased morbidity in clinical settings.

      2) In the Abstract, when discussing the response of the PVN-CRH neurons to chemogenetic inhibition, say exactly what the "opposite effect" is.

      Thanks for your insights. We have rewritten our abstract as follows:

      Chemogenetic activation of these neurons delayed the induction and accelerated emergence from sevoflurane GA, whereas chemogenetic inhibition of PVH CRH neurons promoted induction and prolonged emergence from sevoflurane GA.

      3) In all spectrograms the dynamic range is compressed between 0.5 and 1. Please make use of the full range, as some details might be missed because of this compression.

      We apologize for the incorrect unit in the spectrograms. We have provided corrected versions that use the full range; please see below:

      Author response image 4.

      Author response image 5.

      4) The spectrogram in Figure 2D has several frequency chirps that do not seem physiological.

      Thank you for your comments. The frequency chirps in the spectrogram during the During and Post 1 phases were caused by recording noise. To avoid confusion, we have deleted the spectrogram in Figure 2D.

      5) The 3D plots in Figures 3G and H are not helpful.

      Thanks for the comment. We would like to keep the 3D plots, as they aid visual comparison of three different features of grooming and complement the other panels in Figure 3.

      6) The spectrograms in Figures 5A and B are too small, while the spectra in Figures 5C and D are too large. Please invert this relationship, as it is interesting and important to see the details in the spectrograms. The same happens in Figure 6.

      We have adjusted the layout of Figures 5 and 6 as requested; please see below:

      Author response image 6.

      Author response image 7.

      7) In Figure 6H, the authors compute the burst-suppression ratio during a period that seemingly has no bursts or suppressions (Figure 6B).

      We apologize for the confusion. The burst-suppression ratio was computed with the minimum duration of burst and suppression periods set at 0.5 s. We have added a new supplementary figure (Figure 6-figure supplement 8) displaying a 40-second EEG segment with a burst-suppression period to better visualize the burst suppression.

      Author response image 8.
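
      To make the 0.5-s criterion above concrete, the following Python sketch shows one way a burst-suppression ratio could be computed from a single EEG trace. It is not the authors' analysis code: the amplitude threshold, the function names, and the order in which short runs are relabelled are all assumptions; only the 0.5-s minimum duration comes from the response.

```python
from itertools import groupby


def drop_short_runs(labels, value, min_len):
    """Flip every run of `value` shorter than `min_len` samples to the opposite label."""
    out = []
    for label, run in groupby(labels):
        n = len(list(run))
        if label == value and n < min_len:
            label = not label
        out.extend([label] * n)
    return out


def burst_suppression_ratio(eeg_uv, fs_hz, threshold_uv=5.0, min_duration_s=0.5):
    """Fraction of samples labelled as suppression.

    threshold_uv is a hypothetical amplitude cut-off; min_duration_s is the
    0.5-s minimum for both burst and suppression periods.
    """
    if not eeg_uv:
        raise ValueError("empty EEG segment")
    min_len = int(round(min_duration_s * fs_hz))
    # True marks a low-amplitude (candidate suppression) sample.
    suppressed = [abs(x) < threshold_uv for x in eeg_uv]
    # Assumed order: absorb too-short bursts first, then discard too-short suppressions.
    suppressed = drop_short_runs(suppressed, False, min_len)
    suppressed = drop_short_runs(suppressed, True, min_len)
    return sum(suppressed) / len(suppressed)
```

      Under these assumptions, a brief high-amplitude transient inside a long quiet period does not split the suppression, and an isolated quiet stretch shorter than 0.5 s is not counted as suppression.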

      8) The data analyses are done in terms of p-values. They should be reported as confidence intervals so that any effect the authors wish to establish is measured along with its uncertainty.

      Thank you for your valuable suggestions regarding our manuscript. We understand your concern, but we would like to explain why we believe reporting p-values is appropriate for our study. First, the use of p-values for hypothesis testing and significance assessment is common practice in our field, and many previous studies in our area of research report results in terms of p-values. For example, Xu et al. [11] reported in 2020 that sevoflurane inhibits MPB neurons through postsynaptic GABAA-Rs and background potassium channels, and Ao et al. [12] demonstrated that activation of the TH:LC-PVT projections helps facilitate the transition from isoflurane anesthesia to an arousal state; both studies reported p-values. By adhering to this convention, we ensure that our findings are consistent with the existing body of literature, which makes it easier for readers to compare and integrate our results with previous work. Second, while confidence intervals can provide a measure of effect size and uncertainty, p-values offer a concise way to communicate statistical significance. They help readers quickly assess whether an effect is statistically significant, which is often the primary concern when interpreting research findings. We hope that these reasons address your concern while maintaining the integrity and consistency of our study. If you believe there are specific instances where reporting confidence intervals would be more informative, please feel free to highlight them, and we will consider your suggestion on a case-by-case basis.
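
      For readers weighing the two reporting styles discussed above, the following Python sketch shows how an effect estimate can be reported together with a confidence interval. It is purely illustrative and is not an analysis performed in the study: it uses a percentile bootstrap for a difference in group means, and the function name, resample count, seed, and example values are all assumptions.

```python
import random
from statistics import mean


def bootstrap_ci_mean_diff(group_a, group_b, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile-bootstrap confidence interval for mean(group_a) - mean(group_b).

    Returns (observed_difference, lower_bound, upper_bound).
    """
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    diffs = []
    for _ in range(n_resamples):
        # Resample each group with replacement, keeping its original size.
        resample_a = rng.choices(group_a, k=len(group_a))
        resample_b = rng.choices(group_b, k=len(group_b))
        diffs.append(mean(resample_a) - mean(resample_b))
    diffs.sort()
    lower = diffs[int((alpha / 2) * n_resamples)]
    upper = diffs[int((1 - alpha / 2) * n_resamples) - 1]
    return mean(group_a) - mean(group_b), lower, upper


# Hypothetical values, e.g. two groups of five measurements each.
estimate, low, high = bootstrap_ci_mean_diff([10, 11, 12, 13, 14], [1, 2, 3, 4, 5])
print(f"difference = {estimate}, 95% CI = [{low}, {high}]")
```

      An interval that excludes zero conveys the same qualitative message as a significant p-value, while also showing the size of the effect and how precisely it is estimated.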

      References

      1. Baldini, G., Bagry, H. & Carli, F. Depth of anesthesia with desflurane does not influence the endocrine-metabolic response to pelvic surgery. Acta Anaesthesiol Scand 52, 99-105, doi:10.1111/j.1399-6576.2007.01470.x (2008).
      2. Niikura, R. et al. Exploratory analyses of postanesthetic effects of desflurane using behavioral test battery of mice. Behav Pharmacol 31, 597-609, doi:10.1097/fbp.0000000000000567 (2020).
      3. Marana, E. et al. Desflurane versus sevoflurane: a comparison on stress response. Minerva Anestesiol 79, 7-14 (2013).
      4. Vutskits, L. & Xie, Z. Lasting impact of general anaesthesia on the brain: mechanisms and relevance. Nat Rev Neurosci 17, 705-717, doi:10.1038/nrn.2016.128 (2016).
      5. Mashour, G. A. et al. Recovery of consciousness and cognition after general anesthesia in humans. eLife 10, doi:10.7554/eLife.59525 (2021).
      6. Mattison, M. L. P. Delirium. Ann Intern Med 173, ITC49-ITC64, doi:10.7326/aitc202010060 (2020).
      7. Dahmani, S. et al. Pharmacological prevention of sevoflurane- and desflurane-related emergence agitation in children: a meta-analysis of published studies. Br J Anaesth 104, 216-223, doi:10.1093/bja/aep376 (2010).
      8. Lim, B. G. et al. Comparison of the incidence of emergence agitation and emergence times between desflurane and sevoflurane anesthesia in children: A systematic review and meta-analysis. Medicine (Baltimore) 95, e4927, doi:10.1097/MD.0000000000004927 (2016).
      9. Radford, K. D. et al. Association between intravenous ketamine-induced stress hormone levels and long-term fear memory renewal in Sprague-Dawley rats. Behav Brain Res 378, 112259, doi:10.1016/j.bbr.2019.112259 (2020).
      10. Yang, L., Chen, Z. & Xiang, D. Effects of intravenous anesthesia with sevoflurane combined with propofol on intraoperative hemodynamics, postoperative stress disorder and cognitive function in elderly patients undergoing laparoscopic surgery. Pak J Med Sci 38, 1938-1944, doi:10.12669/pjms.38.7.5763 (2022).
      11. Xu, W. et al. Sevoflurane depresses neurons in the medial parabrachial nucleus by potentiating postsynaptic GABA(A) receptors and background potassium channels. Neuropharmacology 181, 108249, doi:10.1016/j.neuropharm.2020.108249 (2020).
      12. Ao, Y. et al. Locus Coeruleus to Paraventricular Thalamus Projections Facilitate Emergence From Isoflurane Anesthesia in Mice. Front Pharmacol 12, 643172, doi:10.3389/fphar.2021.643172 (2021).
    1. Author response:

      The following is the authors’ response to the current reviews.

      eLife assessment:

      The manuscript establishes a sophisticated mouse model for acute retinal artery occlusion (RAO) by combining unilateral pterygopalatine ophthalmic artery occlusion (UPOAO) with a silicone wire embolus and carotid artery ligation, generating ischemia-reperfusion injury upon removal of the embolus. This clinically relevant model is useful for studying the cellular and molecular mechanisms of RAO. The data overall are solid, presenting a novel tool for screening pathogenic genes and promoting further therapeutic research in RAO.

      Thank you for your thorough evaluation. We are pleased that you find our mouse model for acute retinal artery occlusion to be sophisticated and clinically relevant. Your recognition of the model’s utility in studying the cellular and molecular mechanisms of RAO, as well as its potential for advancing therapeutic research, is highly encouraging and underscores the significance of our work. We are grateful for your supportive feedback.

      Public Reviews:

      Reviewer #1:

      Summary:

      Wang, Y. et al. used a silicone wire embolus to definitively and acutely occlude the pterygopalatine ophthalmic artery, in addition to carotid artery ligation, to completely block blood supply to the mouse inner retina, which mimics clinical acute retinal artery occlusion. A detailed characterization of this mouse model determined the time course of inner retina degeneration and associated functional deficits, which closely mimic those of human patients. Whole retina transcriptome profiling and comparison revealed distinct features associated with ischemia, reperfusion, and different model mechanisms. Interestingly and importantly, this team found a sequence of events, including reperfusion-induced leukocyte infiltration from blood vessels, residual microglial activation, and neuroinflammation, that may lead to neuronal cell death.

      Strengths:

      Clear demonstration of the surgery procedure with informative illustrations, images, and superb surgical videos.

      Two time points of ischemia and reperfusion were studied with convincing histological and in vivo data to demonstrate the time course of various changes in retinal neuronal cell survival, ERG functions, and inner/outer retina thickness.

      The transcriptome comparison among different retinal artery occlusion models provides informative evidence to differentiate these models.

      The potential applications of the in vivo retinal ischemia-reperfusion model and relevant readouts demonstrated by this study will certainly inspire further investigation of the dynamic morphological and functional changes of retinal neurons and glial cell responses during disease progression and before and after treatments.

      We sincerely appreciate your detailed and positive feedback. These evaluations are invaluable in highlighting the significance and impact of our work. Thank you for your thoughtful and supportive review.

      Weaknesses:

      The revised manuscript has been significantly improved in clarity and readability. It has addressed all my questions convincingly.

      Thank you for your positive feedback. We are pleased to hear that the revisions have significantly improved the manuscript's clarity and readability, and that we have convincingly addressed all your questions. Your encouraging words are of great importance to us.

      Reviewer #2 (Public Review):

      Summary:

      The authors of this manuscript aim to develop a novel animal model to accurately simulate the retinal ischemic process in retinal artery occlusion (RAO). A unilateral pterygopalatine ophthalmic artery occlusion (UPOAO) mouse model was established using silicone wire embolization combined with carotid artery ligation. This manuscript provided data to show the changes of major classes of retinal neural cells and visual dysfunction following various durations of ischemia (30 minutes and 60 minutes) and reperfusion (3 days and 7 days) after UPOAO. Additionally, transcriptomics was utilized to investigate the transcriptional changes and elucidate changes in the pathophysiological process in the UPOAO model post-ischemia and reperfusion. Furthermore, the authors compared transcriptomic differences between the UPOAO model and other retinal ischemic-reperfusion models, including HIOP and UCCAO, and revealed unique pathological processes.

      Strengths:

      The UPOAO model represents a novel approach for studying retinal artery occlusion. The study is very comprehensive.

      Thank you for your positive feedback. We are delighted that you find the UPOAO model to be a novel and comprehensive approach to studying retinal artery occlusion. Your recognition of the depth and significance of our study is highly valuable and encourages us in our ongoing research.

      Weaknesses:

      Originally, some statements were incorrect and confusing. However, the authors have made clarifications in the revised manuscript to avoid confusion.

      We sincerely appreciate your meticulous review of the manuscript. We have thoroughly addressed the inaccuracies identified in the revised version. Additionally, we have polished the article to ensure improved readability. We apologize for any confusion caused by these inaccuracies and genuinely appreciate your careful attention to detail; your patience and meticulous suggestions have significantly improved the clarity and readability of our manuscript.


      The following is the authors’ response to the original reviews.

      Recommendations for the authors:

      Reviewer #1:

      The revised manuscript has been significantly improved in clarity and readability. It has addressed all my questions convincingly.

      Thank you for your positive feedback. We are pleased to hear that the revisions have significantly improved the manuscript's clarity and readability, and that we have convincingly addressed all your questions. Your encouraging words are of great importance to us.

      Reviewer #2:

      The authors have revised the manuscript and/or provided answers to the majority of prior comments, which have helped to strengthen the work. However, addressing the following concerns is still necessary to further improve the manuscript.

      Thank you for acknowledging our revisions and the improvements made to the manuscript. We appreciate your continued feedback and will address the remaining concerns to further enhance the quality of our work.

      The quantification method of RGCs is described in detail in the response letter, but this detailed methodology was not included in the revised manuscript to clarify the quantification process.

      Thank you for your helpful recommendations. We have added the detailed methodology to the revised manuscript to clarify the quantification process (lines 180-188).

      The graphs in Fig. 3D b-wave and Fig. 3E-b wave are duplicated.

      We apologize for the error in our figures. We have corrected the mistake by replacing the duplicated image in Fig. 3E-b wave with the correct one (line 880). Your careful observation has been very helpful in improving our manuscript. Thank you for bringing this to our attention.

      The quantifications of the thickness of retinal layers in HE-stained sections in Figure 4 (IPL) and Response Figure 2 are incorrect. For mice retina, the thickness of the IPL is approximately 50 µm.

      Thank you for your meticulous review of the manuscript. We have rectified the inaccuracies in the quantification of retinal layer thickness in HE-stained sections in Figure 4, addressing the initial issue with the scale bar.

      We consulted with a microscope engineer and used a microscope microscale to calibrate the scale of the fluorescence microscope (BX63; Olympus, Tokyo, Japan) at the suggestion of the engineer.

      We re-measured the thickness of all layers of the HE-stained retinal sections (line 902). The inner retina thickness in Figure 4 has been adjusted under the new scale bar, and the thickness of the outer retinal layers is now displayed in Author response image 1. However, the IPL thickness of the sham eye in the UPOAO model still does not align with the commonly reported thickness of 50 µm. We therefore reviewed the literature from our laboratory, focusing on C57BL/6 mice from the same source, and found that the inner retina thickness (GCC+INL) in the HE-stained sections of the sham eye in the UPOAO model (around 80 µm) is consistent with previous findings (see Author response image 2) reported by Kaibao Ji and published in Experimental Eye Research in 2021 [1].

      We captured and analyzed the average retinal thickness of each layer over a long range of 200-1100 μm from the optic nerve head (see Author response image 3, highlighted by the green line). The field region has been corrected in the revised manuscript (line 232). Considering the significant variation in retinal thickness from the optic nerve to the periphery, we consulted literature on multi-point measurements of HE-stained retinas. The average thickness of the GCC layer in the control group was approximately 57 µm at 600 µm from the optic nerve head and about 48 µm at 1200 µm from the optic nerve head in the literature [2] (see Author response image 4). The GCC layer thickness of the sham eye in the UPOAO model is around 50 µm, in alignment with existing literature. In future studies, we will pay more attention to the issue of thickness averaging.

      We appreciate your thorough review and valuable feedback, which has enabled us to correct errors and enhance the accuracy of our research.

      Author response image 1.

      Thickness of OPL, ONL, IS/OS+RPE in HE staining. n=3; ns: no significance (p>0.05).

      Author response image 2.

      Cited from Ji, K., et al., Resveratrol attenuates retinal ganglion cell loss in a mouse model of retinal ischemia reperfusion injury via multiple pathways. Experimental Eye Research, 2021. 209: p. 108683.

      Author response image 3.

      Schematic diagram illustrating the selection of regions. The figure was captured using a fluorescence microscope (BX63; Olympus, Tokyo, Japan) under a 4X objective. Scale bar=500 µm.

      Author response image 4.

      Cited from Feng, L., et al., Ripa-56 protects retinal ganglion cells in glutamate-induced retinal excitotoxic model of glaucoma. Sci Rep, 2024. 14(1): p. 3834.

      There are some typos in the summary table. For example: 'Amplitudes of a-wave (0.3, 2.0, and 10.0 cd.s/m²)' should be 'Amplitudes of a-wave (0.3, 3.0, and 10.0 cd.s/m²)'; and 'IINL thickness' in HE should be 'INL thickness'.

      Thank you for pointing out the typos in the summary table (line 1073). We have corrected 'Amplitudes of a-wave (0.3, 2.0, and 10.0 cd.s/m²)' to 'Amplitudes of a-wave (0.3, 3.0, and 10.0 cd.s/m²)' and 'IINL thickness' to 'INL thickness'. Your attention to detail is greatly appreciated and has been very helpful in improving our manuscript.

      References

      (1) Ji, K., et al., Resveratrol attenuates retinal ganglion cell loss in a mouse model of retinal ischemia reperfusion injury via multiple pathways. Experimental Eye Research, 2021. 209: p. 108683.

      (2) Feng, L., et al., Ripa-56 protects retinal ganglion cells in glutamate-induced retinal excitotoxic model of glaucoma. Sci Rep, 2024. 14(1): p. 3834.

    1. Author response:

      The following is the authors’ response to the previous reviews

      Public Reviews:

      Reviewer #1 (Public review):

      (1) Authors' experimental designs have some caveats to definitely support their claims. Authors claimed that aged LT-HSCs have no myeloid-biased clone expansion using transplantation assays. In these experiments, authors used 10 HSCs and young mice as recipients. Given the huge expansion of old HSC by number and known heterogeneity in immunophenotypically defined HSC populations, it is questionable how 10 out of so many old HSCs (an average of 300,000 up to 500,000 cells per mouse; Mitchell et al., Nature Cell Biology, 2023) can faithfully represent old HSC population. The Hoxb5+ old HSC primary and secondary recipient mice data (Fig. 2C and D) support this concern. In addition, they only used young recipients. Considering the importance of inflammatory aged niche in the myeloid-biased lineage output, transplanting young vs old LT-HSCs into aged mice will complete the whole picture. 

      We sincerely appreciate your insightful comment regarding the existence of approximately 500,000 HSCs per mouse in older mice. To address this, we have conducted a statistical analysis to determine the appropriate sample size needed to estimate the characteristics of a population of 500,000 cells with a 95% confidence level and a ±5% margin of error. This calculation was performed using the finite population correction applied to Cochran’s formula.

      For our calculations, we used a proportion of 50% (p = 0.5), as it has been reported that approximately 50% of HSCs are myeloid-biased[1,2]. The parameters used were as follows:

      N = 500,000 (total population size)

      Z = 1.96 (Z-score for a 95% confidence level)

      p = 0.5 (expected proportion)

      e = 0.05 (margin of error)

      Applying this formula, we determined that the required sample size is approximately 384 cells. This sample size ensures that the observed proportion in the sample will reflect the characteristics of the entire population. In our study, we have conducted functional experiments across Figures 2, 3, 5, 6, S3, and S6, with a total sample size of n = 126, which corresponds to over 1260 cells. While it would be ideal to analyze all 500,000 cells, this would necessitate the use of 50,000 recipient mice, which is not feasible. We believe that the number of cells analyzed is reasonable from a statistical standpoint. 
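      The calculation described above can be sketched as follows. This is a minimal illustration that assumes the standard textbook form of Cochran's formula with the finite population correction; the response does not reproduce the exact expression the authors used. The function names and the figure of ten cells per recipient are illustrative assumptions, the latter taken from the 10-cell transplants discussed in the review.

```python
import math


def cochran_sample_size(z: float, p: float, e: float) -> float:
    """Cochran's sample size for an effectively infinite population."""
    return (z ** 2) * p * (1.0 - p) / (e ** 2)


def finite_population_correction(n0: float, population: int) -> float:
    """Adjust an infinite-population sample size n0 for a finite population."""
    return n0 / (1.0 + (n0 - 1.0) / population)


# Parameters quoted in the response
N = 500_000   # total population size (HSCs per aged mouse)
Z = 1.96      # Z-score for a 95% confidence level
p = 0.5       # expected proportion of myeloid-biased HSCs
e = 0.05      # margin of error

n0 = cochran_sample_size(Z, p, e)
n_corrected = finite_population_correction(n0, N)
required_cells = math.ceil(n_corrected)

# Assumption for illustration: ten HSCs transplanted per recipient mouse
cells_per_recipient = 10
recipients_needed = math.ceil(required_cells / cells_per_recipient)

print(f"uncorrected n0        = {n0:.2f}")
print(f"corrected n           = {n_corrected:.2f}")
print(f"required cells        = {required_cells}")
print(f"recipients at 10/each = {recipients_needed}")
```

      With N this large the correction changes the result very little, so the required sample size is driven almost entirely by the chosen confidence level and margin of error rather than by the size of the HSC pool.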

      References

      (1) Dykstra, Brad et al. “Clonal analysis reveals multiple functional defects of aged murine hematopoietic stem cells.” The Journal of experimental medicine vol. 208,13 (2011): 2691-703. doi:10.1084/jem.20111490

      (2) Beerman, Isabel et al. “Functionally distinct hematopoietic stem cells modulate hematopoietic lineage potential during aging by a mechanism of clonal expansion.” Proceedings of the National Academy of Sciences of the United States of America vol. 107,12 (2010): 5465-70. doi:10.1073/pnas.1000834107

      (2) Authors' molecular data analyses need more rigor with unbiased approaches. They claimed that neither aged LT-HSCs nor aged ST-HSCs exhibited myeloid or lymphoid gene set enrichment but aged bulk HSCs, which are just a sum of LT-HSCs and ST-HSCs by their gating scheme (Fig. 4A), showed the "tendency" of enrichment of myeloid-related genes based on the selected gene set (Fig. 4D). Although the proportion of ST-HSCs is reduced in bulk HSCs upon aging, since ST-HSCs do not exhibit lymphoid gene set enrichment based on their data, it is hard to understand how aged bulk HSCs have more myeloid gene set enrichment compared to young bulk HSCs. This bulk HSC data rather suggest that there could be a trend toward certain lineage bias (although not significant) in aged LT-HSCs or ST-HSCs. Authors need to verify the molecular lineage priming of LT-HSCs and ST-HSCs using another comprehensive dataset. 

      Thank you for your thoughtful feedback regarding the lack of myeloid or lymphoid gene set enrichment in aged LT-HSCs and aged ST-HSCs, despite the observed tendency for myeloid-related gene enrichment in aged bulk HSCs.

      First, we acknowledge that the GSEA results vary among the different myeloid gene sets analyzed (Fig. 4, D–F; Fig. S4, C–D). Additionally, a comprehensive analysis of mouse HSC aging using multiple RNA-seq datasets reported that nearly 80% of differentially expressed genes show poor reproducibility across datasets[1]. These factors highlight the challenges of interpreting lineage bias in HSCs based solely on previously published transcriptomic data.

      Given these points, we believe that emphasizing functional experimental results is more critical than incorporating an additional dataset to support our claim. In this regard, we have confirmed that young and aged LT-HSCs have similar differentiation capacity (Figure 3), while myeloid-biased hematopoiesis is observed in aged bulk HSCs (Figure S3). These findings are further corroborated by independent functional experiments. We sincerely appreciate your insightful comments.

      Reference

      (1) Flohr Svendsen, Arthur et al. “A comprehensive transcriptome signature of murine hematopoietic stem cell aging.” Blood vol. 138,6 (2021): 439-451. doi:10.1182/blood.2020009729

      (3) Although authors could not find any molecular evidence for myeloid-biased hematopoiesis from old HSCs (either LT or ST), they argued that the ratio between LT-HSC and ST-HSC causes myeloid-biased hematopoiesis upon aging based on young HSC experiments (Fig. 6). However, old ST-HSC functional data showed that they barely contribute to blood production unlike young Hoxb5- HSCs (ST-HSC) in the transplantation setting (Fig. 2). Is there any evidence that in unperturbed native old hematopoiesis, old Hoxb5- HSCs (ST-HSC) still contribute to blood production?

      If so, what are their lineage potential/output? Without this information, it is hard to argue that the different ratio causes myeloid-biased hematopoiesis in aging context. 

      Thank you for the insightful and important question. The post-transplant chimerism of ST-HSCs was low in Fig. 2, indicating that transplantation induced a short-term loss of hematopoietic potential due to hematopoietic stress per cell. 

      To reduce this stress, we increased the number of HSCs in the transplantation setting. In Fig. S6, old LT-HSCs and old ST-HSCs were transplanted in a 50:50 or 20:80 ratio. As shown in Fig. S6D, the 20:80 group, which had a higher proportion of old ST-HSCs, exhibited a statistically significant increase in the lymphoid percentage in the peripheral blood post-transplantation. 

      These findings suggest that old ST-HSCs contribute to blood production following transplantation. 

      Reviewer #2 (Public review):

      While aspects of their work are fascinating and might have merit, several issues weaken the overall strength of the arguments and interpretation. Multiple experiments were done with a very low number of recipient mice, showed very large standard deviations, and had no statistically detectable difference between experimental groups. While the authors conclude that these experimental groups are not different, the displayed results seem too variable to conclude anything with certainty. The sensitivity of the performed experiments (e.g. Fig 3; Fig 6C, D) is too low to detect even reasonably strong differences between experimental groups and is thus inadequate to support the author's claims. This weakness of the study is not acknowledged in the text and is also not discussed. To support their conclusions the authors need to provide higher n-numbers and provide a detailed power analysis of the transplants in the methods section. 

      Response #2-1:

      Thank you for your important remarks. The power analysis for this experiment shows that power = 0.319, suggesting that a larger sample size may be needed. On the other hand, our method for determining the sample size in Figure 3 was as follows:

      (1) First, we checked whether a myeloid-biased change is detected in the bulk-HSC fraction (Figure S3). The results showed that the difference in myeloid output at 16 weeks after transplantation was statistically significant (young vs. aged = 7.2 ± 8.9 vs. 42.1 ± 35.5%, p = 0.01), even though n = 10.

      (2) Next, myeloid-biased HSCs have been reported to be a fraction with high self-renewal ability (2004, Blood). If myeloid-biased HSCs increase with aging, the increase in myeloid-biased HSCs in the LT-HSC fraction would be detected with higher sensitivity than in the bulk-HSC fraction used in Figure S3.

      (3) However, the two groups differed neither in statistical significance nor in the means themselves (young vs. aged = 51.4 ± 31.5% vs. 47.4 ± 39.0%, p = 0.82), even though n = 8 in Figure 3. Since the means themselves were so similar, it is highly likely that no difference would be detected even if n were further increased.

      Regarding Figure 6, we obtained a statistically significant difference and consider the sample size to be sufficient. In addition, we have performed various functional experiments (Figures 2, 5, 6 and S6), and have consistently found that expansion of myeloid-biased HSCs does not occur with aging in the Hoxb5+ HSC fraction. Based on the above, we conclude that the LT-HSC fraction does not differ in myeloid differentiation potential with aging.
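      To make the sensitivity question concrete, the following minimal sketch estimates the power of a two-group comparison with n = 8 per group. It assumes a two-sided normal (z-test) approximation, equal group sizes, and a common standard deviation pooled from the Figure 3 values quoted above. The response does not state which test or effect size produced the reported power of 0.319, so this sketch does not attempt to reproduce that figure, and the function name `approx_power` is hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def approx_power(delta: float, sd: float, n_per_group: int,
                 alpha: float = 0.05) -> float:
    """Approximate two-sided power for comparing two group means.

    Uses a normal approximation with equal group sizes and a common SD,
    so it is slightly optimistic for very small samples.
    """
    std_normal = NormalDist()
    z_crit = std_normal.inv_cdf(1.0 - alpha / 2.0)
    standard_error = sd * sqrt(2.0 / n_per_group)
    shift = abs(delta) / standard_error
    # Probability that the test statistic lands in either rejection tail
    return std_normal.cdf(shift - z_crit) + std_normal.cdf(-shift - z_crit)


# SDs quoted for Figure 3 (young: 31.5, aged: 39.0), pooled for equal n
pooled_sd = sqrt((31.5 ** 2 + 39.0 ** 2) / 2.0)

# Hypothetical true differences in myeloid output, in percentage points
for delta in (5.0, 10.0, 20.0, 35.0):
    power = approx_power(delta, pooled_sd, n_per_group=8)
    print(f"true difference {delta:>4.0f} points -> power ~ {power:.2f}")
```

      Under these assumptions, power rises with the size of the true difference and with n, and it equals alpha when the true difference is zero.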

      As the authors attempt to challenge the current model of the age-associated expansion of myeloid-biased HSCs (which has been observed and reproduced by many different groups), ideally additional strong evidence in the form of single-cell transplants is provided. 

      Response #2-2:

      Thank you for the comments. As the reviewer pointed out, we hope to reconfirm our results using single-cell level technology in the future.

      On the other hand, we have reported that the ratio of myeloid to lymphoid cells in the peripheral blood changes when the number of HSCs transplanted, or the number of supporting cells transplanted with HSCs, is varied[1-2]. Therefore, single-cell transplant data need to be interpreted very carefully to determine differentiation potential.

      From this viewpoint, future experiments will combine the Hoxb5 reporter system with a lineage tracing system that can track HSCs at the single-cell level over time. This approach will investigate changes in the self-renewal capacity of individual HSCs and their subsequent differentiation into progenitor cells and peripheral blood cells. We have reflected this comment by adding the following sentences in the manuscript.

      [P19, L451] “In contrast, our findings should be considered in light of some limitations. In this report, we primarily performed ten to twenty cell transplantation assays. Therefore, the current theory should be revalidated using single-cell technology with lineage tracing system[3-4]. This approach will investigate changes in the self-renewal capacity of individual HSCs and their subsequent differentiation into progenitor cells and peripheral blood cells.” 

      It is also unclear why the authors believe that the observed reduction of ST-HSCs relative to LT-HSCs explains the myeloid-biased phenotype observed in the peripheral blood. This point seems counterintuitive and requires further explanation. 

      Response #2-3:

      Thank you for your comment. We apologize for the insufficient explanation. Our data, as shown in Figures 3 and 4, demonstrate that the differentiation potential of LT-HSCs remains unchanged with age. Therefore, rather than suggesting that an increase in LT-HSCs with a consistent differentiation capacity leads to myeloid-biased hematopoiesis, it seems more accurate to highlight that the relative decrease in the proportion of ST-HSCs, which remain in peripheral blood as lymphocytes, leads to a relative increase in myeloid cells in peripheral blood and thus causes myeloid-biased hematopoiesis.

      However, if we focus on the increase in the ratio of LT-HSCs, it is also plausible to explain that “with aging, the proportion of LT-HSCs capable of long-term myeloid hematopoiesis increases. As a result, from 16 weeks after transplantation, the influence of LT-HSCs maintaining the long-term ability to produce myeloid cells becomes relatively more significant, leading to an increase in the ratio of myeloid cells in the peripheral blood and causing myeloid-biased hematopoiesis.”

      Based on my understanding of the presented data, the authors argue that myeloid-biased HSCs do not exist, as 

      a) they detect no difference between young/aged HSCs after transplant (mind low n-numbers and large std!!!); b) myeloid progenitors downstream of HSCs only show minor or no changes in frequency and c) aged LT-HSCs do not outperform young LT-HSCs in myeloid output in competitive transplants (mind low n-numbers and large std!!!). 

      However, given the low n-numbers and high variance of the results, the argument seems weak and the presented data does not support the claims sufficiently. That the number of downstream progenitors does not change could be explained by other mechanisms, for instance, the frequently reported differentiation short-cuts of HSCs and/or changes in the microenvironment. 

      Response #2-4:

      We appreciate the comments. As mentioned above, we will correct the manuscript regarding the sample size. Regarding the interpretation of the lack of increase in the percentage of myeloid progenitor cells in the bone marrow with age, it is indeed possible that various confounding factors, such as differentiation shortcuts or changes in the microenvironment, are involved.

      However, even when aged LT-HSCs and young LT-HSCs are transplanted into the same recipient mice, the timing of the appearance of different cell fractions in peripheral blood is similar (Figure 3 of this paper). Therefore, we have not obtained data suggesting that clear shortcuts exist in the differentiation process of aged HSCs into neutrophils or monocytes. Additionally, it is currently generally accepted that myeloid cells, including neutrophils and monocytes, differentiate from GMPs[1]. Since there is no change in the proportion of GMPs in the bone marrow with age, we concluded that the differentiation potential into myeloid cells remains consistent with aging.

      "Then, we found that the myeloid lineage proportions from young and aged LT-HSCs were nearly comparable during the observation period after transplantation (Fig. 3, B and C)." 

      [Comment to the authors]: Given the large standard deviation and low n-numbers, the power of the analysis to detect differences between experimental groups is very low. Experimental groups with too large standard deviations (as displayed here) are difficult to interpret and might be inconclusive. The absence of clearly detectable differences between young and aged transplanted HSCs could thus simply be a false-negative result. The shown experimental results hence do not provide strong evidence for the author's interpretation of the data. The authors should add additional transplants and include a detailed power analysis to be able to detect differences between experimental groups with reasonable sensitivity. 

      Response #2-5:

      Thank you for providing these insights. Regarding the sample size, we have addressed this in Response #2-1.

      Line 293: "Based on these findings, we concluded that myeloid-biased hematopoiesis observed following transplantation of aged HSCs was caused by a relative decrease in ST-HSC in the bulk-HSC compartment in aged mice rather than the selective expansion of myeloid-biased HSC clones." 

      Couldn't that also be explained by an increase in myeloid-biased HSCs, as repeatedly reported and seen in the expansion of CD150+ HSCs? It is not intuitively clear why a reduction of ST-HSC clones would lead to a myeloid bias. The authors should try to explain more clearly where they believe the increased number of myeloid cells comes from. What is the source of myeloid cells if the authors believe they are not derived from the expanded population of myeloid-biased HSCs?

      Response #2-6:

      Thank you for pointing this out. We apologize for the insufficient explanation. We will explain using Figure 8 from the paper.

      First, our data show that LT-HSCs maintain their differentiation capacity with age, while ST-HSCs lose their self-renewal capacity earlier, so that only long-lived memory lymphocytes remain in the peripheral blood after the loss of self-renewal capacity in ST-HSCs (Figure 8, upper panel). In mouse bone marrow, the proportion of LT-HSCs increases with age, while the proportion of ST-HSCs relatively decreases (Figure 8, lower panel and Figure S5). 

      Our data show that merely reproducing the ratio of LT-HSCs to ST-HSCs observed in aged mice using young LT-HSCs and ST-HSCs can replicate myeloid-biased hematopoiesis. This suggests that the increase in LT-HSCs and the relative decrease in ST-HSCs within the HSC compartment with aging are likely to contribute to myeloid-biased hematopoiesis.

      As mentioned earlier, since the differentiation capacity of LT-HSCs remains unchanged with age, it seems more accurate to describe that the relative decrease in the proportion of ST-HSCs, which retain long-lived memory lymphocytes in peripheral blood, leads to a relative increase in myeloid cells in peripheral blood and thus causes myeloid-biased hematopoiesis.

      However, focusing on the increase in the proportion of LT-HSCs, it is also possible to explain that “with aging, the proportion of LT-HSCs capable of long-term myeloid hematopoiesis increases. As a result, from 16 weeks after transplantation, the influence of LT-HSCs maintaining the long-term ability to produce myeloid cells becomes relatively more significant, leading to an increase in the ratio of myeloid cells in the peripheral blood and causing myeloid-biased hematopoiesis.”

      Recommendations for the authors: 

      Reviewer #2 (Recommendations for the authors):

      Summary: 

      Comment #2-1: While aspects of their work are fascinating and might have merit, several issues weaken the overall strength of the arguments and interpretation. Multiple experiments were done with a very low number of recipient mice, showed very large standard deviations, and had no statistically detectable difference between experimental groups. While the authors conclude that these experimental groups are not different, the displayed results seem too variable to conclude anything with certainty. The sensitivity of the performed experiments (e.g. Figure 3; Figure 6C, D) is too low to detect even reasonably strong differences between experimental groups and is thus inadequate to support the author's claims. This weakness of the study is not acknowledged in the text and is also not discussed. To support their conclusions the authors need to provide higher n-numbers and provide a detailed power analysis of the transplants in the methods section. 

      Response #2-1

      Thank you for your important remarks. The power analysis for this experiment shows that power = 0.319, suggesting that a larger sample size may be needed. On the other hand, our method for determining the sample size in Figure 3 was as follows: 

      (1) First, we checked whether a myeloid-biased change is detected in the bulk-HSC fraction (Figure S3). The results showed that the difference in myeloid output at 16 weeks after transplantation was statistically significant (young vs. aged = 7.2 ± 8.9 vs. 42.1 ± 35.5%, p = 0.01), even though n = 10. 

      (2) Next, myeloid-biased HSCs have been reported to be a fraction with high self-renewal ability (2004, Blood). If myeloid-biased HSCs increase with aging, the increase in myeloid-biased HSCs in the LT-HSC fraction would be detected with higher sensitivity than in the bulk-HSC fraction used in Figure S3. 

      (3) However, the two groups differed neither in statistical significance nor in the means themselves (young vs. aged = 51.4 ± 31.5% vs. 47.4 ± 39.0%, p = 0.82), even though n = 8 in Figure 3. Since the means themselves were so similar, it is highly likely that no difference would be detected even if n were further increased. 

      Regarding Figure 6, we obtained a statistically significant difference and consider the sample size to be sufficient. In addition, we have performed various functional experiments (Figures 2, 5, 6 and S6), and have consistently found that expansion of myeloid-biased HSCs does not occur with aging in the Hoxb5+ HSC fraction. Based on the above, we conclude that the LT-HSC fraction does not differ in myeloid differentiation potential with aging. 

      [Comment for authors]  

      Paradigm-shifting extraordinary claims require extraordinary data. Unfortunately, the authors do not provide additional data to further support their claims. Instead, the authors argue the following: Because they were able to find significant differences between experimental groups in some experiments, the absence of significant differences in the results of other experiments must be correct, too. 

      This logic is in my view flawed. Any assay/experiment with highly variable data has a very low sensitivity to detect significant differences between groups. If, as in this case, the variance is as large as the entire dynamic range of the readout, it becomes impossible to be able to detect any difference. In these cases, it is not surprising and actually expected that the mean of the group is located close to the center of the dynamic range as is the case here (center of dynamic range: 50%). In other words, this means that the experiments are simply not reproducible. It is absolutely critical to remember that any experiment and its associated statistical analysis has 3 (!!!) instead of 2 possible outcomes: 

      (1) There is a statistically significant difference 

      (2) There is no statistically significant difference 

      (3) The results of the experiment are inconclusive because the replicates are too variable and the results are not reproducible.  

      While most of us are inclined to think about outcomes (1) or (2), outcome (3) cannot be neglected. While it might be painful to accept, the only way to address concerns about data reproducibility is to provide additional data, improve reproducibility, and raise the power of the analysis to an acceptable level (e.g. able to detect a difference of 5-10% between groups). 

      Without going into the technical details, the example graph from the link below illustrates that with a power of 0.319 as stated by the authors, approx. 25 transplants, instead of 8, would be required.

      Typically, however, a power of 0.8 is a reasonable value for any power analysis (although it's not a very strong power either). Even if we are optimistic and assume that there might be a reasonably large difference between experimental groups (in the example above P2 = 0.6, which is actually not that large) we can estimate that we would need over 10 transplants per group to say with confidence that two experimental groups likely do not differ. With smaller differences, these numbers increase quickly to 20+ transplants per group as can be seen in the example graph using an Alpha of 0.1 above. 
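
      The dependence of sample size on the difference to be detected can be sketched with a standard normal-approximation formula for comparing two group means. This is only an illustrative sketch: it does not reproduce the graph referred to above, the function name `transplants_per_group` is hypothetical, and the standard deviation of 35 percentage points is an assumed value, not one taken from the study.

```python
from math import ceil
from statistics import NormalDist


def transplants_per_group(delta, sd, alpha=0.05, power=0.8):
    """Approximate recipients per group for a two-sided comparison of two means.

    delta: smallest difference worth detecting (percentage points)
    sd:    assumed common standard deviation of the readout (percentage points)
    Uses the normal approximation n = 2 * ((z_alpha + z_beta) * sd / delta)^2.
    """
    if delta <= 0 or sd <= 0:
        raise ValueError("delta and sd must be positive")
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided critical value
    z_beta = NormalDist().inv_cdf(power)           # quantile for the target power
    return ceil(2 * ((z_alpha + z_beta) * sd / delta) ** 2)


if __name__ == "__main__":
    # Hypothetical SD of 35 percentage points; not a value taken from the study.
    for delta in (40, 20, 10, 5):
        print(delta, transplants_per_group(delta, sd=35))
```

      Because the required number scales with the inverse square of the difference, halving the detectable difference roughly quadruples the number of recipients needed per group.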

      Further reading can be found here and in many textbooks or other online resources: https://power-analysis.com/effect_size.htm  https://tss.awf.poznan.pl/pdf-188978-110207?filename=Using%20power%20analysis%20to.pdf

      Response:

      Thank you for your feedback. We fully agree with the reviewer that paradigm-shifting claims must be supported by equally robust data. It has been well-documented that the frequency of myeloid-biased HSCs increases with age, with reports indicating that over 50% of the HSC compartment in aged mice consists of myeloid-biased HSCs[1,2]. Based on this, we believe that if aged LT-HSCs were substantially myeloid-biased, the difference should be readily detectable.

      To further validate our findings, we present a similar preliminary experiment. The resulting data are shown below (n = 8).

      Author response image 1.

      (A) Experimental design for competitive co-transplantation assay. Ten CD45.2<sup>+</sup> young LT-HSCs and ten CD45.2<sup>+</sup> aged LT-HSCs were transplanted with 2 × 10<sup>5</sup> CD45.1<sup>+</sup>/CD45.2<sup>+</sup> supporting cells into lethally irradiated CD45.1<sup>+</sup> recipient mice (n = 8). (B) Lineage output of young or aged LT-HSCs at 4, 8, 12, 16 weeks after transplantation. Each bar represents an individual mouse. *P < 0.05. **P < 0.01.

      While a slight increase in myeloid-biased hematopoiesis was observed in the aged LT-HSC fraction, the difference was not statistically significant. These new results are presented alongside the original Figure 3, which was generated using a larger sample size (n = 16).

      Author response image 2.

      (A) Experimental design for competitive co-transplantation assay. Ten CD45.2<sup>+</sup> young LT-HSCs and ten CD45.2<sup>+</sup> aged LT-HSCs were transplanted with 2 × 10<sup>5</sup> CD45.1<sup>+</sup>/CD45.2<sup>+</sup> supporting cells into lethally irradiated CD45.1<sup>+</sup> recipient mice (n = 16). (B) Lineage output of young or aged LT-HSCs at 4, 8, 12, 16 weeks after transplantation. Each bar represents an individual mouse.

      Consistent with the original data, aged LT-HSCs exhibited a lineage output that was nearly identical to that of young LT-HSCs. Nonetheless, as the reviewer rightly pointed out, we cannot completely exclude the possibility that subtle differences may exist but remain undetected. To address this, we have added the following sentence to the manuscript:  

      [P9, L200] “These findings unmistakably demonstrated that mixed/bulk-HSCs showed myeloid skewed hematopoiesis in PB with aging. In contrast, LT-HSCs maintained a consistent lineage output throughout life, although subtle differences between aged and young LT-HSCs may exist and cannot be entirely ruled out.”

      References

      (1) Dykstra, Brad et al. “Clonal analysis reveals multiple functional defects of aged murine hematopoietic stem cells.” The Journal of experimental medicine vol. 208,13 (2011): 2691-703. doi:10.1084/jem.20111490

      (2) Beerman, Isabel et al. “Functionally distinct hematopoietic stem cells modulate hematopoietic lineage potential during aging by a mechanism of clonal expansion.” Proceedings of the National Academy of Sciences of the United States of America vol. 107,12 (2010): 5465-70. doi:10.1073/pnas.1000834107

      Comment #2-3: It is also unclear why the authors believe that the observed reduction of ST-HSCs relative to LT-HSCs explains the myeloid-biased phenotype observed in the peripheral blood. This point seems counterintuitive and requires further explanation.

      Response #2-3:  

      Thank you for your comment. We apologize for the insufficient explanation. Our data, as shown in Figures 3 and 4, demonstrate that the differentiation potential of LT-HSCs remains unchanged with age. Therefore, rather than suggesting that an increase in LT-HSCs with a consistent differentiation capacity leads to myeloid-biased hematopoiesis, it seems more accurate to highlight that the relative decrease in the proportion of ST-HSCs, whose progeny remain in peripheral blood as lymphocytes, leads to a relative increase in myeloid cells in peripheral blood and thus causes myeloid-biased hematopoiesis. However, if we focus on the increase in the ratio of LT-HSCs, it is also plausible to explain that "with aging, the proportion of LT-HSCs capable of long-term myeloid hematopoiesis increases. As a result, from 16 weeks after transplantation, the influence of LT-HSCs maintaining the long-term ability to produce myeloid cells becomes relatively more significant, leading to an increase in the ratio of myeloid cells in the peripheral blood and causing myeloid-biased hematopoiesis."

      [Comment for authors] 

      While this interpretation of the data might make sense, the shown data do not exclude alternative explanations. The authors do not exclude the possibility that LT-HSCs expand with age and that this expansion in combination with an aging microenvironment drives myeloid bias. The authors should quantify the frequency [%] and absolute number of LT-HSCs and ST-HSCs in young vs. aged animals. Especially analyzing the abs. numbers of cells will be important to support their claims as % can be affected by changes in the frequency of other populations.

      Thank you for raising this very important point. As the reviewer pointed out, we do not exclude the possibility that LT-HSC expansion in combination with an aged microenvironment drives myeloid bias. Additionally, we acknowledge that myeloid-biased hematopoiesis with age is a complex process likely influenced by multiple factors. We would like to discuss the mechanism mentioned as a future research direction. Thank you for the insightful feedback. Regarding the point about the absolute cell numbers mentioned in the latter half of the paragraph, we will address this in detail in our subsequent response (Response #2-4).

      Comment #2-4: Based on my understanding of the presented data, the authors argue that myeloid-biased HSCs do not exist, as a) they detect no difference between young/aged HSCs after transplant (mind low n-numbers and large std!); b) myeloid progenitors downstream of HSCs only show minor or no changes in frequency and c) aged LT-HSCs do not outperform young LT-HSCs in myeloid output in competitive transplants (mind low n-numbers and large std!). However, given the low n-numbers and high variance of the results, the argument seems weak and the presented data do not support the claims sufficiently. That the number of downstream progenitors does not change could be explained by other mechanisms, for instance, the frequently reported differentiation short-cuts of HSCs and/or changes in the microenvironment.

      Response #2-4:  

      We appreciate the comments. As mentioned above, we will correct the manuscript regarding the sample size. Regarding the interpretation of the lack of increase in the percentage of myeloid progenitor cells in the bone marrow with age, it is indeed possible that various confounding factors, such as differentiation shortcuts or changes in the microenvironment, are involved. However, even when aged LT-HSCs and young LT-HSCs are transplanted into the same recipient mice, the timing of the appearance of different cell fractions in peripheral blood is similar (Figure 3 of this paper). Therefore, we have not obtained data suggesting that clear shortcuts exist in the differentiation process of aged HSCs into neutrophils or monocytes. Additionally, it is currently generally accepted that myeloid cells, including neutrophils and monocytes, differentiate from GMPs[1]. Since there are no changes in the proportion of GMPs in the bone marrow with age, we concluded that the differentiation potential into myeloid cells remains consistent with aging.

      Reference 

      (1) Akashi K and others, 'A Clonogenic Common Myeloid Progenitor That Gives Rise to All Myeloid Lineages', Nature, 404.6774 (2000), 193-97. 

      [Comment for authors] 

      As the relative frequency of cell population can be misleading, the authors should compare the absolute numbers of progenitors in young vs. aged mice to strengthen their argument. It would also be helpful to quantify the absolute numbers and relative frequencies in WT mice to exclude the possibility the HoxB5-trimcherry mouse model suffers from unexpected aging phenotypes and the hematopoietic system differs from wild-type animals.

      Thank you for your valuable feedback. We understand the importance of comparing the absolute numbers of progenitors in young versus aged mice to provide a more accurate representation of the changes in cell populations.

      Therefore, we quantified the absolute cell count of hematopoietic cells in the bone marrow using flow cytometry data. 

      Author response image 3.

      As previously reported, we observed a 10-fold increase in the number of pHSCs in aged mice compared to young mice. Additionally, our analysis revealed a statistically significant decrease in the number of Flk2+ progenitors and CLPs in aged mice. On the other hand, there was no statistically significant change in the number of myeloid progenitors between the two age groups. We appreciate the suggestion and hope that this additional information strengthens our argument and addresses your concerns.

      Comment #2-5:  

      "Then, we found that the myeloid lineage proportions from young and aged LT-HSCs were nearly comparable during the observation period after transplantation (Figure 3, B and C)." Given the large standard deviation and low n-numbers, the power of the analysis to detect differences between experimental groups is very low. Experimental groups with too large standard deviations (as displayed here) are difficult to interpret and might be inconclusive. The absence of clearly detectable differences between young and aged transplanted HSCs could thus simply be a false-negative result. The shown experimental results hence do not provide strong evidence for the author's interpretation of the data. The authors should add additional transplants and include a detailed power analysis to be able to detect differences between experimental groups with reasonable sensitivity. 

      Response #2-5:  

      Thank you for providing these insights. Regarding the sample size, we have addressed this in Response #2-1. 

      [Comment for authors]  

      As explained in detail in the response to #2-1 the provided arguments are not convincing. As the authors pointed out, the power of these experiments is too low to make strong claims. If the author does not intend to provide new data, the language of the manuscript needs to be adjusted to reflect this weakness. A paragraph discussing the limitations of the study mentioning the limited power of the data should be included beyond the above-mentioned rather vague statement that the data should be validated (which is almost always necessary anyway). 

      Thank you for your valuable comment. We agree with the importance of discussing potential limitations in our experimental design. In response to the reviewer’s suggestion, we have revised the manuscript to include the following sentences:

      [P19, L434] "In the co-transplantation assay shown in Figure 3, the myeloid lineage output derived from young and aged LT-HSCs was comparable (Young LT-HSC: 51.4 ± 31.5% vs. Aged LT-HSC: 47.4 ± 39.0%, p = 0.82). Although no significant difference was detected, the small sample size (n = 8) may limit the sensitivity of the assay to detect subtle myeloid-biased phenotypes."

      This addition acknowledges the potential limitations of our analysis and highlights the need for further investigation with larger cohorts.
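
      The limited sensitivity can be made concrete with a rough calculation from the summary statistics quoted above. This is a sketch under stated assumptions, not the authors' analysis: it treats the ± values as standard deviations, treats the two groups as independent (although the co-transplantation design is paired), and uses a normal approximation. The function names are hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def standard_error(sd1, n1, sd2, n2):
    """Standard error of the difference between two independent group means."""
    return sqrt(sd1 ** 2 / n1 + sd2 ** 2 / n2)


def minimum_detectable_difference(se, alpha=0.05, power=0.8):
    """Smallest true difference detectable at the given power (normal approximation)."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(power)
    return (z_alpha + z_beta) * se


# Summary statistics quoted in the revised text (percent myeloid output).
YOUNG_MEAN, YOUNG_SD = 51.4, 31.5
AGED_MEAN, AGED_SD = 47.4, 39.0
N = 8  # recipients per group

se = standard_error(YOUNG_SD, N, AGED_SD, N)
t_stat = (YOUNG_MEAN - AGED_MEAN) / se
mdd = minimum_detectable_difference(se)

print(f"standard error of the difference: {se:.1f} percentage points")
print(f"t statistic: {t_stat:.2f}")
print(f"minimum detectable difference at 80% power: {mdd:.1f} percentage points")
```

      Under these assumptions the minimum detectable difference is on the order of 50 percentage points, so only a very large shift in myeloid output could be expected to reach significance with eight recipients.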

      Comment #2-6:

      Line 293: "Based on these findings, we concluded that myeloid biased hematopoiesis observed following transplantation of aged HSCs was caused by a relative decrease in ST-HSC in the bulk-HSC compartment in aged mice rather than the selective expansion of myeloid-biased HSC clones." Couldn't that also be explained by an increase in myeloid-biased HSCs, as repeatedly reported and seen in the expansion of CD150+ HSCs? It is not intuitively clear why a reduction of ST-HSC clones would lead to a myeloid bias. The authors should try to explain more clearly where they believe the increased number of myeloid cells comes from. What is the source of myeloid cells if the authors believe they are not derived from the expanded population of myeloid-biased HSCs?

      Response #2-6:

      Thank you for pointing this out. We apologize for the insufficient explanation. We will explain using attached Figure 8 from the paper. First, our data show that LT-HSCs maintain their differentiation capacity with age, while ST-HSCs lose their self-renewal capacity earlier, so that only long-lived memory lymphocytes remain in the peripheral blood after the loss of self-renewal capacity in ST-HSCs (Figure 8, upper panel). In mouse bone marrow, the proportion of LT-HSCs increases with age, while the proportion of ST-HSCs relatively decreases (Figure 8, lower panel and Figure S5).

      Our data show that merely reproducing the ratio of LT-HSCs to ST-HSCs observed in aged mice using young LT-HSCs and ST-HSCs can replicate myeloid-biased hematopoiesis. This suggests that the increase in LT-HSC and the relative decrease in ST-HSC within the HSC compartment with aging are likely to contribute to myeloid-biased hematopoiesis.

      As mentioned earlier, since the differentiation capacity of LT-HSCs remains unchanged with age, it seems more accurate to say that the relative decrease in the proportion of ST-HSCs, which leave long-lived memory lymphocytes in peripheral blood, leads to a relative increase in myeloid cells in peripheral blood and thus causes myeloid-biased hematopoiesis. However, focusing on the increase in the proportion of LT-HSCs, it is also possible to explain that "with aging, the proportion of LT-HSCs capable of long-term myeloid hematopoiesis increases. As a result, from 16 weeks after transplantation, the influence of LT-HSCs maintaining the long-term ability to produce myeloid cells becomes relatively more significant, leading to an increase in the ratio of myeloid cells in the peripheral blood and causing myeloid-biased hematopoiesis."

      [Comment for authors]

      While I can follow the logic of the argument, my concerns about the interpretation remain as I see discrepancies with other findings in the published literature. For instance, what the authors call ST-HSCs differs from the classical functional definition of ST-HSCs. It is thus difficult to relate the described observations to previous reports. ST-HSCs typically can contribute significantly to multiple lineages for several weeks (see for example PMID: 29625072). It is somewhat surprising that the ST-HSCs in this study don't show this potential and lose their potential much quicker.

      The authors should thus provide a more comprehensive depth of immunophenotypic and molecular characterization to compare their LT-HSCs to ST-HSCs. For instance, are LT-HSCs CD41- HSCs? How do ST-HSCs differ in their surface marker expression from previously used definitions of ST-HSCs? A list of differentially expressed genes between young and old LT-HSCs and ST-HSCs should be done and will likely provide important insights into the molecular programs/markers (beyond the provided GO analysis, which seems superficial).

      Thank you for your valuable feedback. As the reviewer noted, there are indeed multiple definitions of ST-HSCs. We appreciate the opportunity to clarify our definitions of ST-HSCs. We define ST-HSCs functionally, rather than by surface antigens, which we believe is the most classical and widely accepted definition [1]. In our study, we define long-term hematopoietic stem cells (LT-HSCs) as those HSCs that continue to contribute to hematopoiesis after a second transplantation and possess long-term self-renewal potential. Conversely, we define short-term hematopoietic stem cells (ST-HSCs) as those HSCs that do not contribute to hematopoiesis after a second transplantation and only exhibit self-renewal potential in the short term. 

      Next, in the paper referenced by the reviewer[2], the chimerism of each fraction of ST-HSCs also peaked at 4 weeks and then decreased to approximately 0.1% after 12 weeks post-transplantation. Author response image 4 illustrates the ST-HSC donor chimerism from our Figure 2. We believe that the data in the paper referenced by the reviewer[2] are consistent with our own observations of the hematopoietic pattern following ST-HSC transplantation, indicating a characteristic loss of hematopoietic potential 4 weeks after the transplantation. Furthermore, as shown in Figures 2D and 2F, the fraction of ST-HSCs does not exhibit hematopoietic activity after the second transplantation. Therefore, we consider this fraction to be ST-HSCs.

      Author response image 4.

      Additionally, the RNAseq data presented in Figures 4 and S4 revealed that the GSEA results vary among the different myeloid gene sets analyzed (Fig. 4, D–F; Fig. S4, C–D). Moreover, a comprehensive analysis of mouse HSC aging using multiple RNA-seq datasets reported that nearly 80% of differentially expressed genes show poor reproducibility across datasets[3]. From the above, while RNAseq data is indeed helpful, we believe that emphasizing functional experimental results is more critical than incorporating an additional dataset to support our claim. Thank you once again for your insightful feedback.

      References

      (1) Kiel, Mark J et al. “SLAM family receptors distinguish hematopoietic stem and progenitor cells and reveal endothelial niches for stem cells.” Cell vol. 121,7 (2005): 1109-21. doi:10.1016/j.cell.2005.05.026

      (2) Yamamoto, Ryo et al. “Large-Scale Clonal Analysis Resolves Aging of the Mouse Hematopoietic Stem Cell Compartment.” Cell stem cell vol. 22,4 (2018): 600-607.e4. doi:10.1016/j.stem.2018.03.013

      (3) Flohr Svendsen, Arthur et al. “A comprehensive transcriptome signature of murine hematopoietic stem cell aging.” Blood vol. 138,6 (2021): 439-451. doi:10.1182/blood.2020009729

      Reviewer #3 (Public review): 

      Although the topic is appropriate and the new model provides a new way to think about lineage-biased output observed in multiple hematopoietic contexts, some of the experimental design choices, as well as some of the conclusions drawn from the results could be substantially improved. Also, they do not propose any potential mechanism to explain this process, which reduces the potential impact and novelty of the study. 

      The authors have satisfactorily replied to some of my comments. However, there are multiple key aspects that still remain unresolved.

      Reviewer #3 (Recommendations for the authors): 

      Comment #3-1,2:  

      Although the additional details are much appreciated the core of my original comments remains unanswered. There are still no details about the irradiation dose for each particular experiment. Is any transplant performed using a 9.1 Gy dose? If yes, please indicate it in text or figure legend. If not, please remove this number from the corresponding method section. 

      Again, 9.5 Gy (split in two doses) is commonly reported as sublethal. The fact that the authors used a methodology that deviates from the "standard" for the field makes it difficult to put these results in context with previous studies. It is not possible to know if the direct and indirect effects of this conditioning method on the hematopoietic system have any consequences for the presented results.

      Thank you for your clarification. We confirm that none of the transplantation experiments described were performed using a 9.1 Gy irradiation dose. We have therefore removed the mention of "9.1 Gy" from the relevant section of the Materials and Methods. We appreciate the helpful suggestion to improve the clarity of the manuscript.

      [P22, L493] “12-24 hours prior to transplantation, C57BL/6-Ly5.1 mice, or aged C57BL/6J recipient mice were lethally irradiated with single doses of 8.7 Gy.”

      Regarding the reviewer’s concern about the radiation dose used in our experiments, we will address this point in more detail in our subsequent response (see Response #3-4).

      Comment #3-4(Original): When representing the contribution to PB from transplanted cells, the authors show the % of each lineage within the donor-derived cells (Figures 3B-C, 5B, 6B-D, 7C-E, and S3 B-C). To have a better picture of total donor contribution, total PB and BM chimerism should be included for each transplantation assay. Also, for Figures 2C-D and Figures S2A-B, do the graphs represent 100% of the PB cells? Are there any radioresistant cells?

      Response #3-4 (Original): Thank you for highlighting this point. Indeed, donor contribution to total peripheral blood (PB) is important information. We have included the donor contribution data for each figure above mentioned.

      In Figure 2C-D and Figure S2A-B, the percentage of donor chimerism in PB was defined as the percentage of CD45.1-CD45.2+ cells among total CD45.1-CD45.2+ and CD45.1+CD45.2+ cells as described in method section.
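
      The definition above can be written out as a short calculation. The sketch below is illustrative only: the function name, its arguments and the event counts are hypothetical, and the optional `include_recipient` flag shows how the value changes when recipient-derived (CD45.1+CD45.2-) cells are added to the denominator.

```python
def donor_chimerism(donor, support, recipient=0, include_recipient=False):
    """Percent donor chimerism in peripheral blood from gated event counts.

    donor:     CD45.1-CD45.2+ events (donor-derived)
    support:   CD45.1+CD45.2+ events (supporting-cell-derived)
    recipient: CD45.1+CD45.2- events (recipient-derived)

    By default the denominator follows the definition in the text
    (donor + supporting cells only); recipient-derived cells are added
    only when include_recipient is True.
    """
    denominator = donor + support + (recipient if include_recipient else 0)
    if denominator <= 0:
        raise ValueError("no events in the denominator gate")
    return 100.0 * donor / denominator


if __name__ == "__main__":
    # Hypothetical event counts, for illustration only.
    print(donor_chimerism(300, 700))                                      # donor + support
    print(donor_chimerism(300, 700, recipient=1000, include_recipient=True))
```

      When recipient-derived cells persist, the two denominators can give noticeably different chimerism values for the same sample.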

      Comment for our #3-4 response:  

      Thanks for sharing these data. These graphs should be included in their corresponding figures along with donor contribution to BM. 

      Regarding Figure 2C-D, as currently shown, the graphs only account for CD45.1-CD45.2+ (donor-derived) and CD45.1+CD45.2+ (supporting-derived). What is the percentage of CD45.1+CD45.2- (recipient-derived)? Since the irradiation regimen is atypical, including this information would help to know more about the effects of this conditioning method.

      Thank you for your insightful comment regarding Figure 2C-D. To address the concern that the reviewer pointed out, we provide the kinetics of the percentage of CD45.1+CD45.2- (recipient-derived) in Author response image 5.

      Author response image 5.

      As the reviewer pointed out, we observed the persistence of recipient-derived cells, particularly in the secondary transplant. As noted, this suggests that our conditioning regimen may have been suboptimal. In response, we will include the donor chimerism analysis in the total cells and add the following statement in the study limitations section to acknowledge this point:

      [P19, L439] “Additionally, in this study, we purified LT-HSCs using the Hoxb5 reporter system and employed a moderate conditioning regimen (8.7 Gy). To have a better picture of total donor contribution, total PB chimerism are presented in Figure S7 and we cannot exclude the possibility that these factors may have influenced the results. Therefore, it would be ideal to validate our findings using alternative LT-HSC markers and different conditioning regimens.”

      Comment #3-5: For BM progenitor frequencies, the authors present the data as the frequency of cKit+ cells. This normalization might be misleading as changes in the proportion of cKit+ between the different experimental conditions could mask differences in these BM subpopulations. Representing this data as the frequency of BM single cells or as absolute numbers (e.g., per femur) would be valuable.

      Response #3-5:

      We appreciate the reviewer's comment on this point. 

      Firstly, as shown in Supplemental Figures S1B and S1C, we analyze the upstream (HSC, MPP, Flk2+) and downstream (CLP, MEP, CMP, GMP) fractions in different panels. Therefore, normalization is required to assess the differentiation of HSCs from upstream to downstream.

      Additionally, the reason for normalizing by c-Kit+ is that the bone marrow analysis was performed after enrichment using the anti-c-Kit antibody for both upstream and downstream fractions. Based on this, we calculated the progenitor populations as a frequency within the c-Kit positive cells. Next, the results of normalizing to whole bone marrow cells (live cells) are shown below.

      Author response image 6.

      As with normalization to c-Kit+ cells, myeloid progenitors remained largely unchanged, apart from a statistically significant decrease in CMP in aged mice. Additionally, there were no significant differences in CLP. In conclusion, similar results were obtained between the normalization with c-Kit and the normalization with whole bone marrow cells (live cells).

      However, as the reviewer pointed out, it is necessary to explain the reason for normalization with c-Kit. Therefore, we will add the following description.

      [P21, L502] For the combined analysis of the upstream (HSC, MPP, Flk2+) and downstream (CLP, MEP, CMP, GMP) fractions in Figures 1B, we normalized by cKit+ cells because we performed a c-Kit enrichment for the bone marrow analysis.

      Comment for our #3-5 response:

      I understand that normalization is necessary to compare across different BM populations. However, the best way would be to normalize to single cells. As I mentioned in my original comment, normalizing to cKit+ cells could be misleading, as the proportion of cKit+ cells could be different across the experimental conditions. Further, enriching for cKit+ cells when analyzing BM subpopulation frequencies could introduce similar potential errors. The enrichment would depend on the level of expression of cKit for each of these populations, which would alter the final quantification. Indeed, CLP are typically defined as cKit-med/low. Thus, cKit enrichment would not be a great method to analyze the frequency of these cells.

      The graph in the authors' response to my comment shows a similar trend to what is represented in Figure 1B for some populations. However, there are multiple statistically significant changes that disappear in this new version. This supports my original concern and, in consequence, I would encourage the authors to represent this data as the frequency of BM single cells or as absolute numbers (e.g., per femur).
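
      The normalization concern can be illustrated with a toy calculation. All counts below are hypothetical and chosen only to show the effect: the CMP count is held constant while the cKit+ compartment doubles, so the CMP frequency among cKit+ cells halves even though the frequency among live cells does not change.

```python
def frequency(count, parent_count):
    """Percentage of a population within a chosen parent gate."""
    if parent_count <= 0:
        raise ValueError("parent gate contains no cells")
    return 100.0 * count / parent_count


# Hypothetical cell counts per femur; not data from the study.
young = {"live": 20_000_000, "ckit": 400_000, "cmp": 40_000}
aged = {"live": 20_000_000, "ckit": 800_000, "cmp": 40_000}

for label, sample in (("young", young), ("aged", aged)):
    of_ckit = frequency(sample["cmp"], sample["ckit"])
    of_live = frequency(sample["cmp"], sample["live"])
    print(f"{label}: CMP = {of_ckit:.1f}% of cKit+ cells, {of_live:.2f}% of live cells")
```

      In this constructed case an apparent twofold decrease in CMP arises purely from the change in the parent gate, which is why absolute numbers or frequencies among total single cells are the safer readout.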

      Thank you for your thoughtful follow-up comment. In response to the reviewer’s suggestion, we now represent the data as the frequency among total BM single cells. These revised graphs have been incorporated into the updated Figure 7F, and the corresponding figure legend has been revised accordingly to accurately reflect these representations. We appreciate your valuable input, which has helped us improve the clarity and rigor of our data presentation.

      Comment #3-6: Regarding Figure 1B, the authors argue that if myeloid-biased HSC clones increase with age, they should see increased frequency of all components of the myeloid differentiation pathway (CMP, GMP, MEP). This would imply that their results (no changes or reduction in these myeloid subpopulations) suggest the absence of myeloid-biased HSC clones expansion with age. This reviewer believes that differentiation dynamics within the hematopoietic hierarchy can be more complex than a cascade of sequential and compartmentalized events (e.g., accelerated differentiation at the CMP level could cause exhaustion of this compartment and explain its reduction with age and why GMP and MEP are unchanged) and these conclusions should be considered more carefully.

      Response #3-6:

      We wish to thank the reviewer for this comment. We agree that the differentiation pathway may not be a cascade of sequential events but could be influenced by various factors such as extrinsic factors.

      In Figure 1B, we hypothesized that there may be other mechanisms causing myeloid-biased hematopoiesis besides the age-related increase in myeloid-biased HSCs, given that the percentage of myeloid progenitor cells in the bone marrow did not change with age. However, we do not discuss the presence or absence of myeloid-biased HSCs based on the data in Figure 1B. 

      Our newly proposed theories—that the differentiation capacity of LT-HSCs remains unchanged with age and that age-related myeloid-biased hematopoiesis is due to changes in the ratio of LT-HSCs to ST-HSCs—are based on functional experiment results. As the reviewer pointed out, to discuss the presence or absence of myeloid-biased HSCs based on the data in Figure 1B, it is necessary to apply a system that can track HSC differentiation at single-cell level. The technology would clarify changes in the self-renewal capacity of individual HSCs and their differentiation into progenitor cells and peripheral blood cells. The authors believe that those single-cell technologies will be beneficial in understanding the differentiation of HSCs. Based on the above, the following statement has been added to the text.

      [P19, L440] In contrast, our findings should be considered in light of some limitations. In this report, we primarily performed ten to twenty cell transplantation assays. Therefore, the current theory should be revalidated using single-cell technology with lineage tracing system1-2. This approach will investigate changes in the self-renewal capacity of individual HSCs and their subsequent differentiation into progenitor cells and peripheral blood cells. 

      Comment for our #3-6 response:

      Thanks for the response. My original comments referred to the statement "On the other hand, in contrast to what we anticipated, the frequency of GMP was stable, and the percentage of CMP actually decreased significantly with age, defying our prediction that the frequency of components of the myeloid differentiation pathway, such as CMP, GMP, and MEP would increase in aged mice if myeloid-biased HSC clones increase with age (Fig. 1 B)" (lines #129-133). Again, the absence of an increase in CMP, GMP and MEP with age does not mean the absence of an increase in myeloid-biased HSC clones. This statement should be considered more carefully.

      Thank you for the insightful comment. We agree that the absence of an increase in CMP, GMP and MEP with age does not mean the absence of an increase in myeloid-biased HSC clones. In our revised manuscript, we have refined the statement to acknowledge this nuance more clearly. The updated text now reads as follows:

      [P6, L129] On the other hand, in contrast to what we anticipated, the frequency of GMP was stable, and the percentage of CMP actually decreased significantly with age, defying our prediction that the frequency of components of the myeloid differentiation pathway, such as CMP, GMP, and MEP may increase in aged mice, if myeloid-biased HSC clones increase with age.

      Comment #3-7: Within the few recipients showing good donor engraftment in Figure 2C, there is a big proportion of T cells that are "amplified" upon secondary transplantation (Figure 2D). Is this expected?

      Response #3-7:

      We wish to express our deep appreciation to the reviewer for the insightful comment on this point. As the reviewer pointed out, in Figure 2D, a few recipients show a very high percentage of T cells. The authors had the same question and considered this phenomenon as follows:

      (1) One reason for the very high percentage of T cells is that we used 1 × 10<sup>7</sup> whole bone marrow cells in the secondary transplantation. Consequently, the donor cells in the secondary transplantation contained more T-cell progenitor cells, leading to a greater increase in T cells compared to the primary transplantation.

      (2) We also consider that this phenomenon may be influenced by the reduced self-renewal capacity of aged LT-HSCs, resulting in decreased sustained production of myeloid cells in the secondary recipient mice. As a result, long-lived memory-type lymphocytes may preferentially remain in the peripheral blood, increasing the percentage of T cells in the secondary recipient mice.

      We have discussed our hypothesis regarding this interesting phenomenon. To further clarify the characteristics of the increased T-cell count in the secondary recipient mice, we will analyze TCR clonality and diversity in the future.

      Comment for our #3-7 response:

      Thanks for the potential explanations to my question. This fact is not commonly reported in previous transplantation studies using aged HSCs. Could Hoxb5 label a fraction of HSCs that is lymphoid/T-cell biased upon secondary transplantation? The number of recipients with a high frequency of lymphoid cells in the peripheral blood (even from young mice) is remarkable.

      Response:

      Thank you for your insightful suggestion. Based on this comment, we calculated the percentage of lymphoid cells in the donor fraction at 16 weeks following the secondary transplantation, which was 56.1 ± 25.8% (L/M = 1.27). According to the Müller-Sieburg criteria, lymphoid-biased hematopoiesis is defined as having an L/M ratio greater than 10. 

      Given our findings, we concluded that the Hoxb5-labeled fraction does not specifically indicate lymphoid-biased hematopoiesis. We sincerely appreciate the valuable input, which helped us to further clarify the interpretation of our results.
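
      The threshold comparison described above can be sketched in a few lines. This is a minimal illustration, not the authors' analysis: the function names are hypothetical, only the L/M > 10 cut-off quoted in the response is applied, and the example assumes that the myeloid share is simply the remainder of the donor fraction, which may not match how the reported ratio was derived.

```python
def lm_ratio(lymphoid_pct, myeloid_pct):
    """Return the lymphoid-to-myeloid (L/M) ratio of donor-derived cells."""
    if myeloid_pct <= 0:
        raise ValueError("myeloid percentage must be positive")
    return lymphoid_pct / myeloid_pct


def is_lymphoid_biased(lymphoid_pct, myeloid_pct, threshold=10.0):
    """Apply the L/M > 10 cut-off quoted in the response (other categories are ignored)."""
    return lm_ratio(lymphoid_pct, myeloid_pct) > threshold


# Illustrative input only: the mean lymphoid percentage from the response,
# with the myeloid share ASSUMED to be the remainder of the donor fraction.
lymphoid = 56.1
myeloid = 100.0 - lymphoid
print(is_lymphoid_biased(lymphoid, myeloid))
```

      A ratio computed per mouse and then averaged can differ from a ratio of mean percentages, so the sketch is only meant to show the shape of the comparison.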

      Comment #3-8: Do the authors have any explanation for the high level of variability within the recipients of Hoxb5+ cells in Figure 2C?

      Response #3-8:

      We appreciate the reviewer's comment on this point. As noted in our previous report, transplantation of a sufficient number of HSCs results in stable donor chimerism, whereas a small number of HSCs leads to increased variability in donor chimerism [1]. Additionally, other studies have observed high variability when fewer than 10 HSCs are transplanted [2, 3]. Based on this evidence, we consider that the transplantation of a small number of cells (10 cells) is the primary cause of the high level of variability observed.

      Comment for our #3-8 response:

      I agree that transplanting a low number of HSCs increases the mouse-to-mouse variability. For that reason, a larger cohort of recipients for this kind of experiment would be ideal.

      Response:

      Thank you for the insightful comment. We agree that a larger cohort of recipients would be ideal for this type of experiment. In Figure 2, the differences between Hoxb5⁺ and Hoxb5⁻ cells are robust, allowing for a clear statistical distinction despite the cohort size. However, we also recognize that a larger cohort would be necessary to detect more subtle differences, particularly in Figure 3. In response, we have added the following statement to the main text to acknowledge this limitation:

      [P9, L200] These findings unmistakably demonstrated that mixed/bulk-HSCs showed myeloid skewed hematopoiesis in PB with aging. In contrast, LT-HSCs maintained a consistent lineage output throughout life, although subtle differences between aged and young LT-HSCs may exist and cannot be entirely ruled out.

      Comment #3-10: Is Figure 2G considering all primary recipients or only the ones that were used for secondary transplants? The second option would be a fairer comparison.

      Response #3-10:

      We appreciate the reviewer's comment on this point. We considered all primary recipients in Figure 2G to ensure a fair comparison, given the influence of various factors such as the radiosensitivity of individual recipient mice[1]. Comparing only the primary recipients used in the secondary transplantation would result in n = 3 (primary recipient) vs. n = 12 (secondary recipient). Including all primary recipients yields n = 11 vs. n = 12, providing a more balanced comparison. Therefore, we analyzed all primary recipient mice to ensure the reliability of our results.

      Comment for our #3-10 response:

      I respectfully disagree. Secondary recipients are derived from only 3 of the primary recipients. Therefore, the BM composition is determined by the composition of their donors. Including primary recipients that are not transplanted into secondary recipients is not the fairest comparison for this analysis.

      Thank you for your comment and for highlighting this important issue. We acknowledge the concern that including primary recipients that are not transplanted into secondary recipients is not the fairest comparison for this analysis. In response, we have reanalyzed the data using only the primary recipients whose bone marrow was actually transplanted into secondary recipients. 

      Author response image 7.

      Importantly, the reanalysis confirmed that the kinetics of myeloid cell proportions in peripheral blood were consistent between primary and secondary transplant recipients. We sincerely appreciate your thoughtful feedback, which has helped us improve the clarity.

      Comment #3-11: When discussing the transcriptional profile of young and aged HSCs, the authors claim that genes linked to myeloid differentiation remain unchanged in the LT-HSC fraction while there are significant changes in the ST-HSCs. However, 2 out of the 4 genes shown in Figure S4B show ratios higher than 1 in LT-HSCs.

      Response #3-11:

      Thank you for highlighting this important point. As the reviewer pointed out, when we analyze the expression of myeloid-related genes, some genes are elevated in aged LT-HSCs compared to young LT-HSCs. However, the GSEA analysis using myeloid-related gene sets, which include several hundred genes, shows no significant difference between young and aged LT-HSCs (see Figure S4C in this paper). Furthermore, functional experiments using the co-transplantation system show no difference in differentiation capacity between young and aged LT-HSCs (see Figure 3 in this paper). Based on these results, we conclude that LT-HSCs do not exhibit any change in differentiation capacity with aging.

      Comment for our #3-11 response:

      The authors used the data in Figure S4 to claim that "myeloid genes were tended to be enriched in aged bulk-HSCs but not in aged LT-HSCs compared to their respective controls" (this is the title of the figure; line # 1326). This is based on an increase in gene expression of CD150, vWF, Selp, Itgb3 in aged cells compared to young cells (Figure S4B). However, an increase in Selp and Itgb3 is also observed for LT-HSCs (lower magnitude, but still an increase).

      Also, regarding the GSEA, the only term showing statistical significance in bulk HSCs is "Myeloid gene set", which does not reach significance in LT-HSCs but presents a trend for enrichment (q = 0.077). None of the terms shown in this panel presents statistical significance in ST-HSCs.

      Thank you for your valuable point. As the reviewer noted, the current title may cause confusion. Therefore, we propose changing it to the following:

      [P52, L1331] “Figure S4. Compared to their respective young controls, aged bulk-HSCs exhibit greater enrichment of myeloid gene expression than aged LT-HSCs”

    1. Author response:

      The following is the authors’ response to the previous reviews.

      Reviewer #1 (Public Review):

      Overall, the manuscript is very well written, the approaches used are clever, and the data were thoroughly analyzed. The study conveyed important information for understanding the circuit mechanism that shapes grid cell activity. It is important not only for the field of MEC and grid cells, but also for broader fields of continuous attractor networks and neural circuits.

      We appreciate the positive comments.

      (1) The study largely relies on the fact that ramp-like wide-field optogenetic stimulation and focal optogenetic activation both drove asynchronous action potentials in SCs, and therefore, if a pair of PV+ INs exhibited correlated activity, they should receive common inputs. However, it is unclear what criteria/thresholds were used to determine the level of activity asynchronization, and under these criteria, what percentage of cells actually showed synchronized or less asynchronized activity. A notable percentage of synchronized or less asynchronized SCs could complicate the results, i.e., PV+ INs with correlated activity could receive inputs from different SCs (different inputs), which had synchronized activity. More detailed information/statistics about the asynchronization of SC activity is necessary for interpreting the results.

      The percentage of SCs that show synchronised activity during ramping optogenetic activation is zero. To make this clear we've added new quantification to the analyses of simultaneously activated SCs in Figure 2, Figure Supplement 1. This includes confidence intervals for the correlograms and statistical comparisons of the correlograms to shuffled data from each pair of neurons. We also validate our statistical analysis strategy by showing that it successfully identifies autocorrelation peaks for the same cells.
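
      A comparison of a correlogram against shuffled data can be sketched as follows. This is a minimal illustration under stated assumptions rather than the analysis used in the study: spike trains are represented as binned 0/1 lists, the null distribution is built by randomly permuting the bins of one train, and the test statistic is the peak coincidence count within a lag window. The function names, bin counts and spike probabilities are all hypothetical.

```python
import random


def xcorr_counts(a, b, max_lag):
    """Coincidence counts between binary trains a and b for lags -max_lag..max_lag."""
    n = len(a)
    counts = []
    for lag in range(-max_lag, max_lag + 1):
        total = 0
        for i in range(max(0, -lag), min(n, n - lag)):
            total += a[i] * b[i + lag]
        counts.append(total)
    return counts


def shuffle_p_value(a, b, max_lag, n_shuffles, rng):
    """Fraction of shuffles whose correlogram peak is at least as large as the observed peak."""
    observed = max(xcorr_counts(a, b, max_lag))
    exceed = 0
    for _ in range(n_shuffles):
        shuffled = list(b)
        rng.shuffle(shuffled)  # destroys timing while preserving the spike count
        if max(xcorr_counts(a, shuffled, max_lag)) >= observed:
            exceed += 1
    # Add-one correction so the estimate is never exactly zero.
    return (exceed + 1) / (n_shuffles + 1)


rng = random.Random(0)
train_a = [1 if rng.random() < 0.05 else 0 for _ in range(1000)]
train_b = [1 if rng.random() < 0.05 else 0 for _ in range(1000)]

p_independent = shuffle_p_value(train_a, train_b, 5, 100, rng)
p_identical = shuffle_p_value(train_a, list(train_a), 5, 100, rng)
print(p_independent, p_identical)
```

      Applying the same procedure to a train paired with itself gives a positive control in the spirit of the autocorrelation check mentioned above: a perfectly synchronised pair should be flagged, whereas independent trains generally should not be.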

      Synchronisation during focal optogenetic activation is also expected to be zero. We did not commit resources to experiments to directly test this for focal stimulation because we had already tested the possibility with ramping stimuli discussed above, and because the established biophysics of local SC circuits is such that synchronised activity during selective activation of SCs is unlikely. In particular, because direct excitatory connections between SCs are either rare or absent (Fuchs et al. 2016; Couey et al. 2013; Pastoll et al. 2013; Winterer et al. 2017), and when detected have small amplitude (Winterer et al. 2017), no mechanism exists that could drive synchronisation. The absence of coordination in responses to ramping stimuli quantified above is consistent with this conclusion.

      (2) The hypothesis about the "direct excitatory-inhibitory" synaptic interactions is made based on the GABAzine experiments in Figure 4. In the Figure 8 diagram, the direct interaction is illustrated between PV+ INs and SCs. However, the evidence supporting this "direct interaction" between these two cell types is missing. Is it possible that pyramidal cells are also involved in this interaction? Some pieces of evidence or discussions are necessary to further support the "direct interaction".

      We were insufficiently clear in our previous attempts to ground these interpretations in the context of previous work. The hypothesis about "direct excitatory-inhibitory" interactions wasn't made solely on the basis of Figure 4, but from multiple previous studies that directly demonstrate these interactions (e.g. Fuchs et al. 2016; Couey et al. 2013; Pastoll et al. 2013). Similarly, the diagram in Figure 8 doesn't only reflect the conclusions of the present study but integrates work from these and other previous studies.

      A possible role for pyramidal cells in coordination would require that they can be driven to fire action potentials by input from SCs. However, SCs appear not to connect to pyramidal cells (0/126 tested connections in Winterer et al. 2017). Thus, this possibility is inconsistent with the previously published data.

      To make these points clearer we have added additional discussion and citations to the results (p 5), discussion (p 11) and legend to Figure 8.

      Reviewer #2 (Public Review):

      In this study, Huang et al. employed optogenetic stimulation alongside paired whole-cell recordings in genetically defined neuron populations of the medial entorhinal cortex to examine the spatial distribution of synaptic inputs and the functional-anatomical structure of the MEC. They specifically studied the spatial distribution of synaptic inputs from parvalbumin-expressing interneurons to pairs of excitatory stellate cells. Additionally, they explored the spatial distribution of synaptic inputs to pairs of PV INs. Their results indicate that both pairs of SCs and PV INs generally receive common input when their relative somata are within 200-300 µm of each other. The research is intriguing, with controlled and systematic methodologies. There are interesting takeaways based on the implications of this work to grid cell network organization in MEC.

      We appreciate the positive comments.

      (1) Results indicate that in brain slices, nearby cells typically share a higher degree of common input. However, some proximate cells lack this shared input. The authors interpret these findings as: "Many cells in close proximity don't seem to share common input, as illustrated in Figures 3, 5, and 7. This implies that these cells might belong to separate networks or exist in distinct regions of the connectivity space within the same network."

      Every slice orientation could have potentially shared inputs from an orthogonal direction that are unavoidably eliminated. For instance, in a horizontal section, shared inputs to two SCs might be situated either dorsally or ventrally from the horizontal cut, and thus removed during slicing. Given the synaptic connection distributions observed within each intact orientation, and considering these distributions appear symmetrically in both horizontal and sagittal sections, the authors should be equipped to estimate the potential number of inputs absent due to sectioning in the orthogonal direction. How might this estimate influence the findings, especially those indicating that many close neurons don't have shared inputs?

      We appreciate the suggestion. However, systematically generating estimates that account in full for the relative position of the postsynaptic neurons, for variation in the organisation of their dendritic fields, and for unknowns such as the location and number of synaptic contacts made quickly leads to a large potential parameter space, while not advancing our understanding beyond qualitative assessment of the raw data.

      Given this, we make the following comments:

      'We note that the absence of correlated inputs in one slice plane does not rule out the possibility that the same cell pair receives common inputs in a different plane, as these inputs would most likely not be activated if the cell bodies of the presynaptic neuron were removed by slicing.' (p10) and:

      'The incompleteness may in part result from loss of some inputs by tissue slicing. However, the fact that axons were well preserved and typically extended beyond the range of functional correlations, while many cell pairs that did not receive correlated input were relatively close to one another and had overlapping dendritic fields, argues against tissue slicing being a major contributor to incompleteness.' (p10).

      (2) The study examines correlations during various light-intensity phases of the ramp stimuli. One wonders if the spatial distribution of shared (or correlated) versus independent inputs differs when juxtaposing the initial light stimulation phase, which begins to trigger spiking, against subsequent phases. This differentiation might be particularly pertinent to the PV to SC measurements. Here, the initial phase of stimulation, as depicted in Figure 7, reveals a relatively sparse temporal frequency of IPSCs. This might not represent the physiological conditions under which high-firing INs function.

      While the authors seem to have addressed parts of this concern in their focal stim experiments by examining correlations during both high and low light intensities, they could potentially extract this metric from data acquired in their ramp conditions. This would be especially valuable for PV to SC measurements, given the absence of corresponding focal stimulation experiments.

      As the reviewer's comments recognise, the consistent results with focal stimulation already provide direct experimental validation to our ramp stimulation approach. We appreciate the suggestion for further analysis, but as we understand it this analysis would be hard to interpret. First, variation between pairs in the activity at different phases of the light ramp will be confounded by slice to slice differences in the level of ChR2 expression, e.g. in Figure 2, Figure Supplement 1 within slice variability is low, whereas between slice variation is relatively high. This is because in slices with relatively low expression spike onset is relatively late, while in slices with relatively high expression spike onset is early in the ramp and later in the ramp neurons experience depolarising block. Second, the onset of changes in cross-correlation coefficients and lag variation is typically abrupt. This makes it challenging to assign windows to onset phases or to interpret the resulting data.

      (3) Re results from Figure 2: Please fully describe the model in the methods section. Generally, I like using a modeling approach to explore the impact of convergent synaptic input to PVs from SCs that could effectively validate the experimental approach and enhance the interpretability of the experimental stim/recording outcomes. However, as currently detailed in the manuscript, the model description is inadequate for assessing the robustness of the simulation outcomes. If the IN model is simply integrate-and-fire with minimal biophysical attributes, then the results shown in Fig 2F might be trivial. Conversely, if the model offers a more biophysically accurate representation (e.g., with conductance-based synaptic inputs, synapses appropriately dispersed across the model IN dendritic tree, and standard PV IN voltage-gated membrane conductances), then the model's results could serve as a meaningful method to both validate and interpret the experiments.

      We have expanded the description of the modelling given in the methods including clearer motivation and justification (p 15). Two points are helpful to consider:

      First, the goal of the model is to assess the feasibility of the correlation based approach given the synaptic current responses recorded at the soma. We now make this clearer by stating that:

      'The goal of our simulations was to assess if analysis of cross-correlations between currents recorded from pairs of neurons could be used to establish whether they receive shared input from the same pre-synaptic neuron. While this should be obvious if neurons exclusively receive shared input, we wanted to establish whether shared input is detectable when each neuron also receives independent inputs of similar frequency and amplitude to the shared input.' (p 15).

      The suggestion that the results in Figure 2F are trivial doesn't make sense to us. Indeed, it strikes us as non-trivial that with this approach shared input from a single common presynaptic neuron is not detectable, but input from two or more is.

      Second, because we are simulating a somatic voltage-clamp experiment, the details of the neuronal time constants, voltage-gated channels or other integrative mechanisms that the reviewer suggests may be important here are not actually relevant to the interpretation. To appreciate this, consider the membrane equation:

      $$I_{m} = C_{m}\frac{dV}{dt} + I_{ion} + I_{rest} + I_{syn}$$

      When the membrane is clamped at a fixed potential, there is no capacitance current ($C_{m}\,dV/dt = 0$), while voltage-dependent ionic currents ($I_{ion}$) and the resting ionic current ($I_{rest}$) are constant. In this case the only time-varying current is the synaptic current ($I_{syn}$). Thus, adding more details would not make the model more 'meaningful' as these details would be redundant and the results will be the same as simply considering convolution of the synaptic conductances. We have made this rationale clearer in the revised methods (p 15).
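
      The voltage-clamp argument can be illustrated with a toy simulation in which each recorded current is nothing more than presynaptic event trains convolved with a synaptic kernel. This is a sketch under stated assumptions, not the model described in the methods: the Poisson event trains, the single-exponential kernel, the rates, time constant, amplitude and the numbers of shared and independent inputs are all hypothetical choices, and the zero-lag Pearson correlation stands in for the full cross-correlation analysis.

```python
import math
import random


def poisson_train(rate_hz, duration_s, dt, rng):
    """Binned event train: 1.0 where a presynaptic event falls in the bin."""
    n_bins = int(round(duration_s / dt))
    p_event = rate_hz * dt
    return [1.0 if rng.random() < p_event else 0.0 for _ in range(n_bins)]


def synaptic_current(trains, tau_s, dt, amplitude):
    """Sum the trains and convolve with an exponential kernel (recursive form)."""
    decay = math.exp(-dt / tau_s)
    value = 0.0
    current = []
    for i in range(len(trains[0])):
        value = value * decay + amplitude * sum(train[i] for train in trains)
        current.append(value)
    return current


def pearson(x, y):
    """Zero-lag Pearson correlation between two equally long traces."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


rng = random.Random(1)


def make_trains(count):
    return [poisson_train(20.0, 10.0, 0.001, rng) for _ in range(count)]


# Pair 1: two shared presynaptic neurons plus two independent ones per cell.
shared = make_trains(2)
cell_a = synaptic_current(shared + make_trains(2), 0.005, 0.001, 1.0)
cell_b = synaptic_current(shared + make_trains(2), 0.005, 0.001, 1.0)
r_shared = pearson(cell_a, cell_b)

# Pair 2: four independent presynaptic neurons per cell, nothing shared.
cell_c = synaptic_current(make_trains(4), 0.005, 0.001, 1.0)
cell_d = synaptic_current(make_trains(4), 0.005, 0.001, 1.0)
r_independent = pearson(cell_c, cell_d)

print(round(r_shared, 2), round(r_independent, 2))
```

      Because no membrane dynamics enter the calculation, any correlation between the two traces can only come from the shared event trains, which is the point of the argument above. The sketch is not tuned to reproduce the detectability thresholds reported in Figure 2F.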

      Reviewer #3 (Public Review):

      These are technically demanding experiments, but the authors show quite convincing differences in the correlated response of cell pairs that are close to each other in contrast to an absence of correlation in other cell pairs at a range of relative distances. This supports their main point of demonstrating anatomical clusters of cells receiving shared inhibitory input.

      We appreciate the positive comments.

      The overall technique is complex and the presentation could be more clear about the techniques and analysis.

      Thanks. We've added additional explanation to the methods section to try to improve clarity (p 15-16).

      In addition, due to this being a slice preparation they cannot directly relate the inhibitory interactions to the functional properties of grid cells which was possible in the 2-photon in vivo imaging experiment by Heys and Dombeck, 2014.

      We agree the two approaches are complementary. The Heys and Dombeck study could only reveal correlations in functional activity, which could have many possible synaptic mechanisms, whereas our results address synaptic organisation but the representational roles of the specific neurons we recorded from are unclear. We have highlighted these current limitations and strategies to address them in the final paragraph of the discussion (p 11).

    1. Author Response

      The following is the authors’ response to the previous reviews.

      Reviewer #3 comment

      1) One suggestion for improvement is to consider incorporating the results from Figure S9 into the main Figure 6, which would enhance readers' comprehension.

      We appreciate your valuable feedback. Based on the reviewer’s suggestion, we have incorporated the results from Figure S9 into the main Figure 6, as shown below. The manuscript text and figure legends have also been modified accordingly.

      Author response image 1.

    1. Author response:

      The following is the authors’ response to the previous reviews.

      Public Reviews:

      Reviewer #1 (Public review):

      Summary:

      The authors aim to assess the effect of salt stress on root:shoot ratio, identify the underlying genetic mechanisms, and evaluate their contribution to salt tolerance. To this end, the authors systematically quantified natural variations in salt-induced changes in root:shoot ratio. This innovative approach considers the coordination of root and shoot growth rather than exploring biomass and the development of each organ separately. Using this approach, the authors identified a gene cluster encoding eight paralog genes with a domain-of-unknown-function 247 (DUF247), with the majority of SNPs clustering into SR3G (At3g50160). In the manuscript, the authors utilized an integrative approach that includes genomic, genetic, evolutionary, histological, and physiological assays to functionally assess the contribution of their genes of interest to salt tolerance and root development.

      Comments on revisions:

      As the authors correctly noted, variations across samples, genotypes, or experiments make achieving statistical significance challenging. Should the authors choose to emphasize trends across experiments to draw biological conclusions, careful revisions of the text, including titles and figure legends, will be necessary to address some of the inconsistencies between figures (see examples below). However, I would caution that this approach may dilute the overall impact of the work on SR3G function and regulation. Therefore, I strongly recommend pursuing additional experimental evidence wherever possible to strengthen the conclusions.

      (1) Given the phenotypic differences shown in Figures S17A-B, 10A-C, and 6A, the statement that "SR3G does not play a role in plant development under non-stress conditions" (lines 680-681) requires revision to better reflect the observed data.

      Thank you to the reviewer for the comment. We appreciate the acknowledgment that variations among experiments are inherent to biological studies. Figures 6A and S17 represent the same experiment, which initially indicated a phenotype for the sr3g mutant under salt stress. To ensure that growth changes were specifically normalized for stress conditions, we calculated the Stress Tolerance Index (Fig. 6B). In Figure 10, we repeated the experiment including all five genotypes, which supported our original observation that the sr3g mutant exhibited a trend toward reduced lateral root number under 75 mM NaCl compared to Col-0, although this difference was not significant (Fig. 10B). Additionally, we confirmed that the wrky75 mutant showed a significant reduction in main root growth under salt stress compared to Col-0, consistent with findings reported in The Plant Cell by Lu et al. 2023. For both main root length and lateral root number, we demonstrated that the double mutants of wrky75/sr3g displayed growth comparable to wild-type Col-0. This result suggests that the sr3g mutation compensates for the salt sensitivity of the wrky75 mutant.

      We completely agree with the reviewer that there is a variation in our results regarding the sr3g phenotype under control conditions, as presented in Fig. 6A/Fig. S17 and Fig. 10A-C. In Fig. 6A/Fig. S17, we did not observe any consistent trends in main root or lateral root length for the sr3g mutant compared to Col-0 under control conditions. However, in Fig. 10A-C, we observed a significant reduction in main root length, lateral root number, and lateral root length for the sr3g mutant under control conditions. We believe this may align with SR3G’s role as a negative regulator of salt stress responses. While loss of this gene benefits plants in coping with salt stress, it might negatively impact overall plant growth under non-stress conditions. This interpretation is further supported by our findings on the root suberization pattern in sr3g mutants under control conditions (Fig. 8B), where increased suberization in root sections 1 to 3, compared to Col-0, could inhibit root growth. While SR3G's role in overall plant fitness is intriguing, it is beyond the scope of this study. We cannot rule out the possibility that SR3G contributes positively to plant growth, particularly root growth. That said, we observed no differences in shoot growth between Col-0 and the sr3g mutant under control conditions (Fig. 7). Additionally, we calculated the Stress Tolerance Index for all aspects of root growth shown in Fig. 10 and presented it in Fig. S25.

      To address the reviewer’s request to rephrase the statement "SR3G does not play a role in plant development under non-stress conditions" (lines 680-681): this statement is found in lines 652-653 and corresponds to Fig. 7, where we evaluated rosette growth in the WT and the sr3g mutant under both control and salt stress conditions. We did not observe any significant differences, or even trends, between the two genotypes under control conditions, which confirms the accuracy of the statement. To clarify further, we have reworded it to read “SR3G does not play a role in rosette growth and development under non-stress conditions”.

      (2) I agree with the authors that detecting expression differences in lowly expressed genes can be challenging. However, as demonstrated in the reference provided (Lu et al., 2023), a significant reduction in WRKY75 expression is observed in T-DNA insertion mutant alleles of WRKY75. In contrast, Fig. 9B in the current manuscript shows no reduction in WRKY75 expression in the two mutant alleles selected by the authors, which suggests that these alleles cannot be classified as loss-of-function mutants (line 745). Additionally, the authors note that the wrky75 mutant exhibits reduced main root length under salt stress, consistent with the phenotype reported by Lu et al. (2023). However, other phenotypic discrepancies exist between the two studies. For example, 1) Lu et al. (2023) report that wrky75 root length is comparable to WT under control conditions, whereas the current manuscript shows that wrky75 root growth is significantly lower than WT; 2) under salt stress, Lu et al. (2023) show that wrky75 accumulates higher levels of Na+, whereas the current study finds Na+ levels in wrky75 indistinguishable from WT. To confirm the loss of WRKY75 function in these T-DNA insertion alleles the authors should provide additional evidence (e.g., Western blot analysis).

      We sincerely appreciate the reviewer acknowledging the challenge of detecting expression differences in lowly expressed genes, such as transcription factors. Transcription factors are typically expressed at lower levels compared to structural or enzymatic proteins, as they function as regulators where small quantities can have substantial effects on downstream gene expression.

      That said, we respectfully disagree with the reviewer’s interpretation that there is no reduction in WRKY75 expression in the two mutant lines tested in Fig. 9C. Among the two independent alleles examined, wrky75-3 showed a clear reduction in expression compared to WT Col-0 under both control and salt stress conditions. Using the Tukey test to compare all groups, we observed distinct changes in the assigned significance letters for each case:

      Col/root/control (cd) vs wrky75-3/root/control (cd): Although the same significance letter was assigned, we still observed a clear reduction in WRKY75 transcript abundance. More importantly, the variation in expression is notably lower compared to Col-0.

      Col/shoot/control (bcd) vs wrky75-3/shoot/control (a): This is a significant reduction compared to Col-0.

      Col/root/salt (cd) vs wrky75-3/root/salt (bcd): Once again, the reduction in WRKY75 transcript levels corresponds to changes in the assigned significance letters.

      Col/shoot/salt (bc) vs wrky75-3/shoot/salt (ab): Once again, the reduction in WRKY75 transcript levels corresponds to changes in the assigned significance letters.
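
      For readers less familiar with compact letter displays, the letter assignments listed above can be read mechanically. The sketch below assumes the usual convention for Tukey letter groupings, namely that two groups are distinguished at the chosen significance level only if they share no letter. The function name is hypothetical, and the letters are copied from the four comparisons above.

```python
def share_letter(letters_a, letters_b):
    """True if two compact-letter-display labels have at least one letter in common."""
    return bool(set(letters_a) & set(letters_b))


# (Col-0 letters, wrky75-3 letters) for each tissue/condition listed above.
comparisons = {
    "root/control": ("cd", "cd"),
    "shoot/control": ("bcd", "a"),
    "root/salt": ("cd", "bcd"),
    "shoot/salt": ("bc", "ab"),
}

for name, (col0, mutant) in comparisons.items():
    if share_letter(col0, mutant):
        verdict = "share a letter (not distinguished by the test)"
    else:
        verdict = "no shared letter (significantly different)"
    print(f"{name}: {verdict}")
```

      Under this convention a change in the assigned letters indicates a shift in how a group relates to the other groups in the comparison, while a pairwise difference is supported only when the two labels are disjoint.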

      To address the reviewer’s comment regarding the significant reduction in WRKY75 expression observed in T-DNA insertion mutant alleles of WRKY75 in the reference by Lu et al., 2023, we would like to draw the reviewer’s attention to the following points:

      a) Different alleles: The authors in The Plant Cell used different alleles than those used in our study, with one of their alleles targeting regions upstream of the WRKY75 gene. While we identified one of their described alleles (WRKY75-1, SALK_101367) on the T-DNA express website, which targets upstream of WRKY75, the other allele (wrky75-25) appears to have been generated through a different mechanism (possibly an RNAi line) that is not defined in the Plant Cell paper and does not appear on the T-DNA express website. The authors mention in their acknowledgements that they received these seeds as gifts from other labs: “We thank Prof. Hongwei Guo (Southern University of Science and Technology, China) and Prof. Diqiu Yu (Yunnan University, China) for kindly providing the WRKY75pro:GUS, 35Spro:WRKY75-GFP, wrky75-1, and wrky75-25 seeds. We thank Man-cang Zhang (Electrophysiology platform, Henan University) for performing the NMT experiment”.

      However, in our study, we selected two different T-DNAs that target the coding regions. While this may explain slight differences in the observed responses, both studies independently link WRKY75 to salt stress, regardless of the alleles used. For your reference, we have included a screenshot of the different alleles used.

      Author response image 1.

      b) Different developmental stages: They measured WRKY75 expression in 5-day-old seedlings. In our experiment, we used seedlings grown on 1/2x MS for 4 days, followed by transfer to treatment plates with or without 75 mM NaCl for one week. As a result, we analyzed older plants (12 days old) for gene expression analysis. Despite the difference in developmental stage, we were still able to observe a reduction in gene expression.

      c) Different tissues: The authors of The Plant Cell used whole seedlings for gene expression analysis, whereas we separated the roots and shoots and measured gene expression in each tissue type individually. This approach is logical, as WRKY75 is a root cell-specific transcription factor with higher expression in the roots compared to the shoots, as demonstrated in our analysis (Fig. 9C).

      Based on the reasoning above, we did work with loss-of-function mutants of WRKY75, particularly wrky75-3. To more accurately reflect the nature of the mutation, we have changed the term "loss-of-function" to "knock-down" in line 717.

      The reviewer mentioned phenotypic discrepancies between the two studies. We agree that there are some differences, particularly in the magnitude of responses or expression levels. However, despite variations in the alleles used, developmental stages, and tissue types, both studies reached the same conclusion: WRKY75 is involved in the salt stress response and acts as a positive regulator. We have discussed the differences between our study and The Plant Cell in the section above, summarizing them into three main points: different alleles, different developmental stages, and different tissue types.

      To address the reviewer’s comment regarding "Lu et al. (2023) report that wrky75 root length is comparable to WT under control conditions, whereas the current manuscript shows that wrky75 root growth is significantly lower than WT": We evaluated root growth differently than The Plant Cell study. In The Plant Cell (Fig. 5, H-J), root elongation was measured in 10-day-old plants at a single time point. They transferred five-day-old wild-type, wrky75-1, wrky75-25, and WRKY75-OE plants to 1/2× MS medium supplemented with 0 mM or 125 mM NaCl for further growth and photographed them 5 days after transfer. In contrast, our study used 4-day-old seedlings, which were transferred to 1/2× MS medium supplemented with 0, 75, or 125 mM NaCl for an additional 9 days of growth. Rather than measuring root growth only at the end, we scanned the roots every other day, up to five times, to assess root growth rates. Our method is therefore more precise than the approach used in The Plant Cell, as it captures growth changes throughout the developmental process. We do not underestimate the significance of the work conducted by other colleagues in the field, but we also recognize that each laboratory has its own approach and specific practices. This variation in experimental setup is intrinsic to biology, and we believe it is important to study biological phenomena in different ways, especially because the common or contrasting conclusions reached by different labs using different experimental setups shed more light on reproducibility and on gene contribution across conditions, which is intrinsic to phenotypic plasticity and GxE interactions.
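
      As a minimal sketch of the difference between an endpoint measurement and a growth rate derived from repeated scans, the snippet below fits a least-squares slope to root length against scan day. The scan days and root lengths are made-up illustrative values, and the least-squares slope is only an assumption about how a rate could be estimated; it is not necessarily the procedure used in either study.

```python
def growth_rate(days, lengths):
    """Least-squares slope of length versus time (length units per day)."""
    if len(days) != len(lengths) or len(days) < 2:
        raise ValueError("need at least two paired measurements")
    n = len(days)
    mean_day = sum(days) / n
    mean_len = sum(lengths) / n
    sxx = sum((d - mean_day) ** 2 for d in days)
    if sxx == 0:
        raise ValueError("all scans share the same time point")
    sxy = sum((d - mean_day) * (l - mean_len) for d, l in zip(days, lengths))
    return sxy / sxx


# Hypothetical data: scans every other day, five time points (lengths in cm).
scan_days = [0, 2, 4, 6, 8]
steady = [1.0, 2.0, 3.0, 4.0, 5.0]       # constant elongation
late_burst = [1.0, 1.0, 1.0, 1.0, 5.0]   # same endpoint, different trajectory

if __name__ == "__main__":
    # Both series end at the same length, so an endpoint-only
    # measurement cannot tell them apart; the fitted rates can.
    print("endpoint:", steady[-1], late_burst[-1])
    print("rate:", growth_rate(scan_days, steady), growth_rate(scan_days, late_burst))
```

      The two hypothetical series share the same final length but yield different fitted rates, which is the kind of difference a single time point would miss.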

      The Plant Cell used a very high salt concentration, starting at 125 mM, while we were more cautious in our approach, as such a high concentration can inhibit and obscure more subtle phenotypic changes.

      To address the reviewer’s comment on "Lu et al. (2023) show that wrky75 accumulates higher levels of Na+, whereas the current study finds Na+ levels in wrky75 indistinguishable from WT," we would like to highlight the differences in the methodologies used in both studies. The Plant Cell measured Na+ accumulation in the wrky75 mutant using xylem sap (Supplemental Figure S10), which appears to be a convenient and practical approach in their laboratory. In their experiment, wild-type and wrky75 mutant plants were grown in soil for 3 weeks, watered with either a mock solution or 100 mM NaCl solution for 1 day, and then xylem sap was collected for Na+ content analysis. In contrast, our study measured root and shoot Na+ and K+ content using Inductively Coupled Plasma Atomic Emission Spectroscopy (ICP-AES). Additionally, we collected samples after two weeks on treatment plates and focused on the Na+/K+ ratio, which we consider more relevant than net Na+ or K+ levels, as the ratio of these ions is a critical determinant of plant salt tolerance. With this in mind, we observed a considerable, though non-significant, increase in the Na+/K+ ratio in the shoots of the wrky75-3 mutant (assigned Tukey’s letter c) compared to the Col-0 WT (assigned Tukey’s letters abc) under 125 mM salt, suggesting that this mutant is salt-sensitive. Importantly, the Na+/K+ ratio in the double wrky75/sr3g mutants was reduced to the WT level under the same salt conditions, further indicating that the salt sensitivity of wrky75 is mitigated by the sr3g mutation.

      Based on the reasons mentioned above, we believe that conducting additional experiments, such as Western blot analysis, is unnecessary and would not contribute new insights or alter the context of our findings.

      Reviewer #2 (Public review):

      Summary:

      Salt stress is a significant and growing concern for agriculture in some parts of the world. While the effects of sodium excess have been studied in Arabidopsis and (many) crop species, most studies have focused on Na uptake, toxicity and overall effects on yield, rather than on developmental responses to excess Na, per se. The work by Ishka and colleagues aims to fill this gap.

      Working from an existing dataset that exposed a diverse panel of A. thaliana accessions to control, moderate, and severe salt stress, the authors identify candidate loci associated with altering the root:shoot ratio under salt stress. Following a series of molecular assays, they characterize a DUF247 protein which they dub SR3G, which appears to be a negative regulator of root growth under salt stress.

      Overall, this is a well-executed study which demonstrates the functional role played by a single gene in plant response to salt stress in Arabidopsis.

      Review of revised manuscript:

      The authors have addressed my point-by-point comments to my satisfaction. In the cases where they have changed their manuscript language, clarified figures, or added analyses I have no further comment. In some cases, there is a fruitful back-and-forth discussion of methodology which I think will be of interest to readers.

      I have nothing to add during this round of review. I think that the paper and associated discussion will make a nice contribution to the field.

      We sincerely appreciate the reviewer’s recognition of the significance of our work to the field.

      Recommendations for the authors:

      Reviewer #1 (Recommendations for the authors):

      Lines 518-519: The statement that other DUF247s exhibit similar expression patterns to SR3G, suggesting their responsiveness to salt stress, is not fully supported by Fig. S14. Please clarify the specific similarities (and differences) in the expression patterns of the DUF247s shown in Fig. S14, as their expression appears to be spatially and temporally diverse. Additionally, the scale is missing in Fig. S14.

      We thank the reviewer. We fixed the text and added expression scales to Figure S14.

      Line 684, Fig. 6A should be 7A.

      Thanks. It is fixed.

      Line 686, Fig. 7A should be 7B.

      Thanks. It is fixed.

      Lines 721-723: The signal quantification in Fig. 8B does not support the claim that "in section one,..., sr3g-5 showed more suberization compared to Col-0." Given the variability and noise often associated with histological dyes such as Fluorol Yellow staining, conclusions should be cautiously grounded in robust signal quantification. Additionally, please specify the number of biological replicates used in both Fig. 8B and C.

      We thank the reviewer for their comments. We believe the statement in the text accurately reflects our results presented in Figure 8B, where we stated “non-significant, but substantially higher levels of root suberization in sr3g-5 compared to Col-0 in sections one to three of the root under control condition (Fig. 8B).” Therefore, we kept the statement and have included the number of biological replicates in the figure legend.

      Lines 731-732: Please provide a more detailed explanation of how the significant changes in suberin monomer levels align with the Fluorol Yellow staining results, and clarify how these findings support the proposed negative role of SR3G in root suberization.

      Fluorol Yellow is a lipophilic dye widely used to label suberin in plant tissues, specifically in roots in this study. Given the inherent variability in histological assays, we confirmed the increase in suberization using an alternative method, Gas Chromatography–Mass Spectrometry (GC-MS). Both approaches revealed elevated suberin levels in the sr3g mutant compared to Col-0. Since the overall suberin content was higher in the mutant under both control and salt stress conditions, we proposed that SR3G acts as a negative regulator of root suberization.

      Lines 686-688 and Figure S24: The authors calculated water mass as FW-DW. A more standard approach for calculating water content is (FW-DW)/FW x 100. Please update the text or adjust the calculation accordingly. Additionally, if the goal is to test differences between WT and the mutant within each condition, a t-test would be a more appropriate statistical method.

      We thank the reviewer. We added water content (%) to Figure S24. We kept the statistical test unchanged, as we wanted to be able to observe changes across both conditions and genotypes.
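
      For illustration, the two quantities discussed in this exchange, water mass (FW-DW) and water content ((FW-DW)/FW x 100), can be computed as in the sketch below. The function names and the example weights are hypothetical and chosen only to show the arithmetic.

```python
def water_mass(fresh_weight, dry_weight):
    """Water mass as fresh weight minus dry weight (same units as the inputs)."""
    if dry_weight < 0 or dry_weight > fresh_weight:
        raise ValueError("dry weight must lie between 0 and fresh weight")
    return fresh_weight - dry_weight


def water_content_percent(fresh_weight, dry_weight):
    """Water content as a percentage of fresh weight: (FW - DW) / FW x 100."""
    if fresh_weight <= 0:
        raise ValueError("fresh weight must be positive")
    return water_mass(fresh_weight, dry_weight) / fresh_weight * 100.0


if __name__ == "__main__":
    # Hypothetical rosettes (weights in mg): the larger plant holds more
    # water in absolute terms, yet both have the same relative water content.
    for fw, dw in [(200.0, 20.0), (100.0, 10.0)]:
        print(fw, dw, water_mass(fw, dw), water_content_percent(fw, dw))
```

      Expressing water as a percentage of fresh weight removes the dependence on plant size that the absolute water mass carries.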

      Lines 633-635 states that "No significant difference was observed between sr3g-4 and Col-0 (Fig. S18), except for the Stress Tolerance Index (STI) calculated using growth rates of lateral root length and number." However, based on the Figure S18 legend and statistical analysis (i.e., ns), it appears that the sr3g-4 mutant shows no alterations in root system architecture compared to Col-0. Please revise the text to accurately reflect the results of the statistical analysis.

      We thank the reviewer. We now fixed the text to reflect the result.

      Lines 698-707: The statistical analysis does not support the reported differences in the Na+/K+ ratio for the single and double mutants of sr3g-5 and wrky75-3 (Fig. 10D, where levels connected by the same letters indicate they are not significantly different). Furthermore, the conclusion that "the SR3G mutation indeed compensated for the increased Na+ accumulation observed in the wrky75 mutant under salt stress" is also based on non-significant differences (Fig. S25B). Please revise the text to accurately reflect the results of the statistical analysis. Additionally, since each mutant is compared to the WT, I recommend using Dunnett's test for statistical analysis.

      We thank the reviewer for their feedback. We have carefully revised the text to better support our findings. As previously mentioned, variations among samples are evident and are well-reflected across all our datasets. We have presented all data and focused on identifying trends within our samples to guide interpretation.

      We observed that the SR3G mutation effectively compensated for the increased Na+ accumulation observed in the wrky75 mutant under salt stress. A closer examination of the shoot Na+/K+ ratio under 125 mM salt shows that the wrky75 single mutant has a higher Na+/K+ ratio (indicated by the letter "c") compared to Col-0 (indicated by "abc") and the two double mutants (also indicated by "abc"). Therefore, we have retained the statistical analysis as originally conducted, and maintain our conclusions as is.

      Figure 6: data in panel C present the Na/K ratio, not Na+ content. Based on the statistical analysis of root Na+ levels presented in Fig. S17C, there is no significant difference between sr3g-5 and WT. Please update the title of Fig. 6. In addition, in panel A, the title of the Y-axis and figure legend should be "Lateral root growth rate" without the word length, and in panel C, the statistical analysis is missing.

      We thank the reviewer. We updated Fig. 6 title and fixed the Y-axis in panel A, and added statistical letters to panel C. Legend was updated to reflect the changes.

      Figure 7: Please clearly label the time points where significant differences between genotypes are observed for both early and late salt treatments. Was there a significant difference recorded between WT and sr3g-5 on day 0 under early salt stress? Such differences may arise from initial variations in plant size within this experiment, as indicated by Fig. 7B, where significant differences in rosette area are evident starting from day 0. Additionally, please indicate the statistical analysis in panel E.

      We thank the reviewer for this suggestion. We updated the figure by adding a statistical test to panel E. Although the difference in growth rate between the sr3g mutant and Col-0 is indeed significant at day 0, we would like to draw the reviewer’s attention to the fact that this growth rate was calculated over the 24 hours after salt stress was applied. Therefore, this difference in growth rate is related to exposure to salt stress. Moreover, the growth rate of Col-0 and the sr3g mutant does not differ in the two other treatments (Control and Late Salt Stress), further supporting the conclusion that sr3g affects rosette size and growth rate only under early salt stress conditions.

      We have also added the Salt Tolerance Index calculation to Figure S24 as additional evidence, controlling for potential differences in size between Col-0 and sr3g mutant.
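
      The idea of normalizing for size can be sketched as follows. The definition used here, each salt-treated value divided by the mean control value of the same genotype, is an assumption made for illustration; the manuscript's exact Salt Tolerance Index formula is not given in this response, and all numbers are hypothetical.

```python
from statistics import mean


def salt_tolerance_index(salt_values, control_values):
    """Each salt-treated value relative to the genotype's mean control value.

    Assumed definition for illustration: STI = value under salt / mean(control).
    """
    control_mean = mean(control_values)
    if control_mean == 0:
        raise ValueError("mean control value must be non-zero")
    return [value / control_mean for value in salt_values]


if __name__ == "__main__":
    # Hypothetical rosette areas (arbitrary units) for two genotypes that
    # differ in size under control conditions.
    small_genotype = salt_tolerance_index([1.0, 1.2], [2.0, 2.0])
    large_genotype = salt_tolerance_index([2.0, 2.4], [4.0, 4.0])
    # Identical indices despite a two-fold difference in absolute size.
    print(small_genotype, large_genotype)
```

      Under this assumed definition, two genotypes with different baseline sizes but the same proportional response to salt receive the same index.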

      Figure S17: statistical analysis is not indicated in panels A, B, and D.

      We thank the reviewer for spotting that. We updated the figure with a statistical test.

      Figures S21-23: The quality of these figures is insufficient, hindering the ability to effectively interpret the authors' results and main message. Furthermore, a Dunnett's test, rather than a t-test, is the appropriate statistical method for this analysis.

      We thank the reviewer for this observation. We have now provided high-resolution versions of all supplemental figures, which should improve their legibility. As we are comparing each genotype to Col-0 one by one, the results of individual t-tests are sufficient for this analysis.

    1. Author response:

      The following is the authors’ response to the previous reviews.

      Public Reviews:

      Reviewer #1 (Public Review):

      In their paper, Kang et al. investigate rigidity sensing in amoeboid cells, showing that, despite their lack of proper focal adhesions, amoeboid migration of single cells is impacted by substrate rigidity. In fact, many different amoeboid cell types can durotax, meaning that they preferentially move towards the stiffer side of a rigidity gradient. 

      The authors observed that NMIIA is required for durotaxis and, building on this observation, they generated a model to explain how durotaxis could be achieved in the absence of strong adhesions. According to the model, substrate stiffness alters the diffusion rate of NMIIA, with softer substrates allowing for faster diffusion. This allows for NMIIA accumulation at the back, which, in turn, results in durotaxis.

      The authors responded to all my comments and I have nothing to add. The evidence provided for durotaxis of non adherent (or low-adhering) cells is strong. I am particularly impressed by the fact that amoeboid cells can durotax even when not confined. I wish to congratulate the authors for the excellent work, which will fuel discussion in the field of cell adhesion and migration.

      We thank the reviewer for critically evaluating our work and giving kind suggestions. We are glad that the reviewer found our work to be of potential interest to the broad scientific community.

      Reviewer #2 (Public Review):

      Summary:

      The authors developed an imaging-based device that provides both spatial confinement and stiffness gradient to investigate if and how amoeboid cells, including T cells, neutrophils, and Dictyostelium, can durotax. Furthermore, the authors showed that the mechanism for the directional migration of T cells and neutrophils depends on non-muscle myosin IIA (NMIIA) polarized towards the soft-matrix-side. Finally, they developed a mathematical model of an active gel that captures the behavior of the cells described in vitro.

      Strengths:

      The topic is intriguing as durotaxis is essentially thought to be a direct consequence of mechanosensing at focal adhesions. To the best of my knowledge, this is the first report on amoeboid cells that do not depend on FAs to exert durotaxis. The authors developed an imaging-based durotaxis device that provides both spatial confinement and stiffness gradient and they also utilized several techniques such as quantitative fluorescent speckle microscopy and expansion microscopy. The results of this study have well-designed control experiments and are therefore convincing.

      Weaknesses:

      Overall this study is well performed but there are still some minor issues I recommend the authors address:

      (1) When using NMIIA/NMIIB knockdown cell lines to distinguish the role of NMIIA and NMIIB in amoeboid durotaxis, it would be better if the authors took compensatory effects into account.

      We thank the reviewer for this suggestion. We have investigated the compensation of myosin in NMIIA and NMIIB KD HL-60 cells using Western blot and added this result in our updated manuscript (Fig. S4B, C). The results showed that the level of NMIIB protein in NMIIA KD cells doubled while there was no compensatory upregulation of NMIIA in NMIIB KD cells. This is consistent with our conclusion that NMIIA rather than NMIIB is responsible for amoeboid durotaxis since in NMIIA KD cells, compensatory upregulation of NMIIB did not rescue the durotaxis-deficient phenotype. 

      (2) The expansion microscopy assay is not clearly described and some details are missed such as how the assay is performed on cells under confinement.

      We thank the reviewer for this comment. We have updated the details of the expansion microscopy assay in lines 481-485 of our revised manuscript, including how the assay is performed on cells under confinement:

      Briefly, CD4+ Naïve T cells were seeded on a gradient PA gel with another upper gel providing confinement. 4% PFA was used to fix cells for 15 min at room temperature. After fixation, the upper gradient PA gel was carefully removed and the bottom gradient PA gel with seeded cells was immersed in an anchoring solution containing 1% acrylamide and 0.7% formaldehyde (Sigma, F8775) for 5 h at 37 °C.

      (3) In this study, an active gel model was employed to capture experimental observations. Previously, some active nematic models were also considered to describe cell migration, which is controlled by filament contraction. I suggest the authors provide a short discussion on the comparison between the present theory and those prior models.

      We thank the reviewer for this suggestion. Active nematic models have been employed to recapitulate many phenomena during cell migration (Nat Commun., 2018, doi: 10.1038/s41467-018-05666-8.). The active nematic model describes the motion of cells using the orientation field, Q, and the velocity field, u. The director field n (with n = −n) is employed to represent the nematic state, which has head-tail symmetry. However, in our experiments, actin filaments are clearly polarized: they polymerize and flow towards the direction of cell migration. Therefore, we chose the active gel model, which describes a polarized actin field during cell migration. In the discussion, we have provided a comparison between the active gel model and the motor-clutch model. We have also added a short discussion comparing the present model with the active nematic model in lines 345-347 of the main text:

      The active nematic model employs active extensile or contractile agents to push or pull the fluid along their elongation axis to simulate cells flowing (61). 

      (4) In the present model, actin flow contributes to cell migration while myosin distribution determines cell polarity. How does this model couple actin and myosin together?

      We thank the reviewer for this question. In our model, the polarization field is employed to couple actin and myosin together. Actin accumulates at the front while myosin diffuses in the opposite direction. Therefore, we propose that actin and myosin flow in opposite directions, which is captured in the convection terms of the actin and myosin density fields.

    1. Author Response

      The following is the authors’ response to the previous reviews.

      Recommendations for the authors:

      (1) Substantial revision of the claims and interpretation of the results is needed, especially in the setting of additional data showing enhanced erythrophagocytosis with decreased RBC lifespan.

      Thank you for your valuable feedback and suggestion for a substantial revision of the claims and interpretation of our results. We acknowledge the importance of considering additional data that shows enhanced erythrophagocytosis with decreased RBC lifespan. In response, we have revised our manuscript and incorporated additional experimental data to support and clarify our findings.

      (1) In our original manuscript, we reported a decrease in the number of splenic red pulp macrophages (RPMs) and phagocytic erythrocytes after hypobaric hypoxia (HH) exposure. This conclusion was primarily based on our observations of reduced phagocytosis in the spleen.

      (2) Additional experimental data on RBC labeling and erythrophagocytosis:

      • Experiment 1 (RBC labeling and HH exposure)

      We conducted an experiment where RBCs from mice were labeled with PKH67 and injected back into the mice. These mice were then exposed to normobaric normoxia (NN) or HH for 7 or 14 days. The subsequent assessment of RPMs in the spleen using flow cytometry and immunofluorescence detection revealed a significant decrease in both the population of splenic RPMs (F4/80hiCD11blo, new Figure 5A and C) and PKH67-positive macrophages after HH exposure (as depicted in new Figure 5A and C-E). This finding supports our original claim of reduced phagocytosis under HH conditions.

      Author response image 1.

      • Experiment 2 (erythrophagocytosis enhancement)

      To examine the effects of enhanced erythrophagocytosis, we injected Tuftsin after administering PKH67-labelled RBCs. Our observations showed a significant decrease in PKH67 fluorescence in the spleen, particularly after Tuftsin injection compared to the NN group. This result suggests a reduction in RBC lifespan when erythrophagocytosis is enhanced (illustrated in new Figure 7, A-B).

      Author response image 2.

      (3) Revised conclusions:

      • The additional data from these experiments support our original findings by providing a more comprehensive view of the impact of HH exposure on splenic erythrophagocytosis.

      • The decrease in phagocytic RPMs and phagocytic erythrocytes after HH exposure, along with the observed decrease in RBC lifespan following enhanced erythrophagocytosis, collectively suggest a more complex interplay between hypoxia, erythrophagocytosis, and RBC lifespan than initially interpreted.

      We think that these revisions and additional experimental data provide a more robust and detailed understanding of the effects of HH on splenic erythrophagocytosis and RBC lifespan. We hope that these changes adequately address the concerns raised and strengthen the conclusions drawn in our manuscript.

      (2) F4/80 high; CD11b low are true RPMs which the cells which the authors are presenting, i.e. splenic monocytes / pre-RPMs. To discuss RPM function requires the presentation of these cells specifically rather than general cells in the proper area of the spleen.

      Thank you for your feedback requesting a substantial revision of our claims and interpretation, particularly considering additional data showing enhanced erythrophagocytosis with decreased RBC lifespan. In response, we have thoroughly revised our manuscript and included new experimental data that further elucidate the effects of HH on RPMs and erythrophagocytosis.

      (1) Re-evaluation of RPMs population after HH exposure:

      • Flow cytometry analysis (new Figure 3G, Figure 5A and B): We revisited the analysis of RPMs (F4/80hiCD11blo) in the spleen after 7 and 14 days of HH exposure. Our revised flow cytometry data consistently showed a significant decrease in the RPMs population post-HH exposure, reinforcing our initial findings.

      Author response image 3.

      Author response image 4.

      • In situ expression of RPMs (Figure S1, A-D):

      We further confirmed the decreased population of RPMs through in situ co-staining with F4/80 and CD11b, and F4/80 and CD68, in spleen tissues. These results clearly demonstrated a significant reduction in F4/80hiCD11blo (Figure S1, A and B) and F4/80hiCD68hi (Figure S1, C and D) cells following HH exposure.

      Author response image 5.

      (2) Single-cell sequencing analysis of splenic RPMs:

      • We conducted a single-cell sequencing analysis of spleen samples post 7 days of HH exposure (Figure S2, A-C). This analysis revealed a notable shift in the distribution of RPMs, predominantly associated with Cluster 0 under NN conditions, to a reduced presence in this cluster after HH exposure.

      • Pseudo-time series analysis indicated a transition pattern change in spleen RPMs, with a shift from Cluster 2 and Cluster 1 towards Cluster 0 under NN conditions, and a reverse transition following HH exposure (Figure S2, B and D). This finding implies a decrease in resident RPMs in the spleen under HH conditions.

      (3) Consolidated findings and revised interpretation:

      • The comprehensive analysis of flow cytometry, in situ staining, and single-cell sequencing data consistently indicates a significant reduction in the number of RPMs following HH exposure.

      • These findings, taken together, strongly support the revised conclusion that HH exposure leads to a decrease in RPMs in the spleen, which in turn may affect erythrophagocytosis and RBC lifespan.

      Author response image 6.

      In conclusion, our revised manuscript now includes additional experimental data and analyses, strengthening our claims and providing a more nuanced interpretation of the impact of HH on spleen RPMs and related erythrophagocytosis processes. We believe these revisions and additional data address your concerns and enhance the scientific validity of our study.

      (3) RBC retention in the spleen should be measured anyway quantitatively, eg, with proper flow cytometry, to determine whether it is increased or decreased.

      Thank you for your query regarding the quantitative measurement of RBC retention in the spleen, particularly in relation to HH exposure. We have utilized a combination of techniques, including flow cytometry and histological staining, to investigate this aspect comprehensively. Below is a summary of our findings and methodology.

      (1) Flow cytometry analysis of labeled RBCs:

      • Our study employed both NHS-biotin (new Figure 4, A-D) and PKH67 labeling (new Figure 4, E-H) to track RBCs in mice exposed to HH. Flow cytometry results from these experiments (new Figure 4, A-H) showed a decrease in the proportion of labeled RBCs over time, both in the blood and spleen. Notably, the reduction in fluorescently labeled RBCs was significantly greater after NN exposure than after HH exposure, in both blood and spleen. The observed decrease in labeled RBCs was initially counterintuitive, as we expected an increase in RBC retention due to reduced erythrophagocytosis. However, this decrease can be attributed to the significantly increased production of RBCs following HH exposure, which dilutes the proportion of labeled cells.

      • Specifically, for blood, the biotin-labeled RBCs decreased by 12.06% under NN exposure and by 7.82% under HH exposure, while the PKH67-labeled RBCs decreased by 9.70% under NN exposure and by 4.09% under HH exposure. For spleen, the biotin-labeled RBCs decreased by 3.13% under NN exposure and by 0.46% under HH exposure, while the PKH67-labeled RBCs decreased by 1.16% under NN exposure and by 0.92% under HH exposure. These findings suggest that HH exposure leads to a decrease in the clearance rate of RBCs.

      Author response image 7.

      (2) Detection of erythrophagocytosis in spleen:

      To assess erythrophagocytosis directly, we labeled RBCs with PKH67 and analyzed their uptake by splenic macrophages (F4/80hi) after HH exposure. Our findings (new Figure 5, D-E) indicated a decrease in PKH67-positive macrophages in the spleen, suggesting reduced erythrophagocytosis.

      Author response image 8.

      (3) Flow cytometry analysis of RBC retention:

      Our flow cytometry analysis revealed a decrease in PKH67-positive RBCs in both blood and spleen (Figure S4). We postulated that this was due to increased RBC production after HH exposure. However, this method might not accurately reflect RBC retention, as it measures the proportion of PKH67-labeled RBCs relative to the total number of RBCs, which increased after HH exposure.
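
      The dilution effect described above can be made concrete with a small numerical sketch. All cell counts below are hypothetical and serve only to show that the labeled fraction depends on both clearance of labeled cells and the size of the total RBC pool.

```python
def labeled_fraction(labeled_cells, total_cells):
    """Fraction of RBCs carrying the label, as reported by flow cytometry."""
    if total_cells <= 0 or not 0 <= labeled_cells <= total_cells:
        raise ValueError("need 0 <= labeled_cells <= total_cells and total_cells > 0")
    return labeled_cells / total_cells


if __name__ == "__main__":
    # Hypothetical starting point: 100 labeled cells in a pool of 1000.
    start = labeled_fraction(100, 1000)

    # Scenario A: 40 labeled cells are cleared, pool size unchanged.
    high_clearance = labeled_fraction(60, 1000)

    # Scenario B: only 10 labeled cells are cleared, but new unlabeled
    # RBCs expand the pool to 1500 cells.
    low_clearance_expanded_pool = labeled_fraction(90, 1500)

    # Both scenarios give the same labeled fraction even though far more
    # labeled cells are retained in scenario B.
    print(start, high_clearance, low_clearance_expanded_pool)
```

      Because the two scenarios are indistinguishable by labeled fraction alone, an absolute count of retained labeled cells is needed to separate reduced clearance from increased production.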

      Author response image 9.

      (4) Histological and immunostaining analysis:

      Histological examination using HE staining and band3 immunostaining in situ (new Figure 6, A-D, and G-H) revealed a significant increase in RBC numbers in the spleen after HH exposure. This was further confirmed by detecting retained RBCs in splenic single cells using Wright-Giemsa composite stain (new Figure 6, E and F) and retained PKH67-labelled RBCs in spleen (new Figure 6, I and J).

      Author response image 10.

      (5) Interpreting the data:

      The comprehensive analysis suggests a complex interplay between increased RBC production and decreased erythrophagocytosis in the spleen following HH exposure. While flow cytometry indicated a decrease in the proportion of labeled RBCs, histological and immunostaining analyses demonstrated an actual increase in RBC retention in the spleen. These findings collectively suggest that while overall RBC production is upregulated following HH exposure, the spleen's capacity for erythrophagocytosis is concurrently diminished, leading to increased RBC retention.

      (6) Conclusion:

      Taken together, our results indicate a significant increase in RBC retention in the spleen post-HH exposure, likely due to reduced residual RPMs and erythrophagocytosis. This conclusion is supported by a combination of flow cytometry, histological staining, and immunostaining techniques, providing a comprehensive view of RBC dynamics under HH conditions. We think these findings offer a clear quantitative measure of RBC retention in the spleen, addressing the concerns raised in your question.

      (4) Numerous other methodological problems as listed below.

      We appreciate your question, which highlights the importance of using multiple analytical approaches to understand complex physiological processes. Please find below our point-by-point response to the methodological comments.

      Reviewer #1 (Recommendations For The Authors):

      (1) Decreased BM and spleen monocytes due to increased liver monocyte migration is unclear. There is no evidence that this happens or why it would be a reasonable hypothesis, even in splenectomized mice.

      Thank you for highlighting the need for further clarification and justification of our hypothesized decrease in BM and spleen monocytes due to increased monocyte migration to the liver, particularly in the context of splenectomized mice. Indeed, our study has not explicitly verified an augmentation in mononuclear cell migration to the liver in splenectomized mice.

      Nonetheless, our investigations have revealed a notable increase in monocyte migration to the liver after HH exposure. Noteworthy is our discovery of a significant upregulation in colony stimulating factor-1 (CSF-1) expression in the liver, observed after both 7 and 14 days of HH exposure (data not included). This observation was substantiated through flow cytometry analysis (as depicted in Figure S4), which affirmed an enhanced migration of monocytes to the liver. Specifically, we noted a considerable increase in the population of transient macrophages, monocytes, and Kupffer cells in the liver following HH exposure.

      Author response image 11.

      Considering these findings, we hypothesize that hypoxic conditions may activate a compensatory mechanism that directs monocytes towards the liver, potentially linked to the liver’s integral role in the systemic immune response. In accordance with these insights, we intend to revise our manuscript to reflect the speculative nature of this hypothesis more accurately, and to delineate the strategies we propose for its further empirical investigation. This amendment ensures that our hypothesis is presented with full consideration of its speculative basis, supported by a coherent framework for future validation.

      (2) While F4/80+CD11b+ population is decreased, this is mainly driven by CD11b and F4/80+ alone population is significantly increased. This is counter to the hypothesis.

      Thank you for addressing the apparent discrepancy in our findings concerning the F4/80+CD11b+ population and the increase in the F4/80+ alone population, which seems to contradict our initial hypothesis. Your observation is indeed crucial for the integrity of our study, and we appreciate the opportunity to clarify this matter.

      (1) Clarification of flow cytometry results:

      • In response to the concerns raised, we revisited our flow cytometry experiments with a focus on more clearly distinguishing the cell populations. Our initial graph had some ambiguities in cell grouping, which might have led to misinterpretations.

      • The revised flow cytometry analysis, specifically aimed at identifying red pulp macrophages (RPMs) characterized as F4/80hiCD11blo in the spleen, demonstrated a significant decrease in the F4/80 population. This finding is now in alignment with our immunofluorescence results.

      Author response image 12.

      Author response image 13.

      (2) Revised data and interpretation:

      • The results presented in new Figure 3G and Figure 5 (A and B) consistently indicate a notable reduction in the RPMs population following HH exposure. This supports our revised understanding that HH exposure leads to a decrease in the specific macrophage subset (F4/80hiCD11blo) in the spleen.

      We’ve updated our manuscript to reflect these new findings and interpretations. The revised manuscript describes the new flow cytometry analysis and discusses the potential mechanisms behind the observed changes in macrophage populations.

      (3) HO-1 expression cannot be used as a surrogate to quantify number of macrophages as the expression per cell can decrease and give the same results. In addition, the localization of effect to the red pulp is not equivalent to an assertion that the conclusion applies to macrophages given the heterogeneity of this part of the organ and the spleen in general.

      Thank you for your insightful comments regarding the use of HO-1 expression as a surrogate marker for quantifying macrophage numbers, and for pointing out the complexity of attributing changes in HO-1 expression specifically to macrophages in the splenic red pulp. Your observations are indeed valid and warrant a detailed response.

      (1) Role of HO-1 in macrophage activity:

      • In our study, HO-1 expression was not utilized as a direct marker for quantifying macrophages. Instead, it was considered an indicator of macrophage activity, particularly in relation to erythrophagocytosis. HO-1, being upregulated in response to erythrophagocytosis, serves as an indirect marker of this process within splenic macrophages.

      • The rationale behind this approach was that increased HO-1 expression, induced by erythrophagocytosis in the spleen’s red pulp, could suggest an augmentation in the activity of splenic macrophages involved in this process.

      (2) Limitations of using HO-1 as an indicator:

      • We acknowledge your point that HO-1 expression per cell might decrease, potentially leading to misleading interpretations if used as a direct quantifier of macrophage numbers. The variability in HO-1 expression per cell indeed presents a limitation in using it as a sole indicator of macrophage quantity.

      • Furthermore, your observation about the heterogeneity of the spleen, particularly the red pulp, is crucial. The red pulp is a complex environment with various cell types, and asserting that changes in HO-1 expression are exclusive to macrophages could oversimplify this complexity.
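
      The limitation acknowledged above can be illustrated with a minimal sketch. It assumes, purely for illustration, that the bulk HO-1 signal is the product of macrophage number and mean HO-1 expression per cell; the function name and all numbers are hypothetical and are not measurements from our study.

```python
# Hypothetical illustration: bulk HO-1 signal modeled as cells x expression per cell.
# None of these values are measured data.

def total_ho1_signal(cell_count, expression_per_cell):
    """Return the bulk signal under the simple multiplicative assumption."""
    return cell_count * expression_per_cell

baseline = total_ho1_signal(100, 2.0)          # reference condition
fewer_cells = total_ho1_signal(50, 2.0)        # half the cells, same expression
lower_expression = total_ho1_signal(100, 1.0)  # same cells, half the expression

# Both scenarios give the same bulk signal, so the bulk signal alone
# cannot distinguish cell loss from reduced per-cell expression.
print(baseline, fewer_cells, lower_expression)
```

      Under this assumption the two scenarios are indistinguishable from the bulk signal alone, which is why additional macrophage-specific markers are needed.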

      (3) Addressing the concerns:

      • To address these concerns, we propose to supplement our HO-1 expression data with additional specific markers for macrophages. This would help in correlating HO-1 expression more accurately with macrophage numbers and activity.

      • We also plan to conduct further studies to delineate the specific cell types in the red pulp contributing to HO-1 expression. This could involve techniques such as immunofluorescence or immunohistochemistry, which would allow us to localize HO-1 expression to specific cell populations within the splenic red pulp.

      We’ve revised our manuscript to clarify the role of HO-1 expression as an indirect marker of erythrophagocytosis and to acknowledge its limitations as a surrogate for quantifying macrophage numbers.

      (4) line 63-65 is inaccurate as red cell homeostasis reaches a new steady state in chronic hypoxia.

      Thank you for pointing out the inaccuracy in lines 63-65 of our manuscript regarding red cell homeostasis in chronic hypoxia. Your feedback is invaluable in ensuring the accuracy and scientific integrity of our work. We’ve revised lines 63-65 to reflect this point accurately.

      (5) Eryptosis is not defined in the manuscript.

      Thank you for highlighting the omission of a definition for eryptosis in our manuscript. We acknowledge the importance of defining such key terms precisely, particularly when they play a crucial role in the context of our research findings. Eryptosis, a term referenced in our study, is a specialized form of programmed cell death unique to erythrocytes. Similar to apoptosis in other cell types, eryptosis is characterized by distinct physiological changes including cell shrinkage, membrane blebbing, and the externalization of phosphatidylserine on the erythrocyte surface. These features are indicative of the RBC lifecycle and its regulated destruction process.

      However, it is pertinent to note that our current study does not examine the mechanisms or implications of eryptosis in depth. Our primary focus has been to elucidate the effects of HH exposure on splenic erythrophagocytosis and the resultant impact on the lifespan of RBCs. Given this focus, and to maintain the coherence and relevance of our manuscript, we have decided to exclude specific discussions of eryptosis from our revised manuscript. This decision aligns with our aim to provide a clear and concentrated exploration of the influence of HH exposure on RBC dynamics and splenic function.

      We appreciate your input, which has significantly contributed to enhancing the clarity and accuracy of our manuscript. The revision ensures that our research is presented with a focused scope, aligning closely with our experimental investigations and findings.

      (6) Physiologically, there is no evidence that there is any "free iron" in cells, making line 89 point inaccurate.

      Thank you for highlighting the concern regarding the reference to "free iron" in cells in line 89 of our manuscript. The term "free iron" in our manuscript was intended to refer to divalent iron (Fe2+), rather than unbound iron ions freely circulating within cells. We acknowledge that the term "free iron" might lead to misconceptions, as it implies the presence of unchelated iron, which is not physiologically common due to the potential for oxidative damage. To rectify this and provide clarity, we’ve revised line 89 of our manuscript to reflect our meaning more accurately. Instead of "free iron," we use "divalent iron (Fe2+)" to avoid any misunderstanding regarding the state of iron in cells. We also ensure that any implications drawn from the presence of Fe2+ in cells are consistent with current scientific literature and understanding.

      (7) Fig 1f no stats

      We appreciate your critical review and suggestions, which help to improve the accuracy and clarity of our research. We’ve revised the statistical chart for the new Figure 1F.

      (8) Splenectomy experiments demonstrate that erythrophagocytosis is almost completely replaced by functional macrophages in other tissues (likely Kupffer cells in the liver). there is only a minor defect and no data on whether it is in fact the liver or other organs that provide this replacement function and makes the assertions in lines 345-349 significantly overstated.

      Thank you for your critical assessment of our interpretation of the splenectomy experiments, especially concerning the role of erythrophagocytosis by macrophages in other tissues, such as Kupffer cells in the liver. We appreciate your observation that our assertions may be overstated and acknowledge the need for more specific data to identify which organs compensate for the loss of splenic erythrophagocytosis.

      (1) Splenectomy experiment findings:

      • Our findings in Figure 2D do indicate that in the splenectomized group under NN conditions, erythrophagocytosis is substantially compensated for by functional macrophages in other tissues. This is an important observation that highlights the body's ability to adapt to the loss of splenic function.

      • However, under HH conditions, our data suggest that the spleen plays an important role in managing erythrocyte turnover, as indicated by the significant impact of splenectomy on erythrophagocytosis and subsequent erythrocyte dynamics.

      (2) Addressing the lack of specific organ identification:

      • We acknowledge that our study does not definitively identify which organs, such as the liver or others, take over the erythrophagocytosis function post-splenectomy. This is an important aspect that needs further investigation.

      • To address this, we also plan to perform additional experiments that could more accurately point out the specific tissues compensating for the loss of splenic erythrophagocytosis. This could involve tracking labeled erythrocytes or using specific markers to identify macrophages actively engaged in erythrophagocytosis in various organs.

      (3) Revising manuscript statements:

      Considering your feedback, we’ve revised the statements in lines 345-349 (lines 378-383 in revised manuscript) to enhance the scientific rigor and clarity of our research presentation.

      (9) M1 vs M2 macrophage experiments are irrelevant to the main thrust of the manuscript, there are no references to support the use of only CD16 and CD86 for these purposes, and no stats are provided. It is also unclear why bone marrow monocyte data is presented and how it is relevant to the rest of the manuscript.

      Thank you for your critical evaluation of the relevance and presentation of the M1 vs. M2 macrophage experiments in our manuscript. We appreciate your insights, especially regarding the use of specific markers and the lack of statistical analysis, as well as the relevance of bone marrow monocyte data to our study's main focus.

      (1) Removal of M1 and M2 macrophage data:

      Based on your feedback and our reassessment, we agree that the results pertaining to M1 and M2 macrophages did not align well with the main objectives of our manuscript. Consequently, we have decided to remove the related content on M1 and M2 macrophages from the revised manuscript. This decision was made to ensure that our manuscript remains focused and coherent, highlighting our primary findings without the distraction of unrelated or insufficiently supported data.

      The use of only CD16 and CD86 markers for M1 and M2 macrophage characterization, without appropriate statistical analysis, was indeed a methodological limitation. We recognize that a more comprehensive set of markers and rigorous statistical analysis would be necessary for a meaningful interpretation of M1/M2 macrophage polarization. Furthermore, the relevance of these experiments to the central theme of our manuscript was not adequately established. Our study primarily focuses on erythrophagocytosis and red pulp macrophage dynamics under hypobaric hypoxia, and the M1/M2 polarization aspect did not contribute significantly to this narrative.

      (2) Clarification on bone marrow monocyte data:

      Regarding the inclusion of bone marrow monocyte data, we acknowledge that its relevance to the main thrust of the manuscript was not clearly articulated. In the revised manuscript, we provide a clearer rationale for its inclusion and how it relates to our primary objectives.

      (3) Commitment to clarity and relevance:

      We are committed to ensuring that every component of our manuscript contributes meaningfully to our overall objectives and research questions. Your feedback has been instrumental in guiding us to streamline our focus and present our findings more effectively.

      We appreciate your valuable feedback, which has led to a more focused and relevant presentation of our research. These changes enhance the clarity and impact of our manuscript, ensuring that it accurately reflects our key research findings.

      (10) Biotinolated RBC clearance is enhanced, demonstrating that RBC erythrophagocytosis is in fact ENHANCED, not diminished, calling into question the founding hypothesis that the manuscript proposes.

      Thank you for your critical evaluation of our data on biotinylated RBC clearance, which suggests enhanced erythrophagocytosis under HH conditions. This observation indeed challenges our founding hypothesis that erythrophagocytosis is diminished in this setting. Below is a summary of our findings and methodology.

      (1) Interpretation of RBC labeling results:

      Both the previous results with NHS-biotin-labeled RBCs (new Figure 4, A-D) and the current results with PKH67-labeled RBCs (new Figure 4, E-H) showed that the number of labeled RBCs decreased with increasing time after injection. RBC production, in both bone marrow and spleen, was significantly increased following HH exposure, resulting in a consistently lower proportion of labeled RBCs, as detected by flow cytometry, in both the blood and spleen of HH mice compared to the NN group. However, the magnitude of the decline in fluorescently labeled RBCs in blood and spleen was significantly smaller after HH exposure than under NN exposure. Specifically, for blood, the biotin-labeled RBCs decreased by 12.06% under NN exposure and by 7.82% under HH exposure, while the PKH67-labeled RBCs decreased by 9.70% under NN exposure and by 4.09% under HH exposure. For spleen, the biotin-labeled RBCs decreased by 3.13% under NN exposure and by 0.46% under HH exposure, while the PKH67-labeled RBCs decreased by 1.16% under NN exposure and by 0.92% under HH exposure.
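
      For convenience, the comparison of declines quoted above can be tabulated with a short sketch. It uses only the eight values stated in this response, in the units given in the text; the variable and function names are illustrative and do not come from our analysis pipeline.

```python
# Declines in labeled RBCs quoted in this response (units as given in the text).
declines = {
    ("blood", "biotin"): {"NN": 12.06, "HH": 7.82},
    ("blood", "PKH67"): {"NN": 9.70, "HH": 4.09},
    ("spleen", "biotin"): {"NN": 3.13, "HH": 0.46},
    ("spleen", "PKH67"): {"NN": 1.16, "HH": 0.92},
}

def decline_gap(values):
    """Return how much larger the decline is under NN than under HH."""
    return round(values["NN"] - values["HH"], 2)

gaps = {key: decline_gap(values) for key, values in declines.items()}

for (tissue, label), gap in gaps.items():
    print(f"{tissue:6s} {label:6s} NN minus HH decline: {gap:.2f}")
```

      In every tissue and labeling combination the decline is smaller under HH than under NN, which is the pattern described in the paragraph above.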

      Author response image 14.

      (2) Increased RBC production under HH conditions:

      It's important to note that RBC production, including from bone marrow and spleen, was significantly increased following HH exposure. This increase in RBC production could contribute to the decreased proportion of labeled RBCs observed in flow cytometry analyses, as there are more unlabeled RBCs diluting the proportion of labeled cells in the blood and spleen.

      (3) Analysis of erythrophagocytosis in RPMs:

      Our analysis of PKH67-labeled RBC content within RPMs following HH exposure showed a significant reduction in the number of PKH67-positive RPMs in the spleen (new Figure 5). This finding suggests a decrease in erythrophagocytosis by RPMs under HH conditions.

      Author response image 15.

      (4) Reconciling the findings:

      The apparent contradiction between enhanced RBC clearance (suggested by the reduced proportion of labeled RBCs) and reduced erythrophagocytosis in RPMs (indicated by fewer PKH67-positive RPMs) may be explained by the increased overall production of RBCs under HH. This increased production could mask the actual erythrophagocytosis activity when it is judged from the proportion of labeled cells. Therefore, while the proportion of labeled RBCs decreases more significantly under HH conditions, this does not necessarily indicate an enhanced erythrophagocytosis rate, but rather an increased dilution effect due to higher RBC turnover.
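
      The dilution argument can be made concrete with a minimal sketch. It assumes the labeled proportion is simply labeled cells divided by all cells; the cell counts are hypothetical and chosen only to show the direction of the effect, not to reproduce our data.

```python
# Hypothetical illustration of dilution: counts are invented, not measured.

def labeled_fraction(labeled, unlabeled):
    """Return the proportion of labeled cells among all cells."""
    return labeled / (labeled + unlabeled)

# Reference condition: 100 labeled cells among 1000 total.
reference = labeled_fraction(100, 900)

# More unlabeled cells are produced, but no labeled cell is cleared.
diluted_only = labeled_fraction(100, 1900)

# More unlabeled cells are produced and only a few labeled cells are cleared.
diluted_low_clearance = labeled_fraction(95, 1900)

print(reference, diluted_only, diluted_low_clearance)
```

      In this sketch the labeled proportion halves without a single labeled cell being removed, so a lower labeled proportion is compatible with unchanged or even reduced clearance.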

      (5) Revised interpretation and manuscript changes:

      Given these factors, we have updated our manuscript to reflect this detailed interpretation and to clarify the implications of the increased RBC production under HH conditions for our observations of labeled RBC clearance and erythrophagocytosis. We appreciate your insightful feedback, which has prompted a careful re-examination of our data and interpretations. We hope that these revisions provide a more accurate and comprehensive understanding of the effects of HH on erythrophagocytosis and RBC dynamics.

      (11) Legend in Fig 4c-4d looks incorrect and Fig 4e-4f is very non-specific since Wright stain does not provide evidence of what type of cells these are and making for a significant overstatement in the contribution of this data to "confirming" increased erythrophagocytosis in the spleen under HH exposure (line 395-396).

      Thank you for your insightful observations regarding the data presentation and figure legends in our manuscript, particularly in relation to Figure 4 (renamed as Figure 6 in the revised manuscript) and the use of Wright-Giemsa composite staining. We appreciate your constructive feedback and acknowledge the importance of presenting our data with utmost clarity and precision.

      (1) Amendments to Figure legends:

      We recognize the necessity of rectifying inaccuracies in the legends of the previously labeled Figure 4C and D. The legends have been corrected so that they accurately describe the data presented. Additionally, we acknowledge the error concerning the description of Wright staining. The method employed in our study is Wright-Giemsa composite staining, which, unlike Wright staining that solely stains cytoplasm (RBC), is capable of staining both nuclei and cytoplasm.

      (2) Addressing the specificity of Wright-Giemsa Composite staining:

      Our approach involved quantifying RBC retention using Wright-Giemsa composite staining on single splenic cells post-perfusion at 7 and 14 days post HH exposure. We understand and appreciate your concerns regarding the nonspecific nature of Wright staining. Although Wright stain is a general hematologic stain and is not specific for particular cell types, its application in our study aimed to provide preliminary insights. Splenic cells lacking nuclei, and thus likely to be RBCs, were stained and observed post-perfusion, indicating RBC retention within the spleen.

      (3) Incorporating additional methods for RBC identification:

      To enhance the specificity of our findings, we integrated supplementary methods for RBC identification in the revised manuscript. We employed band3 immunostaining (in the new Figure 6, C-D and G-H) and PKH67 labeling (Figure 6, I-J) for a more targeted identification of RBCs. Band3, serving as a reliable marker for RBCs, augments the specificity of our immunostaining approach. Likewise, PKH67 labeling affords a direct and definitive means to assess RBC retention in the spleen following HH exposure.

      Author response image 16. same as 10

      (4) Revised interpretation and manuscript modifications:

      Based on these enhanced methodologies, we have refined our interpretation of the data and accordingly updated the manuscript. The revised narrative underscores that our conclusions regarding reduced erythrophagocytosis and RBC retention under HH conditions are corroborated by not only Wright-Giemsa composite staining but also by band3 immunostaining and PKH67 labeling, each contributing distinctively to our comprehensive understanding.

      We are committed to ensuring that our manuscript precisely reflects the contribution of each method to our findings and conclusions. Your thorough review has been invaluable in identifying and rectifying areas for improvement in our research report and interpretation.

      (12) Ferroptosis data in Fig 5 is not specific to macrophages and Fer-1 data confirms the expected effect of Fer-1 but there is no data that supports that Fer-1 reverses the destruction of these cells or restores their function in hypoxia. Finally, these experiments were performed in peritoneal macrophages which are functionally distinct from splenic RPM.

      Thank you for your critique of our presentation and interpretation of the ferroptosis data in Figure 5 (renamed as Figure 9 in the revised manuscript), as well as your observations regarding the specificity of the experiments to macrophages and the effects of Fer-1. We value your input and acknowledge the need to clarify these aspects in our manuscript.

      (1) Clarification on cell type used in experiments:

      • We appreciate your attention to the details of our experimental setup. The experiments presented in Figure 9 were indeed conducted on splenic macrophages, not peritoneal macrophages, as incorrectly mentioned in the original figure legend. This was an error in our manuscript, and we have revised the figure legend accordingly to accurately reflect the cell type used.

      (2) Specificity of ferroptosis data:

      • We recognize that the data presented in Figure 9 need to be more explicitly linked to the specific macrophage population being studied. In the revised manuscript, we ensure that the discussion around ferroptosis data is clearly situated within the framework of splenic macrophages.

      • We also provide additional methodological details in the 'Methods' section to reinforce the specificity of our experiments to splenic macrophages.

      (3) Effects of Fer-1 on macrophage function and survival:

      • Regarding the effect of Fer-1, we agree that while our data confirms the expected effect of Fer-1 in inhibiting ferroptosis, we have not provided direct evidence that Fer-1 reverses the destruction of macrophages or restores their function in hypoxia.

      • To address this, we propose additional experiments to specifically investigate the impact of Fer-1 on the survival and functional restoration of splenic macrophages under hypoxic conditions. This would involve assessing not only the inhibition of ferroptosis but also the recovery of macrophage functionality post-treatment.

      (4) Revised interpretation and manuscript changes:

      • We’ve revised the relevant sections of our manuscript to reflect these clarifications and proposed additional studies. This includes modifying the discussion of the ferroptosis data to more accurately represent the cell types involved and the limitations of our current findings regarding the effects of Fer-1.

      • The revised manuscript presents a more detailed interpretation of the ferroptosis data, clearly describing what our current experiments demonstrate and what remains to be investigated.

      We are grateful for your insightful feedback, which has highlighted important areas for improvement in our research presentation. We think that these revisions will enhance the clarity and scientific accuracy of our manuscript, ensuring that our findings and conclusions are well-supported and precisely communicated.

      Reviewer #2 (Recommendations For The Authors):

      The following questions and remarks should be considered by the authors:

      (1) The methods should clearly state whether the HH was discontinued during the 7 or 14 day exposure for cleaning, fresh water etc. Moreover, how was CO2 controlled? The procedure for splenectomy needs to be described in the methods.

      Thank you for your inquiry regarding the specifics of our experimental methods, particularly the management of HH exposure and the procedure for splenectomy. We appreciate your attention to detail and the importance of these aspects for the reproducibility and clarity of our research.

      (1) HH exposure conditions:

      In our experiments, mice were continuously exposed to HH for the entire duration of 7 or 14 days, without interruption for activities such as cleaning or providing fresh water. This uninterrupted exposure was crucial for maintaining consistent hypobaric conditions throughout the experiment. The hypobaric chamber was configured to ensure a ventilation rate of 25 air exchanges per minute. This high ventilation rate was effective in regulating the concentration of CO2 inside the chamber, thereby maintaining a stable environment for the mice.

      (2) The splenectomy was performed as follows:

      After anesthesia, the mice were placed in a supine position, and their limbs were fixed. The abdominal operation area was shaved, disinfected, and covered with a sterile towel. A median incision was made in the upper abdomen, followed by laparotomy to locate the spleen. The spleen was then carefully pulled out through the incision. The arterial and venous directions in the splenic pedicle were examined, and two vascular forceps were used to clamp all the tissue in the main trunk of the blood vessels below the splenic portal. The splenic pedicle was cut between the forceps to remove the spleen. The end of the proximal hepatic artery was clamped with a vascular clamp, and double or through ligation was performed to secure the site. The abdominal cavity was then cleaned to ensure there was no bleeding at the ligation site, and the incision was closed. Post-operatively, the animals were housed individually. Generally, they were able to feed themselves after recovering from anesthesia and did not require special care.

      We hope this detailed description addresses your queries and provides a clear understanding of the experimental conditions and procedures used in our study. These methodological details are crucial for ensuring the accuracy and reproducibility of our research findings.

      (2) The lack of changes in MCH needs explanation? During stress erythropoiesis some limit in iron availability should cause MCH decrease particularly if the authors claim that macrophages for rapid iron recycling are decreased. Fig 1A is dispensable. Fig 1G NN control 14 days does not make sense since it is higher than 7 days of HH.

      Thank you for your inquiry regarding the lack of changes in Mean Corpuscular Hemoglobin (MCH) in our study, particularly in the context of stress erythropoiesis and decreased macrophage-mediated iron recycling. We appreciate the opportunity to provide further clarification on this aspect.

      (1) Explanation for stable MCH levels:

      • Our research identified a decrease in erythrophagocytosis and iron recycling in the spleen following HH exposure. Despite this, the MCH levels remained stable. This observation can be explained by considering the compensatory roles of other organs, particularly the liver and duodenum, in maintaining iron homeostasis.

      • Specifically, our investigations revealed an enhanced capacity of the liver to engulf RBCs and process iron under HH conditions. This increased hepatic erythrophagocytosis likely compensates for the reduced splenic activity, thereby stabilizing MCH levels.

      (2) Role of hepcidin and DMT1 expression:

      Additionally, hypoxia is known to influence iron metabolism through the downregulation of Hepcidin and upregulation of Divalent Metal Transporter 1 (DMT1) expression. These alterations lead to enhanced intestinal iron absorption and increased blood iron levels, further contributing to the maintenance of MCH levels despite reduced splenic iron recycling.

      (3) Revised Figure 1 and data presentation

      To address the confusion regarding the data presented in Figure 1G, we have made revisions in our manuscript. The original Figure 1G, which did not align with the expected trends, has been removed. In its place, we have included a statistical chart of Figure 1F in the new version of Figure 1G. This revision will provide a clearer and more accurate representation of our findings.

      (4) Manuscript updates and future research:

      • We have updated our manuscript to incorporate these explanations, ensuring that the rationale behind the stable MCH levels is clearly articulated. This includes a discussion on the role of the liver and duodenum in iron metabolism under hypoxic conditions.

      • Future research could explore in greater detail the mechanisms by which different organs contribute to iron homeostasis under stress conditions like HH, particularly focusing on the dynamic interplay between hepatic and splenic functions.

      We thank you for your insightful question, which has prompted a thorough re-examination of our findings and interpretations. We believe that these clarifications will enhance the overall understanding of our study and its implications in the context of iron metabolism and erythropoiesis under hypoxic conditions.

      (3) Fig 2 the difference between sham and splenectomy is really marginal and not convincing. Is there also a difference at 7 days? Why does the spleen size decrease between 7 and 14 days?

      Thank you for your observations regarding the marginal differences observed between sham and splenectomy groups in Figure 2, as well as your inquiries about spleen size dynamics over time. We appreciate this opportunity to clarify these aspects of our study.

      (1) Splenectomy vs. Sham group differences:

      • In our experiments, the difference between the sham and splenectomy groups under HH conditions, though subtle, was consistent with our hypothesis regarding the spleen's role in erythrophagocytosis and stress erythropoiesis. Under NN conditions, no significant difference was observed between these groups, which aligns with the expectation that the spleen's contribution is more pronounced under hypoxic stress.

      (2) Spleen size dynamics and peak stress erythropoiesis:

      • The observed splenic enlargement prior to 7 days can be attributed to a combination of factors, including the retention of RBCs and extramedullary hematopoiesis, which is known to be a response to hypoxic stress.

      • Prior research has shown that splenic stress erythropoiesis triggered by hypoxic conditions typically peaks within 3 to 7 days. This aligns with our Toluidine Blue (TO) staining results, which indicated that the peak of this response occurs at 7 days (as depicted in Figure 1, F-G). This peak is typically followed by a decline in extramedullary hematopoiesis, which could explain the observed reduction in spleen size between 7 and 14 days.

      • This pattern of splenic response under prolonged hypoxic stress is corroborated by studies such as those conducted by Wang et al. (2021), Harada et al. (2015), and Cenariu et al. (2021). These references collectively indicate that the spleen changes markedly in response to sustained hypoxia. The spleen initially enlarges, attributable to increased erythropoiesis and erythrophagocytosis. Subsequently, as these processes approach normalization, spleen size regresses.

      We’ve revised our manuscript to include a more detailed explanation of these splenic dynamics under HH conditions, referencing the relevant literature to provide a comprehensive context for our findings. We will also consider performing additional analysis or providing further data on spleen size changes at 7 days to support our observations and ensure a thorough understanding of the splenic response to hypoxic stress over time.

      (4) Fig 3 B the clusters should be explained in detail. If the decrease in macrophages in Fig 3K/L is responsible for the effect, why does splenectomy not have a much stronger effect? How do the authors know which cells died in the calcein stained population in Fig 3D?

      Thank you for your insightful questions regarding the details of our data presentation in Figure 3, particularly about the identification of cell clusters and the implications of macrophage reduction. We appreciate the opportunity to address these aspects and clarify our findings.

      (1) Explanation of cell clusters in Figure 3B:

      • In the revised manuscript, we have included detailed notes for each cell population represented in Figure 3B (Figure 3D in revised manuscript). These notes provide a clearer understanding of the cell types present in each cluster, enhancing the interpretability of our single-cell sequencing data.

      • This detailed annotation will help readers to better understand the composition of the splenic cell populations under study and how they are affected by hypoxic conditions.

      (2) Impact of splenectomy vs. macrophage reduction:

      • The interplay between the reduction in macrophage populations, as evidenced by our single-cell sequencing data, and the ramifications of splenectomy presents a multifaceted scenario. Notably, the observed decline in macrophage numbers following HH exposure does not straightforwardly equate to a comparable alteration in overall splenic function, as might be anticipated with splenectomy.

      • In the context of splenectomy under HH conditions, a significant increase in the RBC count was observed, surpassing that in non-splenectomized mice exposed to HH. This finding underscores the spleen's critical role in modulating RBC dynamics under HH. It also indirectly suggests that the diminished phagocytic capacity of the spleen following HH exposure contributes to an increased RBC count, albeit to a lesser extent than in the splenectomy group. We attribute this difference to the fact that, while the number of RPMs in the spleen is reduced after HH, they are still present, whereas after splenectomy they are entirely absent.

      • Splenectomy entails the complete removal of the spleen, thus eliminating a broad spectrum of functions beyond erythrophagocytosis and iron recycling mediated by macrophages. The nuanced changes observed in our study may be reflective of the spleen's diverse functionalities and the organism's adaptive compensatory mechanisms in response to the loss of this organ.

      (3) Calcein stained population in Figure 3D:

      • Regarding the identification of cell death in the calcein-stained population in Figure 3D (Figure 3A in revised manuscript), we acknowledge that the specific cell types undergoing death could not be distinctly determined from this analysis alone.

      • The calcein staining method allows for the visualization of live (calcein-positive) and dead (calcein-negative) cells, but it does not provide specific information about the cell types. The decrease in macrophage population was inferred from the single-cell sequencing data, which offered a more precise identification of cell types.

      (4) Revised manuscript and data presentation:

      • Considering your feedback, we have revised our manuscript to provide a more comprehensive explanation of the data presented in Figure 3, including the nature of the cell clusters and the interpretation of the calcein staining results.

      • We have also updated the manuscript to reflect the removal of Figure 3K/L results and to provide a more focused discussion on the relevant findings.

      We are grateful for your detailed review, which has helped us to refine our data presentation and interpretation. These clarifications and revisions will enhance the clarity and scientific rigor of our manuscript, ensuring that our conclusions are well-supported and accurately conveyed.

      (5) Is the reduced phagocytic capacity in Fig 4B significant? Erythrophagocytosis is compromised due to the considerable spontaneous loss of labelled erythrocytes; could other assays help? (potentially by a modified Chromium release assay?). Is it necessary to stimulate phagocytosis to see a significant effect?

      Thank you for your inquiry regarding the significance of the reduced phagocytic capacity observed in Figure 4B, and the potential for employing alternative assays to elucidate erythrophagocytosis dynamics under HH conditions.

      (1) Significance of reduced phagocytic capacity:

      The observed reduction in the amplitude of fluorescently labeled RBCs in both the blood and spleen under HH conditions suggests a decrease in erythrophagocytosis. This is indicative of a diminished phagocytic capacity, particularly when contrasted with NN conditions.

      (2) Investigation of erythrophagocytosis dynamics:

      To examine erythrophagocytosis under HH more closely, we used Tuftsin to enhance this process. Following the injection of PKH67-labeled RBCs and subsequent HH exposure, we noted a significant decrease in PKH67 fluorescence in the spleen, which was particularly marked after the administration of Tuftsin. This finding implies that stimulated erythrophagocytosis can influence RBC lifespan.

      (3) Erythrophagocytosis under normal and hypoxic conditions:

      Under normal conditions, the reduction in phagocytic activity is less apparent without stimulation. However, under HH conditions, our findings demonstrate a clear weakening of the phagocytic effect. While we established that promoting phagocytosis under NN conditions affects RBC lifespan, the impact of enhanced phagocytosis under HH on RBC numbers was not explicitly investigated.

      (4) Potential for alternative assays:

      Considering the considerable spontaneous loss of labeled erythrocytes, alternative assays such as a modified Chromium release assay could provide further insights. Such assays might offer a more nuanced understanding of erythrophagocytosis efficiency and the stability of labeled RBCs under different conditions.

      (5) Future research directions:

      The implications of these results suggest that future studies should focus on comparing the effects of stimulated phagocytosis under both NN and HH conditions. This would offer a clearer picture of the impact of hypoxia on the phagocytic capacity of macrophages and the subsequent effects on RBC turnover.

      In summary, our findings indicate a diminished erythrophagocytic capacity under HH, with enhanced phagocytosis affecting RBC lifespan. Further investigation, potentially using alternative assays, would help to clarify the dynamics of erythrophagocytosis in different physiological states.

      (6) Can the observed ferroptosis be influenced by bi- and not trivalent iron chelators?

      Thank you for your question regarding the potential influence of bi- and trivalent iron chelators on ferroptosis under hypoxic conditions. We appreciate the opportunity to discuss the implications of our findings in this context.

      (1) Analysis of iron chelators on ferroptosis:

      In our study, we did not specifically analyze the effects of bi- and trivalent iron chelators on ferroptosis under hypoxia. However, our observations with Deferoxamine (DFO), a well-known iron chelator, provide some insights into how iron chelation may influence ferroptosis in splenic macrophages under hypoxic conditions.

      (2) Effect of DFO on oxidative stress markers:

      Our findings showed that under 1% O2, there was an increase in Malondialdehyde (MDA) content, a marker of lipid peroxidation, and a decrease in Glutathione (GSH) content, indicative of oxidative stress. These changes are consistent with the induction of ferroptosis, which is characterized by increased lipid peroxidation and depletion of antioxidants. Treatment with Ferrostatin-1 (Fer-1) and DFO effectively reversed these alterations. This suggests that DFO, like Fer-1, can mitigate ferroptosis in splenic macrophages under hypoxia, primarily by impacting MDA and GSH levels.

      Author response image 17.

      (3) Potential role of iron chelators in ferroptosis:

      The effectiveness of DFO in reducing markers of ferroptosis indicates that iron availability plays a crucial role in the ferroptotic process under hypoxic conditions. It is plausible that both bi- and trivalent iron chelators could influence ferroptosis, given their ability to modulate iron availability within cells. Since ferroptosis is an iron-dependent form of cell death, chelating iron, irrespective of its valence state, could potentially disrupt the process by limiting the iron necessary for the generation of reactive oxygen species and lipid peroxidation.

      (4) Additional research and manuscript updates:

      Our study highlights the need for further research to explore the differential effects of various iron chelators on ferroptosis, particularly under hypoxic conditions. Such studies could provide a more comprehensive understanding of the role of iron in ferroptosis and the potential therapeutic applications of iron chelators. We have updated our manuscript to include these findings and to discuss the potential implications of iron chelation in the context of ferroptosis under hypoxic conditions. This will provide a broader perspective on our research and its significance in understanding the mechanisms of ferroptosis.

    1. Author response:

      The following is the authors’ response to the previous reviews.

      Reviewer #2 (Public Review):

      Weaknesses:

      There are, however, substantial concerns about the interpretation of the findings and limitations to the current analysis. In particular, analysis of single unit activity is absent, making interpretation of population clusters and decoding less interpretable. These concerns should be addressed to make sure that the results can be interpreted clearly in an active field that already contains a number of confusing and possibly contradictory findings.

      We addressed this important point (which was also made by reviewer #1) in our previous revision. Specifically, we included additional analyses that operate at the level of single units rather than the population level, as requested by the reviewer. For example, we assessed, separately for each recorded neuron, whether there was a statistically significant difference in the magnitude of neural activity between hit and miss trials. This approach allowed us to fully balance the numbers of hit and miss trials at each sound level that were entered into the analysis. The results revealed that a large proportion (close to 50%) of units were task modulated, i.e. had significantly different response magnitudes between hit and miss trials, and that this proportion was not significantly different between lesioned and non-lesioned mice. It is therefore no longer correct to say that “analysis of single unit activity is absent”, and we would be grateful if this statement could be changed.  
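
      The per-unit comparison described above can be sketched roughly as follows. This is a minimal illustration only: the response does not state which statistical test was used, so the permutation test, the trial tuple layout, and the function names `balance_trials` and `permutation_p_value` are all assumptions made for the example.

```python
import random
from collections import defaultdict
from statistics import mean


def balance_trials(trials, rng):
    """Equalize hit and miss counts at each sound level (hypothetical helper).

    trials: list of (sound_level, outcome, response) tuples, where outcome
    is "hit" or "miss" and response is one neuron's response magnitude.
    """
    by_level = defaultdict(lambda: {"hit": [], "miss": []})
    for level, outcome, response in trials:
        by_level[level][outcome].append(response)

    hits, misses = [], []
    for groups in by_level.values():
        # Keep as many trials of each outcome as the rarer outcome provides.
        n = min(len(groups["hit"]), len(groups["miss"]))
        hits += rng.sample(groups["hit"], n)
        misses += rng.sample(groups["miss"], n)
    return hits, misses


def permutation_p_value(hits, misses, rng, n_perm=2000):
    """Two-sided permutation test on the difference of means.

    Assumed test: the source only says the difference was tested for
    statistical significance, not how.
    """
    observed = abs(mean(hits) - mean(misses))
    pooled = hits + misses
    n_hit = len(hits)
    at_least_as_extreme = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)
        diff = abs(mean(pooled[:n_hit]) - mean(pooled[n_hit:]))
        if diff >= observed:
            at_least_as_extreme += 1
    # Add-one correction so the estimated p-value is never exactly zero.
    return (at_least_as_extreme + 1) / (n_perm + 1)


# Made-up responses for one neuron at two sound levels.
rng = random.Random(0)
trials = (
    [(50, "hit", 2.0 + 0.1 * i) for i in range(5)]
    + [(50, "miss", 0.1 * i) for i in range(8)]
    + [(60, "hit", 2.0 + 0.1 * i) for i in range(10)]
    + [(60, "miss", 0.1 * i) for i in range(4)]
)
hits, misses = balance_trials(trials, rng)
p_value = permutation_p_value(hits, misses, rng)
```

      Running this once per recorded unit and counting the units below a chosen significance threshold would give a proportion of task-modulated units, which could then be compared between lesioned and non-lesioned mice.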

      Reviewer #2 (Recommendations For The Authors): 

      The authors have done a good job addressing the main concerns from the previous review. There are a few additional points that hopefully do not require substantial additional edits. 

      Figure 5/supplements. While the authors provide compelling evidence that clusters and overall activity patterns are similar for lesioned and control animals, there do appear to be some differences. For instance, the hit/miss difference for cluster 3 (the "auditory" cluster) appears to be absent for lesioned mice (Fig 5S3 D). Can the hit-miss difference be quantified? 

      We agree that there are some differences between the activity profiles of lesioned and non-lesioned mice: Inspection of panels A and C of Figure 5 – figure supplement 3, for instance, indicates that there is a relatively high proportion of neurons in cluster 3 of the non-lesioned mice that exhibit prolonged elevated activity in hit trials and a relatively lower proportion of those neurons in cluster 3 of lesioned mice. This likely explains the difference in the average response profiles of cluster 3 between the two groups pointed out by the reviewer. Furthermore, there is a slightly larger pre-stimulus dip in hit trial activity for lesioned than non-lesioned mice in cluster 1, a more pronounced short latency peak in hit trial activity for lesioned mice in cluster 2 as well as differences in other clusters. However, these differences are not inconsistent with our interpretation of these data in that we describe the activity profiles as being “similar” and exhibiting a “close correspondence” (rather than as being identical). Having considered this carefully, we do not believe that attempting to quantify these small differences would add much value here or help the reader with the interpretation of these data, especially given that the activity profiles of all neurons that make up each cluster are plotted in panels A and C.  

      Could the mice have been using somatosensory information to perform the task? A wideband click presented from a free-field speaker could have energy in a low frequency range that triggers a whisker response. Given the moderate but not insignificant somatosensory input into the IC shell, this doesn't seem like a trivial concern, and it could substantially impact interpretation of the results. Without wanting to complicate things too much, the authors might consider one or more of these questions: What's the frequency content of the click? Can a deaf mouse perform the task? Can an AC-lesioned mouse learn/perform the task with close-field acoustic stimulation? Or for a high-frequency tone target rather than a click?

      This is an interesting suggestion. We have, in the context of another study, trained mice in our lab to detect somatosensory stimulation (a brush stroke to their whiskers) and consistently found that it takes them much longer (often two weeks or more) to learn to respond to a stimulation of their whiskers than to the presentation of a sound. The brush strokes applied to the whiskers in those experiments were 50-150 ms in duration and were thus orders of magnitude greater in both their duration and amplitude and considerably more salient than any somatosensory stimulus that could potentially arise from the clicks presented here. Therefore, we consider it highly unlikely that mice learned to use somatosensory information potentially picked up by their whiskers to perform the click detection task.  

      L. 63. The authors might want to cite some recent work from the Apostolides lab on the properties of AC-IC projections as well as non-auditory signals in the IC.

      There are two recent papers from the Apostolides lab that are relevant to our study. We already cite Quass et al., 2023. We have now added Ford et al., 2024 as well.

      Changes to manuscript:

      Line 81: “This raises the possibility that these context-dependent effects may be inherited from the auditory cortex (Ford et al., 2024)”.

      L. 220. "sound-responsive neurons" It is possible to report the representation of sound-responsive neurons in the different clusters? This might help tease apart what processes contribute to their respective activity. Not a big problem if the samples can't be registered easily.

      Sound-driven neurons were identified on the basis of a subset (those trials in which sounds were presented at levels from 53 dB SPL to 65 dB SPL) of the trials used for the clustering analysis so the analyses are not directly comparable.

      p. 603. "quieter stimuli" What sound level was actually used in the 2p experiments? Was it fixed at a single level per animal?

      Sound level was not fixed at a single level. A total of nine different sound levels were used per mouse. We apologize that this was not made clear previously.  

      Changes to manuscript:

      Line 603: “Once the mice had achieved a stable level of performance (typically two days with d’ > 1.5), quieter stimuli (41-71 dB SPL) were introduced. For each mouse a total of 9 different sound levels were used and the range of sound levels was adjusted to each animal’s behavioral performance to avoid floor and ceiling effects and could, therefore, differ from mouse to mouse.”

      L. 747. Something is not right with this formula. It appears that it will always reduce to a value of 1/2.

      Thanks for spotting this. There are two typos in this formula. This has been fixed and now reads (line 749):  

    1. Author response:

      The following is the authors’ response to the previous reviews.

      Reviewer #1 (Recommendations for The Authors):

      Q1: In response to reviewers you noted totally 292 sequenced LECs, however in reviewer figure 3 B the numbers seem to add up to 221. Please include mention of the total number of LEC sequences. Please mention line 119, page 4 the total number of explored LEC transcriptomes

      Thank you for your careful review. We have updated Fig 2A, 2C and 2E. Our initial analysis included 242 (not 292) LECs, which contained the d5 post-MI sample from the raw data (E-MTAB-7895). We dropped d5 from our subsequent analysis because the changes at d5 did not differ significantly from those at d3. Therefore, we included 221 LECs in our final analysis, as updated in Fig 2A, 2C and 2E.

      Q2-1: Figure 3A supposedly shows % of LEC subpopulations relative to their numbers found in day 0 samples. However, there seem to be some errors, because for example the subpop LEC Cap I includes 13 cells at day 0 and 6 cells at day 1, which corresponds to 46% of initial numbers. However, from your graph 3B the blue population seems to occupy 10%. Please revise or explain how these relative % were calculated.

      Thank you for your question. In Figure 3A, each column was calculated as dn/d0 × 100%, where d0 is the total number of LECs at day 0 (57 cells): d0 = 57/57 × 100% = 100%, d1 = 21/57 × 100% = 36.84%, d3 = 9/57 × 100% = 15.79%, and likewise for d7, d14, d28... Therefore, Cap I at d0 (13 cells) is 13/57 × 100% = 22.81%, and Cap I at d1 (6 cells) is 6/57 × 100% = 10.53%.
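
      The arithmetic above can be checked with a short script. This is only an illustrative sketch: the function name `percent_of` and the dictionary layout are assumptions, and only the cell counts quoted in this exchange are used. It also shows where the reviewer's 46% comes from, namely dividing by the Cap I count at day 0 (13) rather than by all LECs at day 0 (57).

```python
def percent_of(count, reference):
    """Return count as a percentage of reference, rounded to two decimals."""
    return round(count / reference * 100, 2)


DAY0_TOTAL_LECS = 57  # all LECs at day 0, the denominator used by the authors

# Total LEC counts quoted for the first three time points.
total_lecs = {"d0": 57, "d1": 21, "d3": 9}
# Cap I counts quoted for day 0 and day 1.
cap_i = {"d0": 13, "d1": 6}

total_pct = {day: percent_of(n, DAY0_TOTAL_LECS) for day, n in total_lecs.items()}
cap_i_pct = {day: percent_of(n, DAY0_TOTAL_LECS) for day, n in cap_i.items()}

# The reviewer's figure: Cap I at day 1 relative to Cap I at day 0.
reviewer_pct = percent_of(cap_i["d1"], cap_i["d0"])

print(total_pct)
print(cap_i_pct)
print(reviewer_pct)
```

      The two readings differ only in the denominator, which is why the same 6 cells appear as about 10% in the figure and about 46% in the reviewer's calculation.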

      Q2-2: Further, based on the relative % of LEC subpopulations, using the numbers mentioned in Fig 3B, it would appear that the relative frequency LEC cap II population is actually stable at around 20-30% of all LECs per time point throughout the study (except day 1 drop). This contrasts with line 136 p. 4 statement. I would also urge caution for interpreting too much into the variation of relative levels of LEC co, as these represent exceeding rare cells in your samples, and could reflect technical issues rather than true biological variation (total LEC co numbers analyzed ranging from 1-24 cells/ time point). The same could be said of LEC cap II and cap III.

      We strongly agree with your comment on the proportion of LEC subtypes post-MI. Following your suggestion, we have revised the description of the results on Page 4, lines 137-143, as follows.

      “In the early stages of myocardial infarction (D1 and D3), the quantity of LECs decreased sharply. The number of LECs gradually increasing from day 7 and returning to normal levels by day 14 after MI. Moreover, from day 14 onwards, the number and proportion of Ca I type LECs significantly increased.”

      Q3: Please list in supplement the gene features used to identify in spatial transcriptomics the different LEC subpopulations, as their profiles (notably for capillary LECs) don't appear to be very different based on data in Fig 2F.

      We have supplied gene features in supplementary materials.

      Q4: In section 2.7 you refer to Gal9 secretion. Please replace with expression as no measure of protein levels from LECs has been described in your study.

      Thank you for your suggestion, we have replaced secretion with expression.

      Q5: The updated method to exclude non-lymphatic cells from lymphatic vessel analyses by incorporating pdpn as an additional marker ('present costained areas wherever possible' line 350 p 10)

      Thank you for your correction. We have updated the description as follows and highlighted the changes in the manuscript: rabbit anti-Lyve1 (1:300, ab14917, Abcam, UK), [Syrian hamster anti-Podoplanin (1:100, 53-5381-82, Thermo, USA), rabbit anti-Prox1 (1:300, ab199359, Abcam, UK); both anti-podoplanin and anti-Prox1 are additional markers co-stained with Lyve1 to exclude non-lymphatic cells from lymphatic vessels].

      Q6: Fig 1B, it is highly surprising to see the lymphatic density in the BZ go from 25 um² at day 3 to more than 1000 um² only four days later (day 7). Is it possible that your day 3 measurements were in the infarct area, and not BZ area? The H&E image shown in Fig1a for d3 sample would seem to indicate the analysis was done in a dead area, rather than BZ. Please revise (perhaps select similar zone as shown for d1 in fig 6D, adjusted for subepicardial region and not mid-myocardial as seems to be the case currently), and also provide lymphatic area measures in healthy myocardium for day 0 samples. The unit used (um²) also would depend on the size of the area examined. Is this unit per image? If so please report total imaged area as a reference.

      A6: Thank you for the reminder and advice. We have labeled each zone on the H&E and IF images in Fig1-supplementary Fig2B, and have replaced the histological image taken at 3 days post-MI in Fig1A with a clearer one. Furthermore, as you suggested, we recalculated the lymphatic vessel area as the ratio of the LEC co-stained area to the total imaged area under 100-fold magnification.
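
      As a rough illustration of the ratio described above, the sketch below counts pixels that are positive for both markers and divides by the total number of pixels in the field. The binary masks, the function name `lymphatic_area_ratio`, and the choice of Lyve1 plus podoplanin as the two channels are assumptions made for the example; the response does not say which software or thresholding procedure was used.

```python
def lymphatic_area_ratio(lyve1_mask, pdpn_mask):
    """Fraction of the imaged field that is positive for both markers.

    Both arguments are equally sized 2D lists of booleans (hypothetical
    thresholded images), one entry per pixel.
    """
    total_pixels = 0
    costained_pixels = 0
    for lyve1_row, pdpn_row in zip(lyve1_mask, pdpn_mask):
        for lyve1_positive, pdpn_positive in zip(lyve1_row, pdpn_row):
            total_pixels += 1
            # Only pixels positive in both channels count as lymphatic.
            if lyve1_positive and pdpn_positive:
                costained_pixels += 1
    if total_pixels == 0:
        raise ValueError("empty image")
    return costained_pixels / total_pixels


# Toy 2 x 4 field: three Lyve1+ pixels, of which two are also podoplanin+.
lyve1 = [[True, True, False, False],
         [True, False, False, False]]
pdpn = [[True, False, False, False],
        [True, False, True, False]]
ratio = lymphatic_area_ratio(lyve1, pdpn)
```

      Because the result is normalized by the imaged area, it no longer depends on the size of the field examined, which addresses the unit concern raised in Q6.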

      Q7: The mention that CD68 antibody isn't compatible with lyve1 antibody could easily have been bridged by using other macrophage markers, such as F4/80, which is readily available and often used marker for macs in mice and comes notably as a rat anti-mouse F4-80. It would have added much more relevant information to exclude Lyve1-/F4/80+ cells as compared to the current analysis, which may indeed include in area measures Lyve1+ /Pdpn- single cells erroneously spotted as 'lymphatic vessels'

      Thank you for your excellent suggestion. We co-stained the samples with F4/80 and LYVE1 and have included the result in Fig1-supplementary Figure 1E, as shown in Author response image 1.

      Author response image 1.

      Immunofluorescence (IF) co-staining of tissue sections with F4/80 and LYVE1 in sham and MI mouse models at d3, d7, d14, and d28 post-MI. LYVE1: lymphatic vessel endothelial hyaluronan receptor 1; DAPI: 4′,6-diamidino-2-phenylindole; scale bars: 10×, 100 μm; 40×, 25 μm.

      Reviewer 2 (Recommendations for The Authors):

      Q1: Language expression must be improved. Many incomplete sentences exist throughout the manuscript. A few examples: Line 70-71: In order to further elucidate the effects and regulatory mechanisms of the lymphatic vessels in the repair process of myocardial injury following MI. Line 71-73. This study, integrated single-cell sequencing and spatial transcriptome data from mouse heart tissue at different timepoints after MI from publicly available data (E-MTAB-7895, GSE214611) in the ArrayExpress and gene expression omnibus (GEO) databases. Line 88-89: Since the membrane protein LYVE1 can present lymphatic vessel morphology more clearly than PROX1.

      Thank you for your correction. We have carefully inspected and corrected the whole manuscript.

      Q2: The type of animal models (i.e., permanent MI or MI plus reperfusion) included in Array Express and gene expression omnibus (GEO) databases must be clearly defined as these two models may have completely different effects on lymphatic vessel development during post-MI remodeling.

      Thank you for your excellent suggestion. The animal models used in both E-MTAB-7895 and GSE214611 are permanent MI. We have modified the model information in the methodology section (page 12, line 400-401).

      Q3: Line 119-120: Caution must be taken regarding Cav1 as a lymphocyte marker because Cav1 is expressed in all endothelial cells, not limited to LEC.

      Thank you for the reminder. Cav1 was used in our clustering as one of the marker genes because it is differentially expressed among LEC subtypes, as reported in the article PMID: 31402260.

      Q4: Figure 1 legend needs to be improved. RZ, BZ, and IZ need to be labeled in all IF images. Day 0 images suggest that RZ is the tissue section from the right ventricle.

      Thank you for your suggestion. We have labeled and updated the regions of RZ, BZ, and IZ in H&E and IF image in Figure1-Figure supplement 2B.

      Q5: The discussion section needs to be improved and better focused on the findings from the current study.

      Thank you for your helpful comment. Based on your suggestion, we have revised the first paragraph of the discussion (lines 250-256, Page 7) as follows:

      Cardiac lymphatics play an important role in myocardial edema and inflammation. This study, for the first time, integrated single-cell sequencing data and spatial transcriptome data from mouse heart tissue at different time points of post-MI, and identified four transcriptionally distinct subtypes of LECs and their dynamic transcriptional heterogeneity distribution in different regions of myocardial tissue post-MI. These subgroups of LECs were shown to form different function involved in the inflammation, apoptosis, ferroptosis, and water absorption related regulation of vasopressin during the process of myocardial repair after MI.

    1. Author response:

      The following is the authors’ response to the previous reviews.

      Public Reviews:

      Reviewer #1 (Public review):

      Summary:

      In their manuscript entitled 'The domesticated transposon protein L1TD1 associates with its ancestor L1 ORF1p to promote LINE-1 retrotransposition', Kavaklıoğlu and colleagues delve into the role of L1TD1, an RNA binding protein (RBP) derived from a LINE1 transposon. L1TD1 proves crucial for maintaining pluripotency in embryonic stem cells and is linked to cancer progression in germ cell tumors, yet its precise molecular function remains elusive. Here, the authors uncover an intriguing interaction between L1TD1 and its ancestral LINE-1 retrotransposon.

      The authors delete the DNA methyltransferase DNMT1 in a haploid human cell line (HAP1), inducing widespread DNA hypo-methylation. This hypomethylation prompts abnormal expression of L1TD1. To scrutinize L1TD1's function in a DNMT1 knock-out setting, the authors create DNMT1/L1TD1 double knock-out cell lines (DKO). Curiously, while the loss of global DNA methylation doesn't impede proliferation, additional depletion of L1TD1 leads to DNA damage and apoptosis.

      To unravel the molecular mechanism underpinning L1TD1's protective role in the absence of DNA methylation, the authors dissect L1TD1 complexes in terms of protein and RNA composition. They unveil an association with the LINE-1 transposon protein L1-ORF1 and LINE-1 transcripts, among others.

      Surprisingly, the authors note fewer LINE-1 retro-transposition events in DKO cells compared to DNMT1 KO alone.

      Strengths:

      The authors present compelling data suggesting the interplay of a transposon-derived human RNA binding protein with its ancestral transposable element. Their findings spur interesting questions for cancer types, where LINE1 and L1TD1 are aberrantly expressed.

      Weaknesses:

      Suggestions for refinement:

      The initial experiment, inducing global hypo-methylation by eliminating DNMT1 in HAP1 cells, is intriguing and warrants more detailed description. How many genes experience misregulation or aberrant expression? What phenotypic changes occur in these cells? Why did the authors focus on L1TD1? Providing some of this data would be helpful to understand the rationale behind the thorough analysis of L1TD1.

      The finding that L1TD1/DNMT1 DKO cells exhibit increased apoptosis and DNA damage but decreased L1 retro-transposition is unexpected. Considering the DNA damage associated with retro-transposition and the DNA damage and apoptosis observed in L1TD1/DNMT1 DKO cells, one would anticipate the opposite outcome. Could it be that the observation of fewer transposition-positive colonies stems from the demise of the most transposition-positive colonies? Further exploration of this phenomenon would be intriguing.

      Reviewer #2 (Public review):

      In this study, Kavaklıoğlu et al. investigated and presented evidence for a role for domesticated transposon protein L1TD1 in enabling its ancestral relative, L1 ORF1p, to retrotranspose in HAP1 human tumor cells. The authors provided insight into the molecular function of L1TD1 and shed some clarifying light on previous studies that showed somewhat contradictory outcomes surrounding L1TD1 expression. Here, L1TD1 expression was correlated with L1 activation in a hypomethylation-dependent manner, due to DNMT1 deletion in the HAP1 cell line. The authors then identified L1TD1-associated RNAs using RIPSeq, which display a disconnect between transcript and protein abundance (via Tandem Mass Tag multiplex mass spectrometry analysis). The one exception was L1TD1 itself. This disconnect is consistent with a model in which the RNA transcripts associated with L1TD1 are not directly regulated at the translation level. Instead, the authors found L1TD1 protein associated with L1-RNPs, and this interaction is associated with increased L1 retrotransposition, at least in the context of HAP1 cells. Overall, these results support a model in which L1TD1 is restrained by DNA methylation, but in the absence of this repressive mark, L1TD1 is expressed and collaborates with L1 ORF1p (either directly or through interaction with L1 RNA, which remains unclear based on current results), leading to enhanced L1 retrotransposition. These results establish feasibility of this relationship existing in vivo in either development or disease, or both.

      Comments on revised version:

      In general, the authors did an acceptable job addressing the major concerns throughout the manuscript. This revision is much clearer and has improved in terms of logical progression.

      Recommendations for the authors:

      Reviewer #1 (Recommendations for the authors):

      The authors have addressed all my questions in the revised version of the manuscript.

      Reviewer #2 (Recommendations for the authors):

      Revised comments:

      A few points we'd like to see addressed are our comments about the model (Figure S7C), as this is important for the readership to understand this complex finding. Please try to apply some quantification, if possible (question 8). Please do your best to tone down the direct relationship of these findings to embryology (question 11). Based on both reviewer comments, we believe addressing reviewer #1's "Suggestions for refinement" (2 points) would help us change our view from solid to convincing.

      Responses to changes:

      Major

      (1) The study only used one knockout (KO) cell line generated by CRISPR/Cas9.

      Considering the possibility of an off-target effect, I suggest the authors attempt one or both of these suggestions.

      A)  Generate or acquire a similar DNMT1 deletion that uses distinct sgRNAs, so that the likelihood of off-targets is negligible. A few simple experiments such as qRT-PCR would be sufficient to suggest the same phenotype.

      B)  Confirm the DNMT1 depletion also by siRNA/ASO KD to phenocopy the KO effect.

      (2) In addition to the strategies to demonstrate reproducibility, a rescue experiment restoring DNMT1 to the KO or KD cells would be more convincing. (Partial rescue would suffice in this case, as exact endogenous expression levels may be hard to replicate).

      We have undertaken several approaches to study the effect of DNMT1 loss or inactivation: As described above, we have generated a conditional KO mouse with ablation of DNMT1 in the epidermis. DNMT1-deficient keratinocytes isolated from these mice show a significant increase in L1TD1 expression. In addition, treatment of primary human keratinocytes and two squamous cell carcinoma cell lines with the DNMT inhibitor aza-deoxycytidine led to upregulation of L1TD1 expression. Thus, the derepression of L1TD1 upon loss of DNMT1 expression or activity is not a clonal effect.

      Also, the spectrum of RNAs identified in RIP experiments as L1TD1-associated transcripts in HAP1 DNMT1 KO cells showed a strong overlap with the RNAs isolated by a related yet different method in human embryonic stem cells. When it comes to the effect of L1TD1 on L1-1 retrotransposition, a recent study has reported a similar effect of L1TD1 upon overexpression in HeLa cells [4].

      All of these points together help to convince us that our findings with HAP1 DNMT1 KO cells are in agreement with results obtained in various other cell systems and are therefore not due to off-target effects. With that in mind, we would pursue the suggestion of Reviewer 1 to analyze the effects of DNA hypomethylation upon DNMT1 ablation.

      Thank you for addressing this concern. The reference to Beck 2021 and the additional cell lines (R2: keratinocytes and R3: squamous cell carcinoma) provide sufficient evidence that this result is unlikely to be a result of clonal expansion or off-target effects.

      Question: Was the human ES Cell RIP Experiment shown here? What is the overlap?

      We refer to the recently published study by Jin et al. (PMID: 38165001). As stated in the Discussion, the majority of L1TD1-associated transcripts in HAP1 cells (69%) identified in our study were also reported as L1TD1 targets in hESCs, suggesting a conserved binding affinity of this domesticated transposon protein across different cell types.

      (3) As stated in the introduction, L1TD1 and ORF1p share "sequence resemblance" (Martin 2006). Is the L1TD1 antibody specific or do we see L1 ORF1p if Fig 1C were uncropped?

      (6) Is it possible the L1TD1 antibody binds L1 ORF1p? This could make Figure 2D somewhat difficult to interpret. Some validation of the specificity of the L1TD1 antibody would remove this concern (see minor concern below).

      This is a relevant question. We are convinced that the L1TD1 antibody does not cross-react with L1 ORF1p for the following reasons: Firstly, the antibody does not recognize L1 ORF1p (40 kDa) in the uncropped Western blot for Figure 1C (Figure R4A). Secondly, the L1TD1 antibody gives only background signals in DKO cells in the indirect immunofluorescence experiment shown in Figure 1E of the manuscript.

      Thirdly, the immunogen sequence of L1TD1 that determines the specificity of the antibody was checked in the antibody data sheet from Sigma Aldrich. The corresponding epitope is not present in the L1 ORF1p sequence.

      Finally, we have shown that the ORF1p antibody does not cross-react with L1TD1 (Figure R4B).

      Response: Thank you for sharing these images. These full images relieve concerns about specificity. The increase of ORF1p in R4B and main Figure 3C is interesting and pointed out in the manuscript. Not for the purposes of this review, but the observation of reduced transposition despite increased ORF1p could be an interesting follow-up to this study (combined with the similar UPF1 result, it could indicate a complex of some kind).

      (4) In abstract (P2), the authors mentioned that L1TD1 works as an RNA chaperone, but in the result section (P13), they showed that L1TD1 associates with L1 ORF1p in an RNA independent manner. Those conclusions appear contradictory. Clarification or revision is required.

      Our findings that both proteins bind L1 RNA, and that L1TD1 interacts with ORF1p are compatible with a scenario where L1TD1/ORF1p heteromultimers bind to L1 RNA. The additional presence of L1TD1 might thereby enhance the RNA chaperone function of ORF1p. This model is visualized now in Suppl. Figure S7C.

      Response: Thank you for the model. To further clarify, do you mean that L1TD1 can bind L1 RNA, but this is not needed for the effect, however this "bonus" binding (that is enabled by heteromultimerization) appears to enhance the retrotransposition frequency? Do you think L1TD1 is binding L1 RNA in this context or simply "stabilizing" ORF1P (Trimer) RNP?

      Based on our data, L1TD1 associates with L1 RNA and interacts with L1 ORF1p. Both features might contribute to the enhanced retrotransposition frequency. Interestingly, the L1TD1 protein shares with its ancestor L1 ORF1p the non-canonical RNA recognition motif and the coiled-coil motif required for the trimerization but has two copies instead of one of the C-terminal domain (CTD), a structure with RNA binding and chaperone function. We speculate that the presence of an additional CTD within the L1TD1 protein might thereby enhance the RNA binding and chaperone function of L1TD1/ORF1p heteromultimers.

      (5) Figure 2C fold enrichment for L1TD1 and ARMC1 is a bit difficult to fully appreciate. A 100 to 200-fold enrichment does not seem physiological. This appears to be a "divide by zero" type of result, as the CT for these genes was likely near 40 or undetectable. Another qRT-PCR based approach (absolute quantification) would be a more revealing experiment.

      This is the validation of the RIP experiments and the presentation mode is specifically developed for quantification of RIP assays (Sigma Aldrich RIP-qRT-PCR: Data Analysis Calculation Shell). The unspecific binding of the transcript in the absence of L1TD1 in DNMT1/L1TD1 DKO cells is set to 1 and the value in KO cells represents the specific binding relative to the unspecific binding. The calculation also corrects for potential differences in the abundance of the respective transcript in the two cell lines. This is not a physiological value but the quantification of specific binding of transcripts to L1TD1. GAPDH as negative control shows no enrichment, whereas specifically associated transcripts show strong enrichment. We have explained the details of RIP-qRT-PCR evaluation in Materials and Methods (page 14) and the legend of Figure 2C in the revised manuscript.
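
      The normalisation described in this response can be sketched as follows. This is a minimal illustration, assuming the commonly used input-adjusted ΔΔCt form of RIP-qRT-PCR analysis; the function names, the 10% input fraction, and the Ct values are hypothetical and are not taken from the manuscript or from the Sigma Aldrich calculation shell itself.

```python
import math


def normalized_delta_ct(ct_ip: float, ct_input: float, input_fraction: float) -> float:
    """Ct of the IP relative to the input, with the input scaled to 100%.

    input_fraction is the share of lysate kept as input (assumed, e.g. 0.1).
    """
    adjusted_input_ct = ct_input - math.log2(1.0 / input_fraction)
    return ct_ip - adjusted_input_ct


def fold_enrichment(
    ct_ip_ko: float,
    ct_input_ko: float,
    ct_ip_dko: float,
    ct_input_dko: float,
    input_fraction: float = 0.1,
) -> float:
    """Specific binding in KO cells relative to unspecific binding in DKO cells.

    The DKO immunoprecipitate (no L1TD1) is the negative control and is
    therefore set to 1 by construction. Normalising each IP to its own
    input corrects for different transcript abundance in the two lines.
    """
    delta_ko = normalized_delta_ct(ct_ip_ko, ct_input_ko, input_fraction)
    delta_dko = normalized_delta_ct(ct_ip_dko, ct_input_dko, input_fraction)
    delta_delta_ct = delta_ko - delta_dko
    return 2.0 ** (-delta_delta_ct)


if __name__ == "__main__":
    # Hypothetical Ct values, for illustration only.
    # A transcript that is 7 cycles earlier in the KO IP than in the DKO IP
    # gives an enrichment of about 128-fold.
    print(fold_enrichment(ct_ip_ko=25.0, ct_input_ko=22.0,
                          ct_ip_dko=32.0, ct_input_dko=22.0))
    # A control transcript with identical Ct values in both IPs gives 1.
    print(fold_enrichment(ct_ip_ko=30.0, ct_input_ko=20.0,
                          ct_ip_dko=30.0, ct_input_dko=20.0))
```

      Under these assumptions, a transcript that is barely detectable in the DKO immunoprecipitate (a high Ct) yields a large ratio, so the value reads as a measure of binding specificity and not as a physiological change in expression.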

      Response: Thank you for the clarification and additional information in the manuscript.

      (6) Is it possible the L1TD1 antibody binds L1 ORF1p? This could make Figure 2D somewhat difficult to interpret. Some validation of the specificity of the L1TD1 antibody would remove this concern (see minor concern below).

      See response to (3).

      Response: Thanks.

      (7) Figure S4A and S4B: There appear to be a few unusual aspects of these figures that should be pointed out and addressed. First, there doesn't seem to be any ORF1p in the Input (if there is, the exposure is too low). Second, there might be some L1TD1 in the DKO (lane 2) and lane 3. This could be non-specific, but the size is concerning. Overexposure would help see this.

      The ORF1p IP gives rise to strong ORF1p signals in the immunoprecipitated complexes even after short exposure. Under these conditions ORF1p is hardly detectable in the input. Regarding the faint band in DKO HAP1 cells, this might be due to a technical problem during Western blot loading. Therefore, the input samples were loaded again on a Western blot and analyzed for the presence of ORF1p, L1TD1 and beta-actin (as loading control), and are shown as a separate panel in Suppl. Figure S4A.

      The enhanced image is clearer. Thanks.

      S4A and S4B now appear to be S6A and S6B, is that correct? (This is due to the addition of new S1 and S2, but please verify image orders were not disturbed).

      Yes, the input is shown now as a separate panel in Suppl. Figure S6A.

      (8) Figure S4C: This is related to our previous concerns involving antibody cross-reactivity. Figure 3E partially addresses this, where it looks like the L1TD1 "speckles" outnumber the ORF1p puncta, but overlap with all of them. This might be consistent with the antibody cross-reacting. The western blot (Figure 3C) suggests an upregulation of ORF1p by at least 23x in the DKO, but in the IF image in 3E it is hard to tell if this is the case (slightly more signal, but fewer foci). Can you return to the images and confirm that the contrast settings are comparable? Can you massively overexpose the red channel in 3E to see if there is residual overlap?

      In Figure 3E the L1TD1 antibody gives no signal in DNMT1/L1TD1 DKO cells, confirming that it does not recognize ORF1p. In agreement with the Western blot in Figure 3C, the L1 ORF1p signal in Figure 3E is stronger in DKO cells. In DNMT1 KO cells the L1 ORF1p antibody does not recognize all L1TD1 speckles. This result is in agreement with the Western blot shown above in Figure R4B and indicates that the L1 ORF1p antibody does not recognize the L1TD1 protein. The contrast is comparable and after overexposure there are still L1TD1-specific speckles. This might be due to differences in abundance of the two proteins.

      Response: Suggestion: Would it be possible to use a program like ImageJ to supplement the western blot observation? Qualitatively, in Figure 3E, it appears that there is more signal in the DKO, but this could also be due to there being multiple cells clustered together or a particularly nicely stained region. Could you randomly sample 20-30 cells across a few experiments to see if this holds up? I am interested in whether the puncta in the KO image(s) are a very highly concentrated region and whether in the DKO the signal is more dispersed. Also, the representative DKO seems to be cropped slightly wrong. (Please use puncta as a guide to make the cropping more precise.)

      As suggested by the reviewer, we have quantified the signals of 60 KO cells and 56 DKO cells in three different IF experiments by ImageJ. We measured a 1.4-fold higher expression level of L1 ORF1p in DKO cells. However, the difference is not statistically significant. This is most probably due to the change in cell size and protein content during the cell cycle, with increasing protein content from G1 to G2. Western blot analysis provides signals of comparable protein amounts representing an average expression level over tens of thousands of cells. Nevertheless, the quantification results reflect in principle the IF pictures shown in Figure 3E, but IF is probably not the best method to quantify protein amounts. We have also corrected Figure 3E.

      Author response image 1.
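
      The per-cell comparison described in this response could be summarised along the following lines. This is a sketch only: it assumes that one mean-intensity value per cell has already been exported from ImageJ, and it uses a permutation test on the difference of means as one possible significance test. The source does not state which test the authors used. All function names and the toy intensity values are hypothetical.

```python
import random
from statistics import mean


def fold_change(ko, dko):
    """Ratio of mean per-cell signal in DKO cells to that in KO cells."""
    return mean(dko) / mean(ko)


def permutation_p_value(ko, dko, n_perm=10000, seed=0):
    """Two-sided permutation test on the difference of group means.

    Cell labels are shuffled n_perm times; the p-value is the share of
    shuffles whose absolute mean difference is at least the observed one.
    """
    rng = random.Random(seed)
    observed = abs(mean(dko) - mean(ko))
    pooled = list(ko) + list(dko)
    n_ko = len(ko)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)
        diff = abs(mean(pooled[n_ko:]) - mean(pooled[:n_ko]))
        if diff >= observed:
            hits += 1
    # Add-one correction keeps the estimate strictly above zero.
    return (hits + 1) / (n_perm + 1)


if __name__ == "__main__":
    # Toy per-cell intensities (arbitrary units), not data from the study.
    ko_cells = [10.0, 14.0, 9.0, 22.0, 12.0, 17.0]
    dko_cells = [15.0, 30.0, 11.0, 19.0, 13.0, 26.0]
    print("fold change:", fold_change(ko_cells, dko_cells))
    print("p-value:", permutation_p_value(ko_cells, dko_cells, n_perm=2000))
```

      A wide spread of per-cell values, as expected when cells differ in size and cell-cycle stage, can leave a modest fold change without statistical support even with dozens of cells per group, which matches the argument made in the response.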

      (9) The choice of ARMC1 and YY2 is unclear. What are the criteria for the selection?

      ARMC1 was one of the top hits in a pilot RIP-seq experiment (IP versus input and IP versus IgG IP). In the actual RIP-seq experiment with DKO HAP1 cells instead of IgG IP as a negative control, we found ARMC1 as an enriched hit, although it was not among the top 5 hits. The results from the second RIP-seq further confirmed the validity of ARMC1 as an L1TD1-interacting transcript. YY2 was of potential biological relevance as an L1TD1 target because it is a processed pseudogene originating from YY1 mRNA as a result of retrotransposition. This is mentioned on page 6 of the revised manuscript.

      Response: Appreciated!

      (10) (P16) L1 is the only protein-coding transposon that is active in humans. This is perhaps too generalized of a statement as written. Other examples are readily found in the literature.

      Please clarify.

      We will tone down this statement in the revised manuscript.

      Response: Appreciated! To further clarify, the term "active" when it comes to transposable elements, has not been solidified. It can span "retrotransposition competent" to "transcripts can be recovered". There are quite a few reports of GAG transcripts and protein from various ERV/LTR subfamilies in various cells and tissues (in mouse and human at least), however whether they contribute to new insertions is actively researched.

      (11) In both the abstract and last sentence in the discussion section (P17), embryogenesis is mentioned, but this is not addressed at all in the manuscript. Please refrain from implying normal biological functions based on the results of this study unless appropriate samples are used to support them.

      Much of the published data on L1TD1 function are related to embryonic stem cells [3-7]. Therefore, it is important to discuss our findings in the context of previous reports.

      Response: It is well established that embryonic stem cells are not perfect or direct proxies for the inner cell mass of embryos, as multiple reports have demonstrated transcriptomic, epigenetic, and chromatin accessibility differences. The exact origin of ES cells is also considered controversial. We maintain that embryos/embryogenesis and the results presented in the manuscript are not yet interchangeable. An important exception would be complex models of embryogenesis such as embryoids (or synthetic/artificial embryo models that have been carefully termed as such so as to not suggest direct implications to embryos).

      https://www.nature.com/articles/ncb2965

      https://link.springer.com/article/10.1007/s00018-018-2965-y  

      https://www.cell.com/developmental-cell/abstract/S1534-5807(24)00363-0?_returnURL=https%3A%2F%2Flinkinghub.elsevier.com%2Fretrieve%2Fpii%2FS1534580724003630%3Fshowall%3Dtrue

      We have deleted the corresponding paragraph in the Discussion.

      (12) Figure 3E: The formats of Figures 1A and 3E are internally inconsistent. Please present similar data/images in a cohesive way throughout the manuscript.

      We now show consistent IF figures in the revised manuscript.

      Response: Thanks

      Minor:

      In general:

      Still need checking for typos, mostly in Materials and Methods section; Please keep a consistent writing style throughout the whole manuscript. If you use L1 ORF1p, then please use L1 instead of LINE-1, or if you keep LINE-1 in your manuscript, then you should use LINE-1 ORF1p.

      A lab member from the US checked again the Materials and Methods section for typos. We keep the short version L1 ORF1p.

      (1) Intro:

      - Is L1TD1 in mice and humans? How "conserved" is it and does this suggest function?

      Murine and human L1TD1 proteins share 44% identity on the amino acid level and it was suggested that the corresponding genes were under positive selection during evolution with functions in transposon control and maintenance of pluripotency [8].

      - Why HAP1? (Haploid?) The importance of this cell line is not clear.

      HAP1 is a nearly haploid human cancer cell line derived from the KBM-7 chronic myelogenous leukemia (CML) cell line [9, 10]. Due to its haploidy, it is perfectly suited and widely used for loss-of-function screens and gene editing. After gene editing, cells can be used in the nearly haploid or in the diploid state. We usually perform all experiments with diploid HAP1 cell lines. Importantly, in contrast to other human tumor cell lines, this cell line tolerates ablation of DNMT1. We have included a corresponding explanation in the revised manuscript on page 5, first paragraph.

      - Global methylation status in DNMT1 KO? (Methylations near L1 insertions, for example?)

      The HAP1 DNMT1 KO cell line with a 20 bp deletion in exon 4 used in our study was validated in the study by Smits et al. [11]. The authors report a significant reduction in overall DNA methylation. However, we are not aware of a DNA methylome study on this cell line. We show now data on the methylation of L1 elements in HAP1 cells and upon DNMT1 deletion in the revised manuscript in Suppl. Figure S1B.

      Response: Looks great!

      (2) Figure 1:

      - Figure 1C. Why is LMNB used instead of Actin (Fig1D)?

      We show now beta-actin as loading control in the revised manuscript.

      - Figure 1G shows increased Caspase 3 in KO, while the matching sentence in the result section skips over this. It might be more accurate to mention this and suggest that the single KO has perhaps an intermediate phenotype (Figure 1F shows a slight but not significant trend).

      We fully agree with the reviewer and have changed the sentence on page 6, 2nd paragraph accordingly.

      - Would 96 hrs trend closer to significance? An interpretation is that L1TD1 loss could speed up this negative consequence.

      We thank the reviewer for the suggestion. We have performed a time course experiment with 6 biological replicates for each time point up to 96 hours and found significant changes in viability upon loss of DNMT1 and a further significant reduction in viability upon additional loss of L1TD1 (shown in Figure 1F). These data suggest that, as expected, loss of DNMT1 leads to a significant reduction in viability and that additional ablation of L1TD1 further enhances this effect.

      Response: Looks good!

      - What are the "stringent conditions" used to remove non-specific binders and artifacts (negative control subtraction?)

      Yes, we considered only hits from both analyses, L1TD1 IP in KO versus input and L1TD1 IP in KO versus L1TD1 IP in DKO. This is now explained in more detail in the revised manuscript on page 6, 3rd paragraph.

      (3) Figure 2:

      - Figure 2A is a bit too small to read when printed.

      We have changed this in the revised manuscript.

      - Since WT and DKO lack detectable L1TD1, would you expect any difference in RIP-Seq results between these two?

      Due to the lack of DNMT1 and the resulting DNA hypomethylation, DKO cells are more similar to KO cells than WT cells with respect to the expressed transcripts.

      - Legend says selected dots are in green (it appears blue to me).

      We have changed this in the revised manuscript.

      - Would you recover L1 ORF1p and its binding partners in the KO? (Is the antibody specific in the absence of L1TD1 or can it recognize L1?) I noticed an increase in ORF1p in the KO in Figure 3C.

      Thank you for the suggestion. Yes, L1 ORF1p shows slightly increased expression in the proteome analysis and we have marked the corresponding dot in the Volcano plot (Figure 3A).

      - Should the figure panel reference near the (Rosspopoff & Trono) reference instead be Sup S1C as well? Otherwise, I don't think S1C is mentioned at all.

      - What are the red vs. green dots in 2D? Can you highlight ERV and ALU with different colors?

      We added the reference to Suppl. Figure S1C (now S3C) in the revised manuscript. In Figure 2D L1 elements are highlighted in green, ERV elements in yellow, and other associated transposon transcripts in red.

      Response: Much better, thanks!

      - Which L1 subfamily from Figure 2D is represented in the qRT-PCR in 2E "LINE-1"? Do the primers match a specific L1 subfamily? If so, which?

      We used primers specific for the human L1.2 subfamily.

      - Pulling down SINE element transcripts makes some sense, as many insertions "borrow" L1 sequences for non-autonomous retrotransposition, but can you speculate as to why ERVs are recovered? There should be essentially no overlap in sequence.

      In the L1TD1 evolution paper [8], a potential link between L1TD1 and ERV elements was discussed:

      "Alternatively, L1TD1 in sigmodonts could play a role in genome defense against another element active in these genomes. Indeed, the sigmodontine rodents have a highly active family of ERVs, the mysTR elements [46]. Expansion of this family preceded the death of L1s, but these elements are very active, with 3500 to 7000 species-specific insertions in the L1-extinct species examined [47]. This recent ERV amplification in Sigmodontinae contrasts with the megabats (where L1TD1 has been lost in many species); there are apparently no highly active DNA or RNA elements in megabats [48]. If L1TD1 can suppress retroelements other than L1s, this could explain why the gene is retained in sigmodontine rodents but not in megabats."

      Furthermore, Jin et al. report the binding of L1TD1 to repetitive sequences in transcripts [12]. It is possible that some of these sequences are also present in ERV RNAs.

      Response: Interesting, thanks for sharing

      - Is S2B a screenshot? (the red underline).

      No, it is a PowerPoint figure, and we have removed the red underline.

      (4) Figure 3:

      - Text refers to Figure 3B as a western blot. Figure 3B shows a volcano plot. This is likely 3C but would still be out of order (3A>3C>3B referencing). I think this error is repeated in the last result section.

      - Figure and legends fail to mention what gene was used for ddCT method (actin, gapdh, etc.).

      - In general, the supplemental legends feel underwritten and could benefit from additional explanations. (Main figures are appropriate but please double-check that all statistical tests have been mentioned correctly).

      Thank you for pointing this out. We have corrected these errors in the revised manuscript.

      (5) Discussion:

      - The AluY connection is interesting. Is there an "Alu retrotransposition reporter assay" to test whether L1TD1 enhances this as well?

      Thank you for the suggestion. There is indeed an Alu retrotransposition reporter assay reported by Dewannieux et al. [13]. The assay is based on a Neo selection marker. We have previously tested a Neo selection-based L1 retrotransposition reporter assay, but this system failed to work properly in HAP1 cells; therefore we switched to a blasticidin-based L1 retrotransposition reporter assay. A corresponding blasticidin-based Alu retrotransposition reporter assay might be interesting for future studies (mentioned in the Discussion, page 11, paragraph 4 of the revised manuscript).

      (6) Materials and Methods:

      - The number of typos in the materials and methods is too numerous to list. Instead, please refer to the next section that broadly describes the issues seen throughout the manuscript.

      Writing style

      (1) Keep a consistent style throughout the manuscript: for example, L1 or LINE-1 (also L1 ORF1p or LINE-1 ORF1p); per or "/"; knockout or knock-out; min or minute; 3 times or three times; media or medium. Additionally, as TE naming conventions are not uniform, it is important to maintain internal consistency so as to not accidentally establish an imprecise version.

      (2) There's a period between "et al" and the comma, and "et al." should be italic.

      (3) The authors should explain what the key jargon is when it is first used in the manuscript, such as "retrotransposon" and "retrotransposition".

      (4) The authors should show the full spelling of some acronyms when they use it for the first time, such as RNA Immunoprecipitation (RIP).

      (5) Use a space between numbers and alphabets, such as 5 μg.

      (6) 2.0 × 10^5 cells, that's not an "x".

      (7) Numbers in the reference section are lacking (hard to parse).

      (8) In general, there are a significant number of typos in this draft which at times becomes distracting. For example, (P3) Introduction: Yet, co-option of TEs thorough (not thorough, it should be through) evolution has created so-called domesticated genes beneficial to the gene network in a wide range of organisms. Please carefully revise the entire manuscript for these minor issues that collectively erode the quality of this submission.

      Thank you for pointing out these mistakes. We have corrected them in the revised manuscript. A native speaker from our research group has carefully checked the paper. In summary, we have added Supplementary Figure S7C and have changed Figures 1C, 1E, 1F, 2A, 2D, 3A, 4B, S3A-D, S4B and S6A based on these comments.

      Response: Thank you for taking these comments on board!

    1. Author response:

      The following is the authors’ response to the previous reviews.

      Public Reviews:

      Reviewer #1 (Public Review): 

      Summary: 

      The manuscript focuses on the role of the deubiquitinating enzyme USP-50/USP8 in endosome maturation. The authors aimed to clarify how this enzyme drives the conversion of early endosomes into late endosomes. Overall, they did achieve their aims in shedding light on the precise mechanisms by which USP-50/USP8 regulates endosome maturation. The results support their conclusions that USP-50 acts by disassociating RABX-5 from early endosomes to deactivate RAB-5 and by recruiting SAND-1/Mon1 to activate RAB-7. This work is commendable and will have a significant impact on the field. The methods and data presented here will be useful to the community in advancing our understanding of endosome maturation and identifying potential therapeutic targets for diseases related to endosomal dysfunction. It is worth noting that further investigation is required to fully understand the complexities of endosome maturation. However, the findings presented in this manuscript provide a solid foundation for future studies. 

      We thank this reviewer for the instructive suggestions and encouragement.

      Strengths: 

      The major strengths of this work lie in the well-designed experiments used to examine the effects of USP-50 loss. The authors employed confocal imaging to obtain a picture of the aftermath of the USP-50 loss. Their findings indicated enlarged early endosomes and MVB-like structures in cells deficient in USP-50/USP8. 

      We thank this reviewer for the instructive suggestions and encouragement.

      Weaknesses: 

      Specifically, there is a need for further investigation to accurately characterize the anomalous structures detected in the usp-50 mutant. Also, the correlation between the presence of these abnormal structures and ESCRT-0 is yet to be addressed, and the current working model needs to be revised to prevent any confusion between enlarged early endosomes and MVBs. 

      Excellent suggestions. USP8 has been identified as a protein associated with ESCRT components, which are crucial for endosomal membrane deformation and scission, leading to the formation of intraluminal vesicles (ILVs) within multivesicular bodies (MVBs). In usp-50 mutants, we observed a significant reduction in the punctate signals of HGRS-1::GFP and STAM-1 (Figure 1G and H; and Figure 1-figure supplement 1B), indicating a disruption in ESCRT-0 complex localization (Author response image 1). Additionally, lysosomal structures are markedly reduced in these mutants. In contrast, we found that early endosomes, as marked by FYVE, RAB-5, RABEX5, and EEA1, are significantly enlarged in usp-50 mutants. Electron microscopy (EM) imaging further revealed an increase in large cellular vesicles containing various intraluminal structures. Given the reduction in lysosomal structures and the enlargement of early endosomes in usp-50 mutants, these enlarged vesicles are likely aberrant early endosomes rather than late endosomal or lysosomal structures. To address potential confusion, we have revised the manuscript according to the reviewer's comments and updated the model to accurately reflect these observations.

      Reviewer #2 (Public Review): 

      Summary: 

      In this study, the authors examine how the deubiquitinase USP8 regulates endosome maturation in C. elegans and mammalian cells. The authors have isolated USP8 mutant alleles in C. elegans and used multiple in vivo reporter lines to demonstrate the impact of USP8 loss-of-function on endosome morphology and maturation. They show that in USP8 mutant cells, the early endosomes and MVB-like structures are enlarged while the late endosomes and lysosomal compartments are reduced. They elucidate that USP8 interacts with Rabx5, a guanine nucleotide exchange factor (GEF) for Rab5, and show that USP8 likely targets a specific lysine residue of Rabx5 to dissociate it from early endosomes. They also find that the localization of USP8 to early endosomes is disrupted in Rabx5 mutant cells. They observe that in both Rabx5 and USP8 mutant cells, the Rab7 GEF SAND-1 puncta, which likely represent late endosomes, are diminished, although Rabex5 is accumulated in USP8 mutant cells. The authors provide evidence that USP8 regulates endosomal maturation in a similar fashion in mammalian cells. Based on their observations they propose that USP8 dissociates Rabex5 from early endosomes and enhances the recruitment of SAND-1 to promote endosome maturation. 

      We thank this reviewer for the instructive suggestions and encouragement.

      Strengths: 

      The major highlights of this study include the direct visualization of endosome dynamics in a living multi-cellular organism, C. elegans. The high-quality images provide clear in vivo evidence to support the main conclusions. The authors have generated valuable resources to study mechanisms involved in endosome dynamics regulation in both the worm and mammalian cells, which would benefit many members of the cell biology community. The work identifies a fascinating link between USP8 and the Rab5 guanine nucleotide exchange factor Rabx5, which expands the targets and modes of action of USP8. The findings make a solid contribution toward the understanding of how endosomal trafficking is controlled. 

      We thank this reviewer for the instructive suggestions and encouragement.

      Weaknesses: 

      - The authors utilized multiple fluorescent protein reporters, including those generated by themselves, to label endosomal vesicles. Although these are routine and powerful tools for studying endosomal trafficking, these results cannot tell whether the endogenous proteins (Rab5, Rabex5, Rab7, etc.) are affected in the same fashion. 

      Good suggestion. Indeed, to test whether the endogenous proteins (Rab5, Rabex5, Rab7, etc.) are affected in the same fashion as fluorescent protein reporters, we supplemented our approach with the utilization of endogenous markers. These markers, including Rab5, RAB-5, Rabex5, RABX-5, and EEA1 for early endosomes, as well as RAB-7, Mon1a, and Mon1b for late endosomes, were instrumental in our investigations (refer to Figure 3, Figure 6, Figure 5-figure supplement 1, Figure 5-figure supplement 2, and Figure 6-figure supplement 1). Our comprehensive analysis, employing various methodologies such as tissue-specific fused proteins, CRISPR/Cas9 knock-in, and antibody staining, consistently highlights the critical role of USP8 in early-to-late endosome conversion.

      - The authors clearly demonstrated a link between USP8 and Rabx5, and they showed that cells deficient in both factors displayed similar defects in late endosomes/lysosomes. However, the authors didn't confirm whether and/or to which extent USP8 regulates endosome maturation through Rabx5. Additional genetic and molecular evidence might be required to better support their working model. 

      Excellent point. To test whether USP-50 regulates endosome maturation through RABX-5, we performed additional genetic analyses. In rabx-5(null) mutant animals, the morphology of 2xFYVE-labeled early endosomes is comparable to that of wild-type controls (Figure 4H and I). Introducing the rabx-5(null) mutation into usp-50(xd413) backgrounds resulted in a significant suppression of the enlarged early endosome phenotype characteristic of usp-50(xd413) mutants (Figure 4H and I). These findings suggest that USP-50 may modulate the size of early endosomes through its interaction with RABX-5.

      Reviewer #3 (Public Review): 

      Summary: 

      The authors were trying to elucidate the role of USP8 in the endocytic pathway. Using C. elegans epithelial cells as a model, they observed that when USP8 function is lost, the cells have a decreased number and size of lysosomes. Since USP8 was already known to be a protein linked to ESCRT components, they looked into what role USP8 might play in connecting lysosomes and multivesicular bodies (MVB). They observed fewer ESCRT-associated vesicles but an increased number of "abnormal" enlarged vesicles when USP8 function was lost. At this specific point, it's not clear what the objective of the authors was. What would have been their hypothesis addressing whether the reduced lysosomal structures in USP8 (-) animals were linked to MVB formation? Then they observed that the abnormally enlarged vesicles, marked by the PI3P biosensor YFP-2xFYVE, are bigger but in the same number in USP8 (-) compared to wild-type animals, suggesting homotypic fusion. They confirmed this result by knocking down USP8 in a human cell line, and they observed enlarged vesicles marked by YFP-2xFYVE as well. At this point, there is quite an important issue. The use of YFP-2xFYVE to detect early endosomes requires the transfection of the cells, which has already been demonstrated to produce differences in the distribution, number, and size of PI3P-positive vesicles (doi.org/10.1080/15548627.2017.1341465). The enlarged vesicles marked by YFP-2xFYVE would not necessarily be due to the loss of USP8. In any case, it appears relatively clear that USP8 localizes to early endosomes, and the authors claim that this localization is mediated by Rabex-5 (or Rabx-5). They finally propose that USP8 dissociates Rabx-5 from early endosomes facilitating endosome maturation. 

      Weaknesses: 

      The weaknesses of this study are, on the one hand, that the results are almost exclusively dependent on the overexpression of fusion proteins. While useful in the field, this strategy does not represent the optimal way to dissect a cell biology issue. On the other hand, the way the authors construct the rationale for each approximation is somewhat difficult to follow. Finally, the use of two models, C. elegans and a mammalian cell line, which would strengthen the observations, contributes to the difficulty in reading the manuscript. 

      The findings are useful but do not clearly support the idea that USP8 mediates Rab5-Rab7 exchange and endosome maturation. In contrast, they appear to be incomplete and open new questions regarding the complexity of this process and the precise role of USP8 within it. 

      We thank this reviewer for the insightful comments. Fluorescence-fused proteins serve as potent tools for visualizing subcellular organelles both in vivo and in live settings. Specifically, in epidermal cells of worms, the tissue-specific expression of these fused proteins is indispensable for studying organelle dynamics within living organisms. This approach is necessitated by the inherent limitations of endogenously tagged proteins, whose fluorescence signals are often weak and unsuitable for live imaging or genetic screening purposes. Acknowledging concerns raised by the reviewer regarding potential alterations in organelle morphology due to overexpression of certain fused proteins, we supplemented our approach with the utilization of endogenous markers. These markers, including Rab5, RAB-5, Rabex5, RABX-5, and EEA1 for early endosomes, as well as RAB-7, Mon1a, and Mon1b for late endosomes, were instrumental in our investigations (refer to Figure 3, Figure 6, Figure 5-figure supplement 1, Figure 5-figure supplement 2, and Figure 6-figure supplement 1). Our comprehensive analysis, employing various methodologies such as tissue-specific fused proteins, CRISPR/Cas9 knock-in, and antibody staining, consistently highlights the critical role of USP8 in early-to-late endosome conversion. Specifically, we discovered that the recruitment of USP-50/USP8 to early endosomes depends on Rabex5. However, instead of stabilizing Rabex5, the recruitment of USP-50/USP8 leads to its dissociation from endosomes, concomitantly facilitating the recruitment of the Rab7 GEF SAND-1/Mon1. In cells with loss-of-function mutations in usp-50/usp8, we observed enhanced RABX-5/Rabex5 signaling and mis-localization of SAND-1/Mon1 proteins from endosomes. Consequently, this disruption impairs endolysosomal trafficking, resulting in the accumulation of enlarged vesicles containing various intraluminal contents and rudimentary lysosomal structures.

      Through an unbiased genetic screen, verified by cultured mammalian cell studies, we observed that loss-of-function mutations in usp-50/usp8 result in diminished lysosome/late endosomes. Electron microscopy (EM) analysis indicated that usp-50 mutation leads to abnormally enlarged vesicles containing various intraluminal structures in worm epidermal cells. USP8 is known to regulate the endocytic trafficking and stability of numerous transmembrane proteins. Given that lysosomes receive and degrade materials generated by endocytic pathways, we hypothesized that the abnormally enlarged vesicular structures observed in usp-50 or usp8 mutant cells correspond to the enlarged vesicles coated by early endosome markers. Indeed, in the absence of usp8/usp-50, the endosomal Rab5 signal is enhanced, while early endosomes are significantly enlarged. Given that Rab5 guanine nucleotide exchange factor (GEF), Rabex5, is essential for Rab5 activation, we further investigated its dynamics. Additional analyses conducted in both worm hypodermal cells and cultured mammalian cells revealed an increase of endosomal Rabex5 in response to usp8/usp-50 loss-of-function. Live imaging studies further demonstrated active recruitment of USP8 to newly formed Rab5-positive vesicles, aligning spatiotemporally with Rabex5 regulation. Through systematic exploration of putative USP-50 binding partners on early endosomes, we identified its interaction with Rabex5. Comprehensive genetics and biochemistry experiments demonstrated that USP8 acts through K323 site de-ubiquitination to dissociate Rabex5 from early endosomes and promotes the recruitment of the Rab7 GEF SAND-1/Mon1. In summary, our study began with an unbiased genetic screen and subsequent examination of established theories, leading to the formulation of our own hypothesis. Through multifaceted approaches, we unveiled a novel function of USP8 in early-to-late endosome conversion.

      Recommendations for the authors:

      Reviewer #1 (Recommendations For The Authors):

      (1) Within Figures 1K-N, diverse anomalous structures were detected in the usp-50 mutant. Further scrutiny is needed to definitively characterize these structures, particularly as the images in Figures 1M and 1L exhibit notable similarities to lamellar bodies.

      We thank the reviewer for the insightful question regarding the resemblance between the vesicles observed in our study and lamellar bodies (LBs). Lamellar bodies are specialized organelles involved in lipid storage and secretion (1), prominently studied in keratinocytes of the skin and alveolar type II (ATII) epithelial cells in the lung (2). These organelles contain not only lipids but also cell-type specific proteins and lytic enzymes. Due to their acidic pH and functional similarities, LBs are classified as lysosome-related organelles (LROs) or secretory lysosomes (3, 4). In usp-50 mutants, we observed a considerable number of abnormal vesicles, some of which contain threadlike membrane structures and exhibit morphological similarities to LBs (Figure 2O). However, further analysis with a comprehensive panel of lysosome-related markers demonstrated a significant reduction in lysosomal structures within these mutants. In contrast, vesicles marked by early endosome markers, such as FYVE, RAB-5, RABX-5, and EEA1, were notably enlarged. These results suggest that the enlarged vesicles observed in usp-50 mutants are more likely aberrant early endosomes rather than true lamellar bodies. We have revised the manuscript to reflect these findings and to clearly differentiate between these structures and lysosome-related organelles.

      (2) The correlation between the presence of these abnormal structures and ESCRT-0 remains unaddressed, thus the assertion that USP-50 regulates endolysosome trafficking in conjunction with ESCRT-0 lacks empirical support.

      We thank the reviewer for the valuable suggestions. We apologize for any confusion and appreciate the opportunity to clarify our findings. The ESCRT machinery is essential for driving endosomal membrane deformation and scission, which leads to the formation of intraluminal vesicles (ILVs) within multivesicular bodies (MVBs). Recent research has shown that the absence of ESCRT components results in a reduction of ILVs in worm gut cells (5). In wild type animals, the ESCRT-0 components HGRS-1 and STAM-1 display a distinct punctate distribution (Figure 1G and H). However, in usp-50 mutants, the punctate signals of HGRS-1::GFP and STAM-1::GFP are significantly reduced (Figure 1G and H; and Figure 1-figure supplement 1B), indicating a role for USP-50 in stabilizing the ESCRT-0 complex. Our TEM analysis revealed an accumulation of abnormally enlarged vesicles containing intraluminal structures in usp-50 mutants. When we examined a panel of early endosome and late endosome/lysosome markers, we found that early endosomes are significantly enlarged, while late endosomal/lysosomal structures are markedly reduced in these mutants. This suggests that the abnormal structures observed in usp-50 mutants are likely enlarged early endosomes rather than classical MVBs. To further investigate whether the reduction in ESCRT components contributes to the late endosome/lysosome defects, we analyzed stam-1 mutants. In these mutants, the size of RAB-7-coated vesicles was reduced (Author response image 1C), and the lysosomal marker LAAT-1 indicated a reduction in lysosomal structures (Author response image 1B). These results highlight the importance of the ESCRT complex in late endosome/lysosome formation. However, the morphology of early endosomes, as marked by 2xFYVE, remained similar to that of wild type in stam-1 mutants (Author response image 1A). Therefore, while reduced ESCRT-0 components may contribute to the late endosome/lysosome defects observed in usp-50 mutants, the enlargement of early endosomes in these mutants may involve additional mechanisms. We have revised the manuscript to incorporate these insights and to address the reviewer's comments more comprehensively.

      Author response image 1.

      (A) Confocal fluorescence images of hypodermis expressing YFP::2xFYVE to detect EEs in L4 stage animals in wild type and stam-1(ok406) mutants. Scale bar: 5 μm. (B) Confocal fluorescence images of hypodermal cell 7 (hyp7) expressing the LAAT-1::GFP marker to highlight lysosome structures in 3-day-old adult animals. Compared to wild type, LAAT-1::GFP signal is reduced in stam-1(ok406) animals. Scale bar, 5 μm. (C) The reduction of punctate endogenous GFP::RAB-7 signals in stam-1(ok406) animals. Scale bar: 10 μm.

      (3) Endosomal dysfunction typically leads to significant alterations in the spatial arrangement of marker proteins across distinct endosomes. In the manuscript, the authors examined the distribution and morphology of early endosomes, multivesicular bodies (MVBs), late endosomes, and lysosomes in a usp-50 deficient background primarily through single-channel confocal imaging. By employing two color images showing RAB-5 and RAB-7, in conjunction with HGRS-1, a more comprehensive picture of the aftermath of USP-50 loss can be obtained.

      Good suggestions. We have conducted a double-labeling analysis to examine the distribution of RAB-5 and RAB-7 in conjunction with HGRS-1. In wild type animals, HGRS-1 exhibits a punctate distribution that is partially co-localized with both RAB-5 and RAB-7. In contrast, in usp-50 mutants, the punctate signal of HGRS-1 is significantly reduced, along with its co-localization with RAB-5 and RAB-7 (Author response image 2). These results suggest that, in the absence of USP-50, the stabilization of ESCRT-0 components on endosomes is compromised.

      Author response image 2.

      ESCRT-0 is adjacent to both early endosomes and late endosomes. (A) Confocal fluorescence images of wild-type and usp-50(xd413) hypodermis at L4 stage co-expressing HGRS-1::GFP (hgrs-1 promoter) and endogenous wrmScarlet::RAB-5. (B) HGRS-1 and RAB-5 puncta were analyzed to produce Manders overlap coefficient M1 (HGRS-1/RAB-5) and M2 (RAB-5/HGRS-1) (N=10). (C) Confocal fluorescence images of wild-type and usp-50(xd413) hypodermis at L4 stage co-expressing HGRS-1::GFP (hgrs-1 promoter) and endogenous wrmScarlet::RAB-7. (D) HGRS-1 and RAB-7 puncta were analyzed to produce Manders overlap coefficient M1 (HGRS-1/RAB-7) and M2 (RAB-7/HGRS-1) (N=10). Scale bar: 10 μm for (A) and (C).
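      For readers unfamiliar with the Manders overlap coefficients reported in panels (B) and (D), the following is a minimal sketch of one common thresholded form of the calculation. It is not the authors' analysis pipeline: the function name `manders_coefficients`, the threshold parameters, and the toy arrays are all assumptions made for illustration, and the software and thresholds actually used are not stated in the response.

```python
import numpy as np


def manders_coefficients(ch1, ch2, thr1=0.0, thr2=0.0):
    """Return thresholded Manders coefficients (M1, M2) for two images.

    M1: fraction of above-threshold channel-1 intensity found in pixels
        where channel 2 is also above its threshold.
    M2: the same, with the channels swapped.
    Thresholds are hypothetical; real analyses choose them per image.
    """
    ch1 = np.asarray(ch1, dtype=float)
    ch2 = np.asarray(ch2, dtype=float)
    if ch1.shape != ch2.shape:
        raise ValueError("both channels must have the same shape")

    mask1 = ch1 > thr1          # pixels counted as signal in channel 1
    mask2 = ch2 > thr2          # pixels counted as signal in channel 2
    both = mask1 & mask2        # pixels with signal in both channels

    total1 = ch1[mask1].sum()
    total2 = ch2[mask2].sum()
    m1 = ch1[both].sum() / total1 if total1 > 0 else float("nan")
    m2 = ch2[both].sum() / total2 if total2 > 0 else float("nan")
    return m1, m2


# Toy 2x2 example (made-up intensities, e.g. HGRS-1 as ch1 and RAB-5 as ch2)
toy_ch1 = np.array([[2.0, 0.0], [2.0, 0.0]])
toy_ch2 = np.array([[0.0, 0.0], [3.0, 3.0]])
toy_m1, toy_m2 = manders_coefficients(toy_ch1, toy_ch2)
```

      Because M1 and M2 are normalized by different channels, they need not be equal, which is why both values are reported for each marker pair.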

      (4) The authors observed enlarged early endosomes in cells depleted of usp-50/usp8, along with enlarged MVB-like structures identified through TEM. The potential identity of these structures as the same organelle could be determined using CLEM.

      We thank the reviewer for the valuable suggestion. Our TEM analysis identified a large number of abnormally enlarged vesicles with various intraluminal structures accumulated in usp-50 mutants. As the reviewer correctly noted, CLEM (correlative light and electron microscopy) would be an ideal approach to further characterize these structures. We have been attempting to implement CLEM in C. elegans for a few years. Given that CLEM relies on fluorescence markers, in this study we focused on two tagged proteins, RAB-5 and RABX-5, which show enlargement in their vesicles in usp-50 mutants. Unfortunately, we encountered significant challenges with this approach, as the GFP-tagged RAB-5 and RABX-5 signals did not survive the electron microscopy procedure. Attempts to align EM sections with the residual GFP signal yielded results that were not convincing. Consequently, we concentrated our analysis on a panel of molecular markers, including 2xFYVE, RAB-5, RABX-5, RAB-7, and LAAT-1. These markers consistently indicated that early endosomes are specifically enlarged in usp-50 mutants, while late endosomal/lysosomal structures are notably reduced. Thus, the abnormal structures identified in usp-50 mutants via TEM are likely to be enlarged early endosomes rather than classical MVBs. We have revised the manuscript to reflect these findings and to clarify this point.

      (5) The working model depicted in Figure 6 Y (right) requires revision, as it has the potential to mislead readers into mistaking enlarged early endosomes for multivesicular bodies (MVBs).

      We thank the reviewer for the excellent suggestion. We have revised the model to clarify that it is the enlarged early endosomes, rather than MVBs, that are observed in usp-50 mutants.

      Reviewer #2 (Recommendations For The Authors):

      (1) Is there any change of Rabx5 protein level in USP8/USP50 mutant cells?

      Good question. In the absence of usp-50/usp8, we indeed observed a noticeable increase in the signal of Rabex5 on endosomes. To determine whether usp-50/usp8 affects the protein level of Rabex5, we investigated the endogenous levels of RABX-5 using the RABX-5::GFP knock-in line. Compared to wild-type controls, we found an elevated protein level of RABX-5::GFP in usp-50(xd413) mutants carrying the knock-in (Author response image 3). This suggests that USP-50 may play a role in the destabilization of RABX-5/Rabex5 in vivo.

      Author response image 3.

      The endogenous RABX-5 protein level is increased in usp-50 mutants. (A) The RABX-5::GFP KI protein level is increased in usp-50(xd413). (B) Quantification of endogenous RABX-5::GFP protein level in wild type and usp-50(xd413) mutant animals.

      (2) It is interesting that "The rabx-5(null) animals are healthy and fertile and do not display obvious morphological or behavioral defects.", which seems contrary to its role in regulating USP8 localization and endosome maturation.

      It has been previously documented that rabx-5 functions redundantly with rme-6, another RAB-5 GEF in C. elegans, to regulate RAB-5 localization in oocytes (6). RNA interference (RNAi) targeting rabx-5 in an rme-6 mutant background results in synthetic lethality, whereas neither rabx-5 nor rme-6 alone is essential for worm viability. RME-6 co-localizes with clathrin-coated pits, while Rabex-5 is localized to early endosomes. Rabex-5 forms a stable complex with Rabaptin-5 and is part of a large EEA1-positive complex on early endosomes, whereas RME-6 does not interact with Rabaptin-5 (RABN-5) or EEA-1. These findings suggest that while RME-6 and RABX-5 may function redundantly, they likely play distinct roles in regulating intracellular trafficking processes. In the absence of RABX-5, USP-50 appears to lose its endosomal localization, although the size of the early endosome remains comparable to that of wild type. This observation contrasts with the phenotype associated with USP-50 loss-of-function, in which the early endosome is notably enlarged. These results suggest that residual USP-50 present in the endosomes is sufficient to maintain its role in the endocytic pathway. Conversely, the complete absence of USP-50 likely disrupts the transition of early endosomes to late endosomes, indicating a crucial role of USP-50 in this conversion process. It is also noteworthy that, although loss-of-function of rabx-5 does not result in obvious changes to early endosomes, increasing the gene expression level of rabx-5/Rabex-5 alone is sufficient to cause enlargement of early endosomes (Author response image 4). Indeed, we observed that loss-of-function mutations in usp-50/usp8 lead to abnormally enlarged early endosomes, accompanied by an enhanced signal of endosomal RABX-5. When the rabx-5(null) mutation was introduced into usp-50 mutant animals, the enlarged early endosome phenotype seen in usp-50 mutants was significantly suppressed (Figure 4H and I). This implies that maintaining a lower level of Rab5 GEF may be crucial for endolysosomal trafficking.

      (3) Does Rabx5 mutation have any impact on early endosomes?

      To address the question, we utilized the CRISPR/Cas9 technique to create a molecular null for rabx-5 (Figure 4E). In the rabx-5(null) mutant animals, we found that the 2xFYVE-labeled early endosomes are indistinguishable from wild type (Figure 4H and 4I). Given that rabx-5 functions redundantly with rme-6, another RAB-5 GEF in C. elegans, it is likely that the regulation of early endosome size involves a cooperative interaction between RABX-5 and RME-6.

      (4) The authors observed a reduction of ESCRT-0 components in USP8 mutant cells, could this contribute to the late endosome/lysosome defects?

      Good suggestion. In wild-type animals, the two ESCRT-0 components, HGRS-1 and STAM-1, exhibit a distinct punctate distribution (Figure 1G and H). However, in usp-50 mutants, the punctate signals of HGRS-1::GFP and STAM-1::GFP are significantly diminished (Figure 1G and H; and Figure 1-figure supplement 1B), which aligns with the role of USP-50 in stabilizing the ESCRT-0 complex. To investigate whether the reduction in ESCRT components might contribute to defects in late endosome/lysosome formation, we examined stam-1 mutants. In stam-1 mutants, we observed a reduction in the size of RAB-7-coated vesicles (Author response image 1). Further, when we introduced the lysosomal marker LAAT-1::GFP into stam-1 mutants, we found a substantial decrease in lysosomal structures compared to wild-type animals (Author response image 1). This suggests that the ESCRT complex is essential for proper late endosome/lysosome formation. In contrast, the morphology of early endosomes, as indicated by the 2xFYVE marker, appeared normal in stam-1 mutants, similar to wild-type animals (Author response image 1). This implies that while a reduction in ESCRT-0 components may contribute to the late endosome/lysosome defects observed in usp-50 mutants, the early endosome enlargement phenotype in usp-50 mutants may involve additional mechanisms.

      (5) Rabx5 is accumulated in USP8 mutant cells, I am very curious about the phenotype of USP8-Rabx5 double mutants. Could over-expression of Rabx5 (wild type or mutant forms) cause any defects?

      Excellent suggestions. To address the question, we employed the CRISPR/Cas9 technique to create a molecular null for rabx-5 (Figure 4E). In the rabx-5(null) mutant animals, we observed that the punctate USP-50::GFP signal became diffusely distributed (Figure 4F and G). This suggests that rabx-5 is necessary for the endosomal localization of USP-50. Interestingly, in rabx-5(null) mutant animals, the 2xFYVE-labeled early endosomes appeared similar to those in wild-type animals (Figure 4H and I). When rabx-5(null) was introduced into usp-50 mutant animals, the enlarged early endosome phenotype observed in usp-50 was significantly suppressed (Figure 4H and I). This finding indicates that usp-50 indeed functions through rabx-5 to regulate early endosome size. Additionally, we constructed strains overexpressing either wild-type or K323R mutant RABX-5. Our results showed that overexpression of wild-type RABX-5 led to early endosome enlargement (as indicated by YFP::2xFYVE labeling) (Author response image 4A, B and D). In contrast, overexpression of the K323R mutant RABX-5 did not result in noticeable early endosome enlargement (Author response image 4A, C and D). Together, these data are consistent with our model that USP-50 may regulate RABX-5 by deubiquitinating the K323 site.

      Author response image 4.

      (A-C) Overexpression of wild-type RABX-5 causes enlarged EEs (labeled by YFP::2xFYVE), while the RABX-5(K323R) mutant form does not. (D) Quantification of the volume of individual YFP::2xFYVE vesicles. Data are presented as mean ± SEM. ****P<0.0001. ns, not significant. One-way ANOVA with Tukey’s test.

      (6) Rabx5 could be ubiquitinated at K88 and K323, and Rabx5-K323R showed different activity when compared with the wild-type protein in USP8 mutant cells. Could the authors provide evidence that USP8 could remove the ubiquitin modification from K323 in Rabx5 protein?

      We appreciate the reviewer's insightful suggestions. To explore the potential of USP-50 in removing ubiquitin modifications from lysine 323 on the RABX-5 protein, we undertook a series of experiments. Initially, we sought to determine whether USP-50 influences the ubiquitination level of RABX-5 in vivo. However, due to the low expression levels of USP-50, we encountered challenges in obtaining adequate amounts of USP-50 protein from worm lysates. To overcome this, we expressed USP-50::4xFLAG in HEK293 cells for subsequent affinity purification. Concurrently, we utilized anti-GFP agarose beads to purify RABX-5::GFP from worms expressing the rabx-5::gfp construct. We then incubated RABX-5::GFP with USP-50::4xFLAG for varying durations and performed immunoblotting with an anti-ubiquitin antibody. As shown in Author response image 5A, our results revealed a decrease in the ubiquitination level of RABX-5 in the presence of USP-50, suggesting that USP-50 directly deubiquitinates RABX-5. Previous studies have indicated that only a minor fraction of recombinant RABX-5 undergoes ubiquitination in HeLa cells, which is believed to have functional significance (7). Our findings are consistent with this observation, as only a small fraction of RABX-5 in worms is ubiquitinated. Rabex-5 is known to interact with both K63- and K48-linked poly-ubiquitin chains. To further elucidate whether USP-50 specifically targets K48 or K63-linked ubiquitination at the K323 site of RABX-5, we incubated various HA-tagged ubiquitin mutants with either wild-type or K323R mutant RABX-5 protein. Our results indicated that the K323R mutation reduces K63-linked ubiquitination of RABX-5 (Author response image 5). This experiment was repeated multiple times with consistent results. Additionally, while overexpression of wild-type RABX-5 led to an enlargement of early endosomes, as evidenced by YFP::2xFYVE labeling, overexpression of the K323R mutant did not produce a noticeable effect on endosome size (Author response image 4). Collectively, these findings indicate that RABX-5 is subject to ubiquitin modification in vivo and that USP-50 plays a significant role in regulating this modification at the K323 site.

      Author response image 5.

      (A) RABX-5::GFP protein was purified from worm lysates using anti-GFP antibody. FLAG-tagged USP-50 was purified from HEK293T cells using anti-FLAG antibody. Purified RABX-5::GFP was incubated with USP-50::4xFLAG for the indicated times (0, 15, 30, 60 min), followed by immunoblotting using antibodies against ubiquitin, FLAG or GFP. In the presence of USP-50::4xFLAG, the ubiquitination level of RABX-5::GFP is decreased. (B) Quantification of RABX-5::GFP ubiquitination level from three independent experiments. (C) HEK293T cells were transfected with HA-Ub or the indicated mutants and 4xFLAG-tagged RABX-5 or RABX-5 K323R mutant for 48 h. The cells were subjected to pull-down using FLAG beads, followed by immunoblotting using antibodies against HA or FLAG.

      (7) The authors described "the almost identical phenotype of usp-50/usp8 and sand-1/Mon1 mutants", found protein-protein interaction between USP8 and sand-1, and showed that sand1-GFP signal is diminished in USP8 mutant cells. These observations fit with the possibility that USP8 regulates the stability of sand-1 to promote endosomal maturation. Could this be tested and integrated into the current model?

      We are grateful for the insightful comments provided by the reviewer. Rab5, known to be activated by Rabex-5, plays a crucial role in the homotypic fusion of early endosomes. Rab5 effectors also include the Rab7 GEF SAND-1/Mon1–Ccz1 complex. Rab7 activation by the SAND-1/Mon1-Ccz1 complex is essential for the biogenesis and positioning of late endosomes (LEs) and lysosomes, and for the fusion of endosomes and autophagosomes with lysosomes. The Mon1-Ccz1 complex is able to interact with Rabex5, causing dissociation of Rabex5 from the membrane, which probably terminates the positive feedback loop of Rab5 activation and then promotes the recruitment and activation of Rab7 on endosomes. In our study, we identified an interaction between USP-50 and the Rab5 GEF, RABX-5. In the absence of USP-50, we observed an increased endosomal localization of RABX-5 and the formation of abnormally enlarged early endosomes. This phenotype is reminiscent of that seen in sand-1 loss-of-function mutants, which also exhibit enlarged early endosomes and a concomitant reduction in late endosomes/lysosomes. Notably, USP-50 also interacts with SAND-1, suggesting a potential role in regulating its localization. We could propose several models to elucidate how USP-50 might influence SAND-1 localization, including:

      (1) USP-50 may stabilize SAND-1 through direct de-ubiquitination.

      (2) In the absence of USP-50, the sustained presence of RABX-5 could lead to continuous Rab5 activation, which might hinder or delay the recruitment of SAND-1.

      (3) USP-50 could facilitate SAND-1 recruitment by promoting the dissociation of RABX-5.

      We are actively investigating these models in our laboratory. Due to space constraints, a more detailed exploration of how USP-50 regulates SAND-1 stability will be presented in a separate publication.

      References:

      (1) Schmitz, G., and Müller, G. (1991). Structure and function of lamellar bodies, lipid-protein complexes involved in storage and secretion of cellular lipids. J Lipid Res 32, 1539-1570.

      (2) Dietl, P., and Frick, M. (2021). Channels and Transporters of the Pulmonary Lamellar Body in Health and Disease. Cells-Basel 11. https://doi.org/10.3390/cells11010045.

      (3) Raposo, G., Marks, M.S., and Cutler, D.F. (2007). Lysosome-related organelles: driving post-Golgi compartments into specialisation. Current opinion in cell biology 19, 394-401. https://doi.org/10.1016/j.ceb.2007.05.001.

      (4) Weaver, T.E., Na, C.L., and Stahlman, M. (2002). Biogenesis of lamellar bodies, lysosome-related organelles involved in storage and secretion of pulmonary surfactant. Semin Cell Dev Biol 13, 263-270. https://doi.org/10.1016/s1084952102000551.

      (5) Ott, D.P., Desai, S., Solinger, J.A., Kaech, A., and Spang, A. (2024). Coordination between ESCRT function and Rab conversion during endosome maturation. bioRxiv, 2024.05.14.594104. https://doi.org/10.1101/2024.05.14.594104.

      (6) Sato, M., Sato, K., Fonarev, P., Huang, C.J., Liou, W., and Grant, B.D. (2005). Caenorhabditis elegans RME-6 is a novel regulator of RAB-5 at the clathrin-coated pit. Nature cell biology 7, 559-569. https://doi.org/10.1038/ncb1261.

      (7) Mattera, R., Tsai, Y.C., Weissman, A.M., and Bonifacino, J.S. (2006). The Rab5 guanine nucleotide exchange factor Rabex-5 binds ubiquitin (Ub) and functions as a Ub ligase through an atypical Ub-interacting motif and a zinc finger domain. The Journal of biological chemistry 281, 6874-6883. https://doi.org/10.1074/jbc.M509939200.

    1. Author response:

      The following is the authors’ response to the previous reviews.

      Reviewer #1 (Public Review):

      (Comments on revisions)

      The authors have done a good job at revising the manuscript to put this work into the context of earlier work on brainstem central pattern generators.

      Thank you.

      I still believe the case for the method is not as convincing as it would have been if the method had been validated first on oscillations produced by a known CPG model. Why would the inference of synaptic types from the model CPG voltage oscillations be predetermined? Such inverse problems are quite complicated and their solution is often not unique or sufficiently constrained. Recovering synaptic weights (or CPG parameters) from limited observations of a highly nonlinear system is not warranted (Gutenkunst et al., Universally sloppy parameter sensitivities in systems biology models, PLoS Comp. Biol. 2007; www.doi.org/10.1371/journal.pcbi.0030189) especially when using surrogate biological models like Hodgkin-Huxley models.

      The model of the CPG is irrelevant for such a test of validity because what we reconstruct are postsynaptic conductances of an individual neuron. The network creates a periodic input to this neuron and thus forms a periodic pattern of excitatory and inhibitory conductances. The nature of this input, whether autonomously generated or created artificially (say by periodic optogenetic stimulation), is generally not important. To illustrate this, we used a one-compartment conductance-based (Hodgkin-Huxley style) model neuron incorporating a certain common set of channels (fast sodium (I_NaF), potassium delayed rectifier (I_Kdr), persistent sodium (I_NaP), calcium-dependent potassium (I_KCa), and cationic non-specific current (I_CAN)), as well as excitatory and inhibitory synaptic channels whose conductances were implemented as predefined periodic functions. The test suggested by the reviewer would be to implement a current-step protocol similar to the experiments and apply our technique to see if the reconstructed conductance profiles match those predefined functions. Below we show the reconstruction steps for the following arbitrarily chosen pattern:

      g_EXC(t)/g_LEAK = 0.1(1 + sin(πt)) and g_INH(t)/g_LEAK = 0.1(1 + cos(πt)). Author response image 1 below shows the baseline activity of this model neuron in the absence of the injected current.

      Author response image 1.

      Then we applied a current-step protocol with four steps producing different levels of hyperpolarization and applied our method by calculating the total conductance using linear regression (see the current-voltage plots below) and then decomposing it into the excitatory and inhibitory components.

      Author response image 2.

      As one can see, the reconstructed conductances in Author response image 3 below are nearly identical to their theoretical profiles. This is not surprising because all voltage-dependent currents in the model neuron were inactive in the range of voltages matching our experimental conditions. Therefore, the model could be reduced to just the leak current, synaptic currents and the injected current, which matches precisely the model we used in our manuscript.

      Author response image 3.
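      The reconstruction steps described above can be sketched on the reduced model (leak, synaptic and injected currents only, with the capacitive current neglected). This is a minimal illustration under stated assumptions, not the authors' code: the reversal potentials `E_LEAK`, `E_EXC`, `E_INH`, the unit leak conductance, the four current amplitudes, and the assumption that the leak parameters are known when decomposing are all hypothetical choices. Only the two conductance profiles are taken from the response.

```python
import numpy as np

# Hypothetical parameters (not given in the response)
G_LEAK = 1.0     # leak conductance, arbitrary units
E_LEAK = -65.0   # leak reversal potential, mV
E_EXC = 0.0      # excitatory reversal potential, mV
E_INH = -80.0    # inhibitory reversal potential, mV


def g_exc_true(t):
    """Predefined excitatory profile from the response."""
    return 0.1 * G_LEAK * (1.0 + np.sin(np.pi * t))


def g_inh_true(t):
    """Predefined inhibitory profile from the response."""
    return 0.1 * G_LEAK * (1.0 + np.cos(np.pi * t))


def steady_state_voltage(t, i_inj):
    """Voltage of the reduced model when the capacitive current is neglected."""
    ge, gi = g_exc_true(t), g_inh_true(t)
    return (G_LEAK * E_LEAK + ge * E_EXC + gi * E_INH + i_inj) / (G_LEAK + ge + gi)


def reconstruct(t, currents):
    """Recover g_exc(t) and g_inh(t) from voltages at several current steps."""
    currents = np.asarray(currents, dtype=float)
    # Rows: current steps; columns: time points within the cycle
    v = np.array([steady_state_voltage(t, i) for i in currents])
    g_exc = np.empty(t.size)
    g_inh = np.empty(t.size)
    for k in range(t.size):
        # I = g_tot * V - (gL*EL + gE*EE + gI*EI): slope is the total conductance
        slope, intercept = np.polyfit(v[:, k], currents, 1)
        g_syn = slope - G_LEAK                    # gE + gI
        weighted = -intercept - G_LEAK * E_LEAK   # gE*EE + gI*EI
        g_exc[k] = (weighted - g_syn * E_INH) / (E_EXC - E_INH)
        g_inh[k] = g_syn - g_exc[k]
    return g_exc, g_inh


t = np.linspace(0.0, 2.0, 201)               # one period of sin(pi*t)
steps = [0.0, -5.0, -10.0, -15.0]            # four hypothetical hyperpolarizing steps
g_exc_rec, g_inh_rec = reconstruct(t, steps)
```

      In this noise-free setting the regression is exact, so the recovered profiles coincide with the predefined ones; with recorded data the fit quality at each time point would indicate how well the linear (passive) assumption holds.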

      In p.2, the edited section refers to the interspike interval being much smaller than the period of the network. More important is to mention the relationship between the decay time of inhibitory synapses and the period of the network.

      This interpretation misunderstands the focus of our method. The edited sections (including in the theory section of Results) highlight the conditions under which the capacitive current becomes negligible, emphasizing that the membrane time constant must be much smaller than the network oscillation period. This separation of time scales ensures that the membrane potential adjusts quickly to changes in postsynaptic conductance, rendering the capacitive current insignificant over the network’s rhythm. In contrast, the synaptic decay time governs how presynaptic inputs are transduced into postsynaptic conductances—a process relevant to understanding synaptic dynamics but not directly tied to our method’s core objective. Our approach reconstructs postsynaptic conductances from intracellular recordings, not presynaptic spike trains. While interpreting these conductance profiles in terms of specific synaptic connections would indeed involve synaptic decay dynamics, such an analysis exceeds the scope of our paper. Thus, the condition emphasized in the edited sections—concerning the membrane time constant and network period—is the critical one for our method’s applicability, and the synaptic decay time, while relevant to broader synaptic modeling, does not undermine our conclusions.

      We have added the requirement for a much smaller membrane time constant in the Introduction on page 2. The Results theory section already incorporates an extensive discussion of this requirement.

      Comments from the editors:

      We apologize for the delay in coming to this decision, but there was quite a bit of post-review discussion that needed to be resolved. There are two issues that the reviewers agree should be addressed. They remain unconvinced that the simplifying assumptions of the approach are valid. (1) The main issue with the phase argument is that the biological synaptic conductance depends on time and not on the phase of the respiratory cycle as mentioned in the first round of reviews. The approximation g(t)=g(phase) seems to be far too simple to be biologically realistic.

      As we elaborate below, time and phase are fundamentally and mathematically equivalent representations of the same underlying dynamics in a periodic system, and thus, a phase-based representation—where conductances are expressed as functions of the cycle’s phase—is a justified and effective approach for capturing their behavior. We have added this explanation to the theory section of Results. Below are the bases for our assertion.

      In a periodic system, such as the respiratory CPG, the system’s behavior repeats at regular intervals, defined by a period T. For the respiratory cycle in our experimental preparation, this period is approximately 3–4 seconds, encompassing phases like inspiration, post-inspiration, and expiration. In such systems:

      Time (t) is a continuous variable that progresses linearly.

      Phase (φ) represents the position within one cycle, typically normalized between 0 and 1 (or 0 to 2π in some contexts). It can be mathematically related to time via: φ(t) = (t mod T)/T, where (t mod T) is the time elapsed within the current cycle.

      Because the system is periodic, any variable that repeats with period T—such as synaptic conductance in a rhythmically active network—can be expressed as a function of either time or phase. Specifically, if g(t) is periodic with period T, then g(t) = g(t+T). This periodicity allows us to redefine g(t) in terms of phase: g(t) = g(φ(t)), where φ(t) maps time onto a repeating cycle. Thus, in a periodic system, time and phase are fundamentally equivalent representations of the same underlying dynamics. Saying that synaptic conductance depends on phase is mathematically equivalent to saying it depends on time in a periodic manner.
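      The equivalence between the time and phase representations can be illustrated with a minimal sketch. The sinusoidal conductance profile, its amplitude, and the 3.5 s period are hypothetical values chosen only for illustration (the period falls within the 3–4 s range quoted above); none of them is taken from the recordings.

```python
import math


def phase(t, period):
    """Map time t (s) to a normalized phase in [0, 1): phi(t) = (t mod T) / T."""
    return (t % period) / period


def g_of_phase(phi):
    """Hypothetical conductance profile defined on phase (arbitrary units)."""
    return 1.0 + 0.5 * math.sin(2.0 * math.pi * phi)


def g_of_time(t, period):
    """The same conductance written as a function of time: g(t) = g(phi(t))."""
    return g_of_phase(phase(t, period))


T = 3.5  # s, assumed cycle period for illustration
t = 1.2  # s, arbitrary time point

# Shifting time by whole cycles leaves the conductance unchanged (up to rounding).
print(math.isclose(g_of_time(t, T), g_of_time(t + 2 * T, T), rel_tol=1e-9))
```

      The sketch only restates the definition: any profile that is exactly periodic in time can be evaluated through its phase. It does not address how well a real recording satisfies the periodicity assumption.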

      In a rhythmically active network like the respiratory central pattern generator (CPG), the synaptic conductances, regardless of the specific mechanisms by which they are formed, exhibit periodicity that matches the network’s oscillatory cycle. This occurs because the conductances are driven by the repetitive activity of presynaptic neurons, which are synchronized to the network’s overall rhythm. As a result, the synaptic conductances vary with the same period as the network, making a phase-based representation—where conductances are expressed as functions of the cycle’s phase—a justified and effective approach for capturing their behavior. In our study, we utilized the in situ arterially perfused brainstem-spinal cord preparation from mature rats, which is known to produce a highly periodic respiratory rhythm. To ensure the consistency of this periodicity, we carefully selected recordings where the coefficient of variation of the respiratory cycle period was less than 10%, as outlined in our methods. This strict selection criterion confirms the stability and regularity of the rhythm, supporting the validity of using a phase representation to analyze the synaptic conductances.
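      A selection criterion of this kind can be sketched as follows. The cycle periods are made-up numbers, and the use of the sample standard deviation is an assumption; the response states only that the coefficient of variation had to be below 10%.

```python
from statistics import mean, stdev


def coefficient_of_variation(periods):
    """Sample standard deviation of the cycle periods divided by their mean."""
    return stdev(periods) / mean(periods)


def is_sufficiently_periodic(periods, max_cv=0.10):
    """Apply the 'CV below 10%' criterion to a list of cycle periods (s)."""
    return coefficient_of_variation(periods) < max_cv


# Hypothetical cycle periods in seconds, for illustration only.
regular = [3.4, 3.5, 3.6, 3.5, 3.45]
irregular = [2.0, 4.5, 3.0, 5.5, 2.5]

print(is_sufficiently_periodic(regular))    # passes the criterion
print(is_sufficiently_periodic(irregular))  # fails the criterion
```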

      (2) Figure S1 is problematic. First, the currents injected appear to be infinitesimally small.

      There was a typo in the current units, which should be nA and not pA, as evident from the injected current–membrane potential plots in Figure 1B. Figure S1 has been corrected.

      Second, the input resistance is completely independent of voltage, as though there was little or no contribution from hyperpolarization activated currents, which would be surprising.

      While hyperpolarization-activated currents are indeed present in many neuronal types and could theoretically affect input resistance, our data consistently show linear I-V relationships across the voltage range tested (-60 to -100 mV) for the neurons analyzed (see Figure S1 and Author response images 4–9 below). This linearity suggests that, under our experimental conditions, the contribution of voltage-dependent currents, such as h-currents, is negligible within this range.
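      One way to quantify such linearity is an ordinary least-squares fit of membrane potential against injected current, where the slope gives the input resistance (mV/nA = MΩ) and R² measures the deviation from a straight line. The following is a minimal sketch with invented data points spanning the stated voltage range; it is not the analysis used for Figure S1.

```python
def linear_fit(x, y):
    """Ordinary least-squares line y = slope * x + intercept, plus R^2."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    ss_res = sum((yi - (slope * xi + intercept)) ** 2 for xi, yi in zip(x, y))
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    return slope, intercept, 1.0 - ss_res / ss_tot


# Hypothetical I-V points: injected current (nA) and steady-state voltage (mV).
current_nA = [-0.4, -0.3, -0.2, -0.1, 0.0]
voltage_mV = [-100.0, -90.0, -80.0, -70.0, -60.0]

input_resistance_MOhm, resting_mV, r_squared = linear_fit(current_nA, voltage_mV)
print(input_resistance_MOhm, resting_mV, r_squared)
```

      A voltage-dependent conductance such as an h-current would show up as curvature in the I-V plot and hence as a reduced R² or as a slope that changes with the fitted voltage window.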

      Additionally, we now indicate in the manuscript in the theory section of Results how the presence of significant hyperpolarization-activated h-currents would impact our synaptic conductance reconstruction method. In current-clamp recordings, non-linearity from h-currents could introduce voltage-dependent changes in total conductance unrelated to synaptic inputs, potentially skewing the reconstruction. However, this concern does not apply to voltage-clamp recordings, where the membrane potential is held constant, eliminating contributions from voltage-dependent intrinsic currents. As strong evidence of the minimal influence of h-currents, we directly compared synaptic conductance reconstructions using both current-clamp and voltage-clamp protocols in a subset of neurons. The results from these two approaches were highly consistent, indicating that h-currents do not significantly affect our findings. This robustness across experimental methods reinforces the reliability of our conclusions.

      Together, the linear I-V relationships and the agreement between current- and voltage-clamp reconstructions provide compelling evidence that our method accurately captures synaptic conductances without interference from h-currents.

      Typical examples of I-V relationships for each respiratory neuron firing phenotype:

      Author response image 4.

      ramp-I

      Author response image 5.

      pre-I/I

      Author response image 6.

      post-I

      Author response image 7.

      aug-E

      Author response image 8.

      early-I

      Author response image 9.

      late-I

    1. Author Response

      The following is the authors’ response to the previous reviews.

      We thank the reviewers for their valuable advice and comments. We have made multiple corrections and revisions to the original preprint in response to the following comments:

      1. Pro1153Leu is extremely common in the general population (allele frequency in gnomAD is 0.5). Further discussion is warranted to justify the possibility that this variant contributes to a phenotype documented in 1.5-3% of the population. Is it possible that this variant is tagging other rare SNPs in the COL11A1 locus, and could any of the existing exome sequencing data be mined for rare nonsynonymous variants?

      One possible avenue for future work is to return to any existing exome sequencing data to query for rare variants at the COL11A1 locus. This should be possible for the USA MO case-control cohort. Any rare nonsynonymous variants identified should then be subjected to mutational burden testing, ideally after functional testing to diminish any noise introduced by rare benign variants in both cases and controls. If there is a significant association of rare variation in AIS cases, then they should consider returning to the other cohorts for targeted COL11A1 gene sequencing or whole exome sequencing (whichever approach is easier/less expensive) to demonstrate replication of the association.

      Response: Regarding the genetic association of the common COL11A1 variant rs3753841 (p.(Pro1335Leu)), we do not propose that it is the sole risk variant contributing to the association signal we detected and have clarified this in the manuscript. We concluded that it was worthy of functional testing for the reasons described here. Although there were several common variants in the discovery GWAS within and around COL11A1, none were significantly associated with AIS and none were in linkage disequilibrium (r² > 0.6) with the top SNP rs3753841. We next reviewed rare (MAF ≤ 0.01) coding variants within the COL11A1 LD region of the associated SNP (rs3753841) in 625 available exomes representing 46% of the 1,358 cases from the discovery cohort. The LD block was defined using Haploview based on the 1KG_CEU population. Within the ~41 kb LD region (chr1:103365089-103406616, GRCh37) we found three rare missense mutations in 6 unrelated individuals (Author response table 1). Two of them (NM_080629.2:c.G4093A:p.A1365T; NM_080629.2:c.G3394A:p.G1132S), from two individuals, are predicted to be deleterious based on CADD and GERP scores and are plausible AIS risk candidates. At this rate we could expect to find only 4-5 individuals with linked rare coding variants in the total cohort of 1,358, which collectively are unlikely to explain the overall association signal we detected. Of course, there could also be deep intronic variants contributing to the association that we would not detect by our methods. However, given this scenario, the relatively high predicted deleteriousness of rs3753841 (CADD = 25.7; GERP = 5.75), and its occurrence in a Gly-X-Y triplet repeat, we hypothesized that this variant itself could be a risk allele worthy of further investigation.
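      The extrapolation in this response can be checked with simple arithmetic. The sketch below assumes that "at this rate" refers to the two individuals carrying predicted-deleterious variants among the 625 exomes; that reading is an inference from the quoted figure of 4-5 individuals, not something stated explicitly.

```python
exomes_reviewed = 625               # exomes available from the discovery cohort
discovery_cases = 1358              # total cases in the discovery cohort
carriers_predicted_deleterious = 2  # assumed basis for "at this rate"

# Fraction of discovery cases with exome data.
fraction_sequenced = exomes_reviewed / discovery_cases

# Carrier rate in the exomes, scaled up to the full cohort.
expected_carriers = carriers_predicted_deleterious / exomes_reviewed * discovery_cases

print(f"{fraction_sequenced:.0%}")  # 46%
print(f"{expected_carriers:.1f}")   # 4.3
```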

      Author response table 1.

      We also appreciate the reviewer’s suggestion to perform a rare variant burden analysis of COL11A1. We did conduct a pilot gene-based analysis in 4534 European ancestry exomes, including 797 of our own AIS cases and 3737 controls, and tested the burden of rare variants in COL11A1. The SKAT-O P value was not significant (COL11A1_P=0.18), but this could be due to lack of power and/or background from rare benign variants that could be screened out using the functional testing we have developed.

      2. COL11A1 p.Pro1335Leu is pursued as a direct candidate susceptibility locus, but the functional validation involves both: (a) a complementation assay in mouse GPCs, Figure 5; and (b) cultured rib cartilage cells from Col11a1-Ad5 Cre mice (Figure 4). Please address the following:

      2A. Is Pro1335Leu a loss of function, gain of function, or dominant negative variant? Further rationale for modeling this change in a Col11a1 loss of function cell line would be helpful.

      Response: Regarding functional testing, by knockdown/knockout cell culture experiments, we showed for the first time that Col11a1 negatively regulates Mmp3 expression in cartilage chondrocytes, an AIS-relevant tissue. We then tested the effect of overexpressing the human wt or variant COL11A1 by lentiviral transduction in SV40-transformed chondrocyte cultures. We deleted endogenous mouse Col11a1 by Cre recombination to remove the background of its strong suppressive effects on Mmp3 expression. We acknowledge that Col11a1 missense mutations could confer gain of function or dominant negative effects that would not be revealed in this assay. However, as indicated in our original manuscript, spinal deformity is described in the cho/cho mouse, a Col11a1 loss of function mutant. We also note the recent publication by Rebello et al. showing that missense mutations in Col11a2 associated with congenital scoliosis fail to rescue a vertebral malformation phenotype in a zebrafish col11a2 KO line. Although the connection between AIS and vertebral malformations is not altogether clear, we surmise that loss of the components of collagen type XI disrupts spinal development. In vivo experiments in vertebrate model systems are needed to fully establish the consequences and genetic mechanisms by which COL11A1 variants contribute to an AIS phenotype.

      2B. Expression appears to be augmented compared to WT in Fig 5B, but there is no direct comparison of WT with variant.

      Response: Expression of the mutant (from the lentiviral expression vector) is increased compared to wildtype. We observed this effect in repeated experiments. Sequencing confirmed that the mutant and wildtype constructs differed only at the position of the rs3753841 SNP. At this time, we cannot explain the difference in expression levels. Nonetheless, even when the variant COL11A1 is relatively overexpressed, it fails to suppress MMP3 expression as observed for the wildtype form.

      2C. How do the authors know that their complementation data in Figure 5 are specific? Repetition of this experiment with an alternative common nonsynonymous variant in COL11A1 (such as rs1676486) would be helpful as a comparison with the expectation that it would be similar to WT.

      Response: We agree that testing an allelic series throughout COL11A1 could be informative, but we have shifted our resources toward in vivo experiments that we believe will ultimately be more informative for deciphering the mechanistic role of COL11A1 in MMP3 regulation and spine deformity.

      2D. The y-axes of histograms in panel A need attention and clarification. What is meant by power? Do you mean fold change?

      Response: Power is directly comparable to fold change but allows comparison of absolute expression levels between different genes.

      2E. Figure 5: how many technical and biological replicates? Confirm that these are stated throughout the figures.

      Response: Thank you for pointing out this oversight. This information has been added throughout.

      3. Figure 2: What does the gross anatomy of the IVD look like? Could the authors address this by showing an H&E of an adjacent section of the Fig. 2 A panels?

      Response: Panel 2 shows H&E staining. Perhaps the reviewer is referring to the WT and Pax1 KO images in Figure 3? We have now added H&E staining of WT and Pax1 KO IVD as supplemental Figure 3E to clarify the IVD anatomy.

      4. Page 9: "Cells within the IVD were negative for Pax1 staining ..." There seems to be specific PAX1 expression in many cells within the IVD, which is concerning if this is indeed a supposed null allele of Pax1. This data seems to support that the allele is not null.

      Response: We have now added updated images for the COL11A1 and PAX1 staining to include negative controls in which we omitted primary antibodies. As can be seen, there is faint autofluorescence in the PAX1 negative control that appears to explain the “specific staining” referred to by the reviewer. These images confirm that the allele is truly a null.

      5. There is currently a lack of evidence supporting the claim that "Col11a1 is positively regulated by Pax1 in mouse spine and tail". Therefore, it is necessary to conduct further research to determine the direct regulatory role of Pax1 on Col11a1.

      Response: We agree with the reviewer and have clarified that Pax1 may have either a direct or indirect role in Col11a1 regulation.

      6. There is no data linking loss of COL11A1 function and spine defects in the mouse model. Furthermore, due to the absence of P1335L point mutant mice, it cannot be confirmed whether P1335L can actually cause AIS, and the pathogenicity of this mutation cannot be directly verified. These limitations need to be clearly stated and discussed. A Col11a1 mouse mutant called chondrodysplasia (cho) was shown to be perinatal lethal with severe endochondral defects (https://pubmed.ncbi.nlm.nih.gov/4100752/). This information may help contextualize this study.

      Response: We partially agree with the reviewer. Spine defects are reported in the cho mouse (for example, please see reference 36 Hafez et al). We appreciate the suggestion to cite the original Seegmiller et al 1971 reference and have added it to the manuscript.

      7. A recent article (PMID: 37462524) reported mutations in COL11A2 associated with AIS and functionally tested in zebrafish. That study should be cited and discussed as it is directly relevant for this manuscript.

      Response: We agree with the reviewer that this study provides important information supporting loss of function in type XI collagen in spinal deformity. Language to this effect has been added to the manuscript and this study is now cited in the paper.

      8. Please reconcile the following result on page 10 of the results: "Interestingly, the AIS-associated gene Adgrg6 was amongst the most significantly dysregulated genes in the RNA-seq analysis (Figure 3c). By qRT-PCR analysis, expression of Col11a1, Adgrg6, and Sox6 were significantly reduced in female and male Pax1-/- mice compared to wild-type mice (Figure 3d-g)." In Figure 3f, the downregulation of Adgrg6 appears to be modest so how can it possibly be highlighted as one of the most significantly downregulated transcripts in the RNAseq data?

      Response: By “significant” we were referring to the P-value significance in RNAseq analysis, not in absolute change in expression. This language was clearly confusing, and we have removed it from the manuscript.

      9. It is incorrect to refer to the primary cell culture work as growth plate chondrocytes (GPCs), instead, these are primary costal chondrocyte cultures. These primary cultures have a mixture of chondrocytes at differing levels of differentiation, which may change differentiation status during the culturing on plastic. In sum, these cells are at best chondrocytes, and not specifically growth plate chondrocytes. This needs to be corrected in the abstract and throughout the manuscript. Moreover, on page 11 these cells are referred to as costal cartilage, which is confusing to the reader.

      Response: Thank you for pointing out these inconsistencies. We have changed the manuscript to say “costal chondrocytes” throughout.

      Minor points

      • On page 10 of the Results: "These data support a mechanistic link between Pax1 and Col11a1, and the AIS-associated genes Gpr126 and Sox6, in affected tissue of the developing tail." qRT-PCR validation of Sox6, although significant, appears to be very modestly downregulated in KO. Please soften this statement in the text.

      Response: We have softened this statement.

      • Have you got any information about how the immortalized (SV40) costal cartilage affected chondrogenic differentiation? The expression of SV40 seemed to stimulate Mmp13 expression. Do these cells still make cartilage nodules? Some feedback on this process and how it affects the nature of the culture would be appreciated.

      Response: The “+ or –” in Figure 5 refers to Ad5-cre. Each experiment was performed in SV40-immortalized costal chondrocytes. We have removed SV40 from the figure and have clarified the legend to say “qRT-PCR of human COL11A1 and endogenous mouse Mmp3 in SV40 immortalized mouse costal chondrocytes transduced with the lentiviral vector only (lanes 1,2), human WT COL11A1 (lane 3), or COL11A1 P1335L.” Otherwise, we absolutely agree that understanding Mmp13 regulation during chondrocyte differentiation is important. We plan to study this using in vivo systems.

      • Figure 1: is the average Odds ratio, can this be stated in the figure legend?

      Response: We are not sure what is being asked here. The “combined odds ratio” is calculated as a weighted average of the log of the odds.
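      For readers unfamiliar with this calculation, a weighted average of log odds ratios is commonly computed with inverse-variance weights, as in a fixed-effect meta-analysis. The sketch below assumes that weighting scheme; the response does not state which weights were used, and the cohort values shown are hypothetical.

```python
import math


def combined_odds_ratio(odds_ratios, standard_errors):
    """Pool odds ratios as an inverse-variance weighted mean of their logs.

    standard_errors are the standard errors of the log odds ratios.
    The weighting scheme is an assumption (fixed-effect meta-analysis).
    """
    weights = [1.0 / se ** 2 for se in standard_errors]
    log_ors = [math.log(oratio) for oratio in odds_ratios]
    pooled_log_or = sum(w * l for w, l in zip(weights, log_ors)) / sum(weights)
    return math.exp(pooled_log_or)


# Hypothetical per-cohort odds ratios and standard errors of log(OR).
cohort_ors = [1.25, 1.10, 1.30]
cohort_ses = [0.05, 0.10, 0.08]

print(round(combined_odds_ratio(cohort_ors, cohort_ses), 3))
```

      With equal weights the pooled value reduces to the geometric mean of the cohort odds ratios, which is why the combined estimate is not a simple arithmetic average.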

      • A more consistent use of established nomenclature for mouse versus human genes and proteins is needed.

      Human:GENE/PROTEIN Mouse: Gene/PROTEIN

      Response: Thank you for pointing this out. The nomenclature has been corrected throughout the manuscript.

      • There is no Figure 5c, but a reference to results in the main text. Please reconcile. There is no Figure 5-figure supplement 5a, but there is a reference to it in the main text. Please reconcile.

      Response: Figure references have been corrected.

      • Please indicate dilutions of all antibodies used when listed in the methods.

      Response: Antibody dilutions have been added where missing.

      • On page 25, there is a partial sentence missing information in the Histologic methods; "#S36964 Invitrogen, CA, USA)). All images were taken..."

      Response: We apologize for the error. It has been removed.

      • Table 1: please define all acronyms, including cohort names.

      Response: We apologize for the oversight. The legend to the Table has been updated with definitions of all acronyms.

      • Figure 2: Indicate that blue staining is DAPI in panel B. Clarify that "-ab" as an abbreviation is primary antibody negative.

      Response: A color code for DAPI and COL11A1 staining has been added and “-ab” is now defined.

      • Page 4: ADGRG6 (also known as GPR126)...the authors set this up for ADGRG6 but then use GPR126 in the manuscript, which is confusing. For clarity, please use the gene name Adgrg6 consistently, rather than alternating with Gpr126.

      Response: Thank you for pointing this out. GPR126 has now been changed to ADGRG6 throughout the manuscript.

      • REF 4: Richards, B.S., Sucato, D.J., Johnston C.E. Scoliosis, (Elsevier, 2020). Is this a book, can you provide more clarity in the Reference listing?

      Response: Thank you for pointing this out. This reference has been corrected.

      • While isolation was addressed, the methods for culturing Rat cartilage endplate and costal chondrocytes are poorly described and should be given more text.

      Response: Details about the cartilage endplate and costal chondrocyte isolation and culture have been added to the Methods.

      • Page 11: 1st paragraph, last sentence "These results suggest that Mmp3 expression"... this sentence needs attention. As written, I am not clear what the authors are trying to say.

      Response: This sentence has been clarified and now reads “These results suggest that Mmp3 expression is negatively regulated by Col11a1 in mouse costal chondrocytes.”

      • Page 13: line 4 from the bottom, "ECM-clearing"? This is confusing do you mean ECM degrading?

      Response: Yes and thank you. We have changed to “ECM-degrading”.

      • Please use version numbers for RefSeq IDs: e.g. NM_080629.3 instead of NM_080629

      Response: This change has been made in the revised manuscript.

      • It would be helpful for readers if the ethnicity of the discovery case cohort was clearly stated as European ancestry in the Results main text.

      Response: “European ancestry” has been added at first description of the discovery cohort in the manuscript.

      • Avoid using the term "mutation" and use "variant" instead.

      Response: Thank you for pointing this out. “Variant” is now used throughout the manuscript.

      • Define error bars for all bar charts throughout and include individual data points overlaid onto bars.

      Response: Thank you. Error bars are now clarified in the Figure legends.

    1. Author response:

      The following is the authors’ response to the previous reviews.

      Reviewer #2 (Public review):

      Summary:

      The authors reported that mutations were identified in the ZC3H11A gene in four adolescents from 1015 high myopia subjects in their myopia cohort. They further generated Zc3h11a knockout mice utilizing the CRISPR/Cas9 technology.

      Comments on revisions:

      Chong Chen and colleagues revised the manuscript; however, none of my suggestions from the initial review have been sufficiently addressed.

      (1) I indicated that the pathogenicity and novelty of the mutation need to be determined according to established guidelines and databases. However, the conclusion was still drawn without sufficient justification.

      Thank you for your valuable feedback on the assessment of mutation pathogenicity and novelty. We regret to inform you that the complete familial genetic information required for segregation analysis is currently unavailable in this study. Despite our exhaustive efforts to contact the four mutation carriers and their relatives, we encountered the following uncontrollable limitations: two patients could not be further traced due to invalid contact information; one patient had relocated to another region, making sample collection logistically unfeasible; and the remaining patient explicitly declined family participation in genetic testing due to privacy concerns.

      We fully acknowledge that the lack of pedigree data may affect the certainty of pathogenicity evaluation. To address this limitation, we systematically analyzed the four ZC3H11A missense mutations (c.412G>A p.V138I, c.128G>A p.G43E, c.461C>T p.P154L, and c.2239T>A p.S747T) based on ACMG guidelines and database evidence. The key findings are summarized below. All of the identified mutations were either very rare in or absent from the Genome Aggregation Database (gnomAD) and ClinVar, and most were predicted to be highly pathogenic by the prediction software SIFT, PolyPhen2, and CADD. Among them, c.412G>A, c.128G>A and c.461C>T were located in or around a domain named zf-CCCH_3 (Figure 1A and B). Furthermore, all of the mutation sites were located at amino acids that are highly conserved across different species (Figure 1C). The four mutations induced higher structural flexibility and altered the negative charge at corresponding sites, potentially disrupting protein-RNA interactions (Figure 1D and E). Concurrently, overexpression of mutant constructs (ZC3H11A-V138I, ZC3H11A-G43E, ZC3H11A-P154L, and ZC3H11A-S747T) revealed significantly reduced nuclear IκBα mRNA levels compared to the wild-type, suggesting impaired NF-κB pathway regulation (Supplementary Figure 4). Zc3h11a knockout mice also exhibited a myopic phenotype, with alterations in the PI3K-AKT and NF-κB signaling pathways. Integrating this evidence, the mutations meet the following ACMG criteria: PM1 (domain-located mutations), PM2 (extremely low population frequency), PP3 (computational predictions supporting pathogenicity), and PS3 (functional validation via experimental assays). Under the ACMG framework, these mutations are classified as "Likely Pathogenic".

      Regarding the novelty of this mutation, comprehensive searches in ClinVar, dbSNP, and HGMD databases revealed no prior reports associating this variant with myopia. Similarly, a PubMed literature search identified no direct evidence linking this mutation to myopia. Based on this evidence, we classify this variant as a likely pathogenic and novel mutation.

      On the other hand, we acknowledge that the absence of family segregation data may reduce the confidence in pathogenicity assessment. Nevertheless, functional experiments and converging multi-level evidence strongly support the reliability of our conclusion. Future studies will prioritize family-based validation to strengthen the evidence chain. We sincerely appreciate your attention to this matter and kindly request your understanding of the practical limitations inherent to this research.

      (2) The phenotype of heterozygous mutant mice is too weak to support the gene's contribution to high myopia. The revised manuscript does not adequately address these discrepancies. Furthermore, no explanation was provided for why conditional gene deletion was not used to avoid embryonic lethality, nor was there any discussion on tissue- or cell-specific mechanistic investigations.

      We sincerely appreciate your insightful comments regarding the relationship between murine phenotypes and human disease. We fully acknowledge your concerns about the phenotypic strength of Zc3h11a heterozygous mutant mice and their association with high myopia (HM) pathogenesis. Here we provide point-by-point responses to your valuable comments. Our study demonstrates that Zc3h11a heterozygous mutant mice exhibit myopic refractive phenotypes with upregulated myopia-associated factors (TGF-β1, MMP2, and IL6), although axial elongation did not reach statistical significance at all time points. Notably, at 4 and 6 weeks of age, Het mice did display longer axial lengths and vitreous chamber depths compared to WT mice. While these differences did not reach statistical significance at other time points, an increasing trend was still observed. Two technical considerations may explain these findings: the small size of the murine eye (where a 1 D refractive change corresponds to only 5-6 μm of axial length change) and the theoretical resolution limit of 6 μm for the SD-OCT device used in this study. These factors likely contributed to the marginal statistical significance observed in the subtle changes of vitreous chamber depth and axial length measurements. Additionally, existing research indicates that axial length measurements from frozen sections in age-matched mice tend to be longer than those obtained through in vivo measurements. This phenomenon may reflect species differences between humans and mice: while both show significant refractive power changes, the axial length differences are less pronounced in mice. These results align with previous reports of phenotypic differences between mouse models and human myopia.
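      The measurement argument above can be made concrete with a short calculation. The sketch uses only the two figures quoted in the response (5-6 μm of axial length per diopter and a 6 μm resolution limit); the function name is hypothetical, and the calculation ignores averaging across animals, which can reveal group differences smaller than the single-measurement resolution.

```python
def min_resolvable_diopters(resolution_um, um_per_diopter):
    """Smallest refractive difference whose axial-length equivalent
    equals the stated single-measurement resolution of the device."""
    return resolution_um / um_per_diopter


resolution_um = 6.0  # theoretical SD-OCT resolution quoted in the response

# Axial length change per diopter in the mouse eye, as quoted (5-6 micrometers).
for um_per_diopter in (5.0, 6.0):
    print(um_per_diopter, min_resolvable_diopters(resolution_um, um_per_diopter))
```

      Under these figures, a refractive difference of roughly 1 D between genotypes would sit at the edge of what a single axial length measurement can resolve.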

      To address these issues comprehensively, we have added a dedicated discussion section in the revised manuscript specifically examining these axial length measurement considerations, following your valuable suggestion.

      Additionally, we regret to inform you that the currently available floxed ZC3H11A mouse strain requires a minimum of 12-18 months for custom construction, which exceeds our research timeline due to current resource limitations in our team. To address this gap, we have supplemented the discussion section with additional content regarding tissue- and cell-specific mechanisms. Based on your constructive suggestions, we will prioritize the following in our subsequent work: (1) collaborating with transgenic animal centers to generate Zc3h11a conditional knockout mice; and (2) evaluating the impact of specific knockouts on myopia progression using form-deprivation myopia (FDM) models. While we recognize the limitations of our current study, we believe that by integrating clinical cohort data, phenotypic evidence, and functional experiments, this research provides valuable directional evidence for ZC3H11A's potential role in myopia pathogenesis. Your comments will significantly contribute to improving our future research design, and we sincerely hope you can recognize the exploratory significance of our current findings.

      (3) The title, abstract, and main text continue to misrepresent the role of the inflammatory intracellular PI3K-AKT and NF-κB signaling cascade in inducing high myopia. No specific cell types have been identified as contributors to the phenotype. The mice did not develop high myopia, and no relationship between intracellular signaling and myopia progression has been demonstrated in this study.

      Thank you for your valuable comments regarding the interpretation of signaling pathways in our study. We fully acknowledge your rigorous concerns about the role of PI3K-AKT and NF-κB signaling cascades in high myopia and recognize that we did not identify specific cell types contributing to the observed phenotype. In response to your feedback, we have removed the hypothetical statement linking genetic changes within inflammatory cells to the development of myopia. The current interpretation is strictly based on experimental evidence of pathway relevance and is supported by the theoretical basis presented in the reference, specifically that loss of Zc3h11a leads to activation of the PI3K-AKT and NF-κB pathways in retinal cells, contributing to the myopic phenotype.

      Author response image 1.

      Model of the association between inflammation and myopia progression. Activated mAChR3 (M3R) activates phosphoinositide 3-kinase (PI3K)–AKT and mitogen-activated protein kinase (MAPK) signaling pathways, in turn activating NF-κB and AP1 (i.e., the Jun-Fos heterodimer) and stimulating the expression of the target genes NF-κB, MMP2, TGFβ, IL-1β, IL-6, and TNF-α. MMP2 and TGF-β promote tissue remodeling and TNF-α may act in a paracrine feedback loop in the retina or sclera to activate NF-κB during myopia progression.

      To address the limitations raised, we will prioritize the following in future studies: (1) cell-type-specific knockout models to identify key cellular contributors; and (2) mechanistic investigations to establish causal relationships between signaling pathways and myopia progression. We sincerely appreciate your rigorous review, which has significantly improved the scientific accuracy and clarity of our manuscript. We believe the revised version better reflects both the novelty and limitations of our findings. We kindly request your recognition of the study’s contributions while acknowledging its current constraints.

      Reviewer #3 (Public review):

      Chen et al have identified a new candidate gene for high myopia, ZC3H11A, and using a knock-out mouse model, have attempted to validate it as a myopia gene and explain a potential mechanism. They identified 4 heterozygous missense variants in highly myopic teenagers. These variants are in conserved regions of the protein, and predicted to be damaging, but the only evidence the authors provide that these specific variants affect protein function is a supplement figure showing decreased levels of IκBα after transfection with overexpression plasmids (not specified what type of cells were transfected). This does not prove that these mutations cause loss of function, in fact it implies they have a gain-of-function mechanism. They then created a knock-out mouse. Heterozygotes show myopia at all ages examined but increased axial length only at very early ages. Unfortunately, the authors do not address this point or examine corneal structure in these animals. They show that the mice have decreased B-wave amplitude on electroretinogram (a sign of retinal dysfunction associated with bipolar cells), and decreased expression of a bipolar cell marker, PKCα. On electron microscopy, there are morphologic differences in the outer nuclear layer (where bipolar, amacrine, and horizontal cell bodies reside). Transcriptome analysis identified over 700 differentially expressed genes. The authors chose to focus on the PI3K-AKT and NF-κB signaling pathways and show changes in expression of genes and proteins in those pathways, including PI3K, AKT, IκBα, NF-κB, TGF-β1, MMP-2 and IL-6, although there is very high variability between animals. They propose that myopia may develop in these animals either as a result of visual abnormality (decreased bipolar cell function in the retina) or by alteration of NF-κB signaling. These data provide an interesting new candidate variant for development of high myopia, and provide additional data that MMP2 and IL6 have a role in myopia development. 
      For this revision, none of my previous suggestions have been addressed.

      Reviewer #3 (Recommendations for the authors):

      None of these suggestions were addressed in the revision:

      Major issues:

      (1) Figure 2: refraction is more myopic but axial length is not longer - why is this not discussed and explored? The text claims the axial length is longer, but that is not supported by the figure. If this is a measurement issue, that needs to be discussed in the text.

      We sincerely appreciate your valuable comments regarding the relationship between refractive status and axial length in our study. In response to your concerns, we have conducted an in-depth analysis and would like to address the issues as follows:

      Our data demonstrate significant differences in refractive error between heterozygous (Het) and wild-type (WT) mice from 4 to 10 weeks of age. Notably, at 4 and 6 weeks of age, Het mice did exhibit longer axial lengths and greater vitreous chamber depth compared to WT mice; at other time points these differences did not reach statistical significance, although they still showed an increasing trend. Additional measurements of corneal curvature revealed no significant differences between groups. Considering the small size of mouse eyes (where a 1 D refractive change corresponds to only 5-6 μm of axial length change) and the theoretical resolution limit of 6 μm for the SD-OCT device used in this study, these technical factors may account for the marginal statistical significance of the observed small changes in vitreous chamber depth and axial length measurements. Furthermore, existing studies have shown that axial length measurements from frozen sections tend to be longer than those obtained from in vivo measurements in age-matched mice. These considerations provide plausible explanations for the apparent discrepancy between refractive changes and axial length parameters. Following your suggestion, we have added a dedicated discussion section addressing these axial length measurement issues in the revised manuscript. We fully understand your concerns regarding data consistency, and your comments have prompted us to conduct a more comprehensive and thorough analysis of our results. We believe the revised manuscript now more accurately reflects our findings while providing important technical references for future studies.
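
      The resolution argument above can be checked with simple arithmetic. The following is a minimal sketch, assuming only the two figures quoted in this response (5-6 μm of axial length per diopter and a 6 μm resolution limit); the function names and the example shifts are hypothetical and are not taken from the study's analysis.

```python
# Illustrative arithmetic only. The conversion factor and resolution limit are
# the values quoted in the response; everything else here is hypothetical.
UM_PER_DIOPTER = (5.0, 6.0)   # axial length change per 1 D in the mouse eye (um)
OCT_RESOLUTION_UM = 6.0       # stated theoretical resolution of the SD-OCT device (um)


def expected_axial_change_um(refraction_diff_d):
    """Return the (low, high) axial length change expected for a refractive shift."""
    magnitude = abs(refraction_diff_d)
    return (magnitude * UM_PER_DIOPTER[0], magnitude * UM_PER_DIOPTER[1])


def in_resolution_units(refraction_diff_d):
    """Express the expected change as a multiple of the OCT resolution limit."""
    low, high = expected_axial_change_um(refraction_diff_d)
    return (low / OCT_RESOLUTION_UM, high / OCT_RESOLUTION_UM)


# Hypothetical refractive shifts in diopters, chosen only for illustration.
for shift_d in (1.0, 3.0):
    print(shift_d, expected_axial_change_um(shift_d), in_resolution_units(shift_d))
```

      Under these assumptions, a 1 D difference corresponds to a change at or below one resolution unit, which is consistent with the explanation offered for the weak axial length statistics.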

      (2)  Slipped into the methods is a statement that mice with small eyes or ocular lesions were excluded. How many mice were excluded? Are the authors ignoring another phenotype of these mice?

      We appreciate your attention to the exclusion criteria and their implications. Below we provide a detailed clarification: A total of 7 mice (4 Het-KO and 3 WT) with small eyes or ocular lesions were excluded from the observation cohort. These anomalies were consistent with the baseline incidence of spontaneous malformations observed in historical colony data of wild-type C57BL/6J mice (approximately 11%), and were not attributed to the Zc3h11a heterozygous knockout. We have added the above content in the methods section. Your insightful comment has significantly strengthened our reporting rigor. We hope this clarification alleviates your concerns regarding potential selection bias or overlooked phenotypes.

      Minor/Word choice issues:

      All the figure legends need to be improved so that each figure can be interpreted without having to refer to the text.

      Thank you for your valuable comments. We have made modifications to the legend of each graphic, as detailed in the main text.

      Abstract: line 24: use refraction, not "vision"

      Thank you for your valuable comments. “Vision” has been changed to “refraction”.

      Line 28: re-word "density of bipolar cell-labeled proteins" Do the authors mean density of bipolar cells? Or certain proteins were less abundant in bipolar cells?

      Thank you for your rigorous review of this terminology. We acknowledge the need to clarify the precise meaning of the phrase "density of bipolar cell-labeled proteins." In the original text, this term specifically refers to the expression abundance of the bipolar cell-specific marker protein PKCα, which was identified using immunofluorescence labeling techniques. Specifically: We utilized PKCα (a bipolar cell marker) to label bipolar cell populations. The "density" was quantified by measuring the fluorescence signal intensity per unit area in confocal microscopy images, rather than direct cell counting. This metric reflects changes in the expression of the specific marker protein (PKCα) within bipolar cells, which indirectly correlates with alterations in bipolar cell populations. To address ambiguity, we have revised the terminology throughout the manuscript to "bipolar cell-labelled protein PKCα immunofluorescence abundance".

      Additionally, since fluorescence intensity quantification is inherently semi-quantitative, we have included Western blot results for PKCα in the revised manuscript (Figure 3I, J) to validate the expression changes observed via immunofluorescence. We sincerely appreciate your feedback, which has significantly improved the precision of our manuscript.

      Line 45: axial length, not ocular axis

      Thank you for your valuable comments. “Ocular axis” has been changed to “axial length”.

      Lines73-75: confusing

      Thank you for your valuable comments. The relevant content has been modified to “Multiple zinc finger protein genes (e.g., ZNF644, ZC3H11B, ZFP161, ZENK) are associated with myopia or HM. Of these, ZC3H11B (a human homolog of ZC3H11A) and five GWAS loci (Schippert et al., 2007; Shi et al., 2011; Szczerkowska et al., 2019; Tang et al., 2020; Wang et al., 2004) correlate with AL elongation or HM severity. Proteomic studies further suggest ZC3H11A involvement in the TREX complex, implicating RNA export mechanisms in myopia pathogenesis”

      Line 138: what is dark 3.0 and dark 10.0

      Thank you for your valuable comments. The relevant content has been modified to “Upon dark adaptation, b-wave amplitudes in seven-week-old Het-KO mice were significantly lower at dark 3.0 (0.48 log cd·s/m²) and dark 10.0 (0.98 log cd·s/m²) compared to WT mice.” A detailed description has been added to the main text methods.
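
      The stimulus labels can be related to the logarithmic values by a base-10 conversion. The sketch below assumes that "dark 3.0" and "dark 10.0" name nominal flash strengths in cd·s/m²; that reading, and the helper name, are assumptions rather than statements from the manuscript.

```python
def log_to_linear(log_flash_strength):
    """Convert a flash strength given in log cd·s/m² to linear cd·s/m²."""
    return 10.0 ** log_flash_strength


# Log values quoted in the revised sentence above.
for label, log_value in (("dark 3.0", 0.48), ("dark 10.0", 0.98)):
    print(f"{label}: {log_to_linear(log_value):.2f} cd·s/m²")
```

      The converted values come out close to 3 and 10 cd·s/m², which would explain the labels if the assumption holds.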

      Line 171-175: the GO terms of "biological processes" and "molecular functions" are so broad as to be meaningless.

      Thank you for your valuable comments. The relevant content has been modified to “GO enrichment analysis revealed significant enrichment of differentially expressed genes in the following functions: Zinc ion transmembrane transport (GO:0071577) within metal ion homeostasis, associated with retinal photoreceptor maintenance (Ugarte and Osborne, 2001), RNA biosynthesis and metabolism (GO:0006366) in transcriptional regulation, potentially influencing ocular development, negative regulation of NF-κB signaling (GO:0043124) in inflammatory modulation, a pathway involved in scleral remodelling (Xiao et al., 2025), calcium ion binding (GO:0005509), critical for phototransduction (Krizaj and Copenhagen, 2002), zinc ion transmembrane transporter activity (GO:0005385), participating in retinal zinc homeostasis (Figure 5C and D).”

      Line 257-259: which results indicated loss of Zc3h11a inhibited translocation of IκBα from nucleus to cytoplasm? Results of this study, or the previously referenced study?

      We sincerely appreciate your critical inquiry regarding the mechanistic relationship between Zc3h11a deficiency and IκBα translocation. We are grateful for this opportunity to clarify this important point. The findings regarding Zc3h11a-mediated regulation of IκBα mRNA nuclear export and its impact on NF-κB signaling originate from the study by Darweesh et al. The key experimental evidence demonstrates that the depletion of Zc3h11a leads to nuclear retention of IκBα mRNA, resulting in failure to maintain normal levels of cytoplasmic IκBα mRNA and protein. This defect in IκBα mRNA export disrupts the essential inhibitory feedback loop on NF-κB activity, causing hyperactivation of this pathway. This manifests as upregulation of numerous innate immune-related mRNAs, including IL-6 and a large group of interferon-stimulated genes. While our study references this mechanism to explain the observed NF-κB dysregulation in Zc3h11a Het-KO mice, the specific nuclear export mechanism was indeed elucidated by Darweesh et al. The reference has been inserted into the corresponding position in the main text. Importantly, our research extends these previous molecular insights into the phenotypic context of myopia.

      We sincerely regret any ambiguity in the original text and deeply appreciate your rigorous approach in ensuring proper attribution of these fundamental findings. Your comment has significantly improved the clarity and accuracy of our manuscript.

      Figure 6 shows decrease of both mRNA and protein expression, but nothing about translocation.

      Thank you for your valuable comments. The research results of Darweesh et al. showed that Zc3h11a protein plays a role in the regulation of NF-κB signal transduction. Depletion of Zc3h11a resulted in enhanced NF-κB-mediated signaling, with upregulation of numerous innate immune-related mRNAs, including IL-6 and a large group of interferon-stimulated genes. IL-6 upregulation in the absence of the Zc3h11a protein correlated with increased NF-κB transcription factor binding to the IL-6 promoter and decreased IL-6 mRNA decay. The enhanced NF-κB signaling pathway in Zc3h11a-deficient cells correlated with a defect in IκBα inhibitory mRNA and protein accumulation. Upon Zc3h11a depletion, the IκBα mRNA was retained in the cell nucleus, resulting in failure to maintain normal levels of the cytoplasmic IκBα mRNA and protein that are essential for its inhibitory feedback loop on NF-κB activity. These findings demonstrate that ZC3H11A can regulate the NF-κB pathway by controlling the translocation of IκBα mRNA, a mechanism that was indeed elucidated by Darweesh et al. We sincerely apologize for any lack of clarity in our original description and have now inserted the appropriate reference in the relevant section of the main text.

      We deeply appreciate your valuable comments in identifying this ambiguity in our manuscript, which have significantly improved the accuracy and clarity of our work.

      Line 283: what do you mean "may confer embryonic lethality"? Were they embryonic lethal or not?

      We sincerely appreciate your critical request for clarification. Our experimental data from Zc3h11a Het-KO intercrosses (n = 15 litters) conclusively confirmed the absence of homozygous knockout (Homo-KO) pups at birth. These findings align with the embryonic lethality of Zc3h11a homozygous deletion as reported by Younis et al. We fully acknowledge the ambiguity in our original phrasing and have revised the text to: “Second, Zc3h11a homozygous KO (Homo-KO) mice were not obtained in our study because homozygous deletion of exons confer embryonic lethality.” Your vigilance in ensuring terminological precision has greatly strengthened the rigor of our manuscript. We hope this clarification fully resolves your concerns.
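
      Under Mendelian segregation, one quarter of pups from a Het-KO intercross would be expected to be Homo-KO, so the chance of observing none falls quickly with the number of pups genotyped. A minimal sketch of that calculation follows; the pup counts are hypothetical, since the response reports 15 litters but not the total number of pups.

```python
def prob_no_homozygotes(n_pups, p_homozygous=0.25):
    """Probability that none of n_pups is homozygous under independent segregation."""
    if n_pups < 0:
        raise ValueError("n_pups must be non-negative")
    if not 0.0 <= p_homozygous <= 1.0:
        raise ValueError("p_homozygous must be a probability")
    return (1.0 - p_homozygous) ** n_pups


# Hypothetical totals of genotyped pups; the true count is not given above.
for n in (10, 30, 60):
    print(n, prob_no_homozygotes(n))
```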

      Line 338: What is meant that Het-KO mice were constructed at 4 weeks of age? Do these mice not have a germline mutation?

      Thank you for your valuable comments. We have revised the following content: “The germline heterozygous Zc3h11a knockout (Het-KO) mice were generated by CRISPR/Cas9-mediated gene editing at the embryonic stage on a C57BL/6J background, provided by GemPharmatech Co., Ltd (Nanjing, China). Phenotypic analyses were initiated when the mice reached four weeks of age.”

      Line 346-347: how many mice were excluded due to having small eyes or ocular lesions? The methods section should state how refraction and ocular biometrics were measured.

      Thank you for your valuable comments. We have added or revised the following content: “To exclude potential confounding effects of spontaneous ocular developmental abnormalities, a total of 7 mice (4 Het-KO and 3 WT) with small eyes or ocular lesions were excluded from the observation cohort. These anomalies were consistent with the baseline incidence of spontaneous malformations observed in historical colony data of wild-type C57BL/6J mice (approximately 11%), and were not attributed to the Zc3h11a heterozygous knockout.

      The methods for measuring refraction and ocular biometrics are as follows and have been added to the original method. Refractive measurements were performed by a researcher blinded to the genotypes. Briefly, in a darkroom, mice were gently restrained by tail-holding on a platform facing an eccentric infrared retinoscope (EIR) (Schaeffel et al., 2004; Zhou et al., 2008a). The operator swiftly aligned the mouse position to obtain crisp Purkinje images centered on the pupil using detection software (Schaeffel et al., 2004), enabling axial measurements of refractive state and pupil size. Three repeated measurements per eye were averaged for analysis. The anterior chamber (AC) depth, lens thickness, vitreous chamber (VC) depth, and axial length (AL) of the eye were measured by real-time optical coherence tomography (a custom built OCT) (Zhou et al., 2008b). In simple terms, after anesthesia, each mouse was placed in a cylindrical holder on a positioning stage in front of the optical scanning probe. A video monitoring system was used to observe the eyes during the process. Additionally, by detecting the specular reflection on the corneal apex and the posterior lens apex in the two dimensional OCT image, the optical axis of the mouse eye was aligned with the axis of the probe. Eye dimensions were determined by moving the focal plane with a stepper motor and recording the distance between the interfaces of the eyes. Then, using the designed MATLAB software and appropriate refractive indices, the recorded optical path length was converted into geometric path length. Each eye was scanned three times, and the average value was taken.”
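
      The conversion from optical to geometric path length described above amounts to dividing each segment's optical path by the refractive index of that medium. The following is a minimal sketch in Python rather than the MATLAB software mentioned in the methods; the refractive index values are placeholders chosen for illustration, and the segment names and function names are hypothetical.

```python
# Placeholder refractive indices for illustration only; they are NOT the
# values used in the study, which are not reported in this response.
PLACEHOLDER_INDICES = {
    "cornea": 1.40,
    "anterior_chamber": 1.33,
    "lens": 1.43,
    "vitreous_chamber": 1.33,
}


def geometric_length_um(optical_path_um, refractive_index):
    """Convert an optical path length to a geometric length for one medium."""
    if refractive_index <= 0:
        raise ValueError("refractive_index must be positive")
    return optical_path_um / refractive_index


def total_geometric_length_um(optical_paths_um, indices=PLACEHOLDER_INDICES):
    """Sum the geometric lengths of all supplied segments (keyed by segment name)."""
    return sum(
        geometric_length_um(path, indices[name])
        for name, path in optical_paths_um.items()
    )


def mean_of_scans(values):
    """Average repeated scans of the same eye, as described for three scans per eye."""
    return sum(values) / len(values)
```

      A call such as `total_geometric_length_um({"anterior_chamber": 500.0, "lens": 2800.0})` would then return the summed geometric length of those two segments under the placeholder indices.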

      Line 428: what age retinas

      Thank you for your meticulous attention to the experimental design details. Regarding the age of retinal samples, we have clarified the following in the revised manuscript: "Retinas were harvested from four-week-old mice for RNA sequencing." This revision enhances the transparency and reproducibility of our methodology. We deeply appreciate your rigorous review.

      Figure 3 D-F: these images are too small to adequately assess, please show at higher magnification. Are there fewer bipolar cells, or just decreased expression of PKC? From these images, expression of ZC3H11A does not appear decreased, but the retina appears thinner. Is that true, or are these poorly matched sections?

      Thank you for your professional insights regarding image quality and data interpretation. Your rigorous review has significantly enhanced the scientific rigor of our study. We hereby address your concerns point by point: The images in Figures 3D-F were acquired using a Zeiss LSM880 confocal microscope with a 10x eyepiece and 20x objective lens, a standard magnification for retinal section imaging that balances cellular resolution with full-thickness structural preservation. We quantified PKCα immunofluorescence intensity (a bipolar cell-specific marker) to assess changes in bipolar cell populations, rather than direct cell counting. This metric reflects PKCα expression abundance as a proxy for bipolar cell alterations (Figure 3H). To clarify terminology, we have revised the text to "bipolar cell-labelled protein PKCα immunofluorescence abundance" and detailed the methodology in the revised Methods section. Recognizing the semi-quantitative nature of fluorescence intensity analysis, we supplemented these data with Western blot results confirming reduced PKCα protein levels (Figure 3I). Zc3h11a expression was validated both by immunofluorescence intensity (Figure 3G) and Western blot (Figures 6F, H) quantification, confirming reduced expression in Zc3h11a Het-KO retinas. The apparent "retinal thinning" observed in histology sections stems from technical artifacts during tissue processing (fixation, dehydration, sectioning), not biological differences. HE staining, which better preserves sample morphology, showed no structural or thickness differences between Zc3h11a Het-KO mice and wild-type mice (Supplementary Figure 2).

      Your expert feedback has driven us to establish a more robust validation framework. We believe the revised data now more accurately reflect the biological reality and sincerely hope these improvements meet your approval.

      Figure 3G-J: Relative fluorescence intensity of immunohistochemistry is not a valid measure of protein expression.

      We sincerely appreciate your thorough review and valuable comments regarding the immunofluorescence quantification method in Figures 3G-J. In response to your concern that "relative fluorescence intensity is not an effective quantitative measure of protein expression," we have implemented the following improvements to our analysis and validation: To ensure result reliability, all immunofluorescence experiments followed strict protocols: experimental and control samples were fixed, stained, and imaged in the same batch to eliminate inter-batch variability. Imaging was performed using a Zeiss LSM 880 confocal microscope with identical parameters, and the relative fluorescence intensity of specific signals per unit area was measured and statistically analyzed using ZEN software. We fully acknowledge the semi-quantitative nature of relative fluorescence intensity measurements. Therefore, we validated key differentially expressed proteins using Western blot analysis: The Western blot results for Zc3h11a (Figures 6F, H) were completely consistent with the relative fluorescence intensity trends (Figure 3G). Additionally, the newly included Western blot data for PKCα (Figure 3 I) further confirmed the reliability of our relative fluorescence intensity quantification. Your expert advice has significantly enhanced the rigor of our study. Should any additional data or clarification be required, we would be pleased to provide further support.

      Figure 4: what are the arrows pointing at? This should be in the Figure legend. What is MB? Why are there no scale bars? What is difference between E and F, not clear from legend.

      We sincerely appreciate your thorough review of Figure 4 and your valuable suggestions. In response to your concerns, we have carefully examined and improved the relevant content with the following modifications and clarifications: We sincerely apologize for not clearly indicating the arrow annotations in the original figure legend. In the revised version, we have provided detailed explanations for the arrow indicators: black arrows indicate perinuclear space dilation, blue arrows indicate cytoplasmic edema, and red arrows indicate disorganized and loosely arranged membrane discs. The updated legend has been clearly marked below Figure 4 in the main text. MB represents membrane discs, which are critical subcellular structures in the outer segments of retinal photoreceptor cells (rods and cones). They are responsible for light signal capture and transduction (containing visual pigments such as rhodopsin). The structural integrity of MB is essential for normal visual function. The scale bars in the original figures were located in the lower right corner of each subpanel, with specific parameters as follows: Figures 4A and B: magnification ×1000, scale bar 10 μm; Figures 4C and D: magnification ×700, scale bar 20 μm; Figures 4E and G: magnification ×2000, scale bar 5 μm; Figures 4F and H: magnification ×7000, scale bar 2 μm. Both Figures 4E and 4F show electron microscopy images of membrane discs (MB) in wild-type mouse photoreceptor cells. The only difference lies in the magnification: Figure 4E (×2000) demonstrates the overall arrangement pattern of membrane discs, while Figure 4F (×7000) focuses on ultrastructural details of the membrane discs (such as structural integrity). We have thoroughly checked the consistency between the figures and text, and have supplemented detailed legend descriptions in the main text. Once again, we sincerely appreciate your rigorous review, which has significantly enhanced the scientific rigor and readability of our study. 
      Should you have any further suggestions, we would be happy to incorporate them.

      Figure 5A: Why such a large y-axis? Figure legend does not match figure

      We sincerely appreciate your careful review of Figure 5A and your valuable suggestions regarding the figure details. In response to your concerns, we have thoroughly examined and improved the relevant content as follows: The Y-axis of the volcano plot represents -log₁₀(p-value), where the magnitude of the values reflects statistical significance. Our RNA-seq data underwent rigorous multiple testing correction, and the adjusted p-values for some genes were extremely small, resulting in large values after -log₁₀ transformation. We have re-examined the data distribution and confirmed that the expanded Y-axis range is solely due to a small number of highly significant genes (as shown in the figure, the majority of genes remain clustered in the lower half of the Y-axis). This result accurately reflects the true data characteristics.

      We sincerely apologize for the inadvertent error in the original labeling of "Up/Down" in the figure legend. This has now been corrected, and we strictly adhere to the following threshold criteria: Significantly upregulated (Up): adjusted p-value < 0.05 and log₂(FC) ≥ 1. Significantly downregulated (Down): adjusted p-value < 0.05 and log₂(FC) ≤ -1. To ensure the reliability of our conclusions, we have rechecked the raw data, statistical analysis, and visualization process. We confirmed that all significant genes strictly meet the above threshold criteria and that the visualization accurately reflects the true results. The revised figure has been updated in the manuscript as Figure 5A. We deeply appreciate your valuable feedback, which has helped us correct the errors in the figure and improve its accuracy and readability.
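
      The threshold criteria stated above can be written as a small classification rule, together with the -log₁₀ transform used for the volcano plot's Y-axis. This is a minimal sketch, not the pipeline used in the study; the function names and the "NS" label for non-significant genes are hypothetical.

```python
import math


def classify_gene(adjusted_p, log2_fc, alpha=0.05, min_abs_log2_fc=1.0):
    """Label a gene 'Up', 'Down', or 'NS' using the thresholds quoted above."""
    if adjusted_p < alpha and log2_fc >= min_abs_log2_fc:
        return "Up"
    if adjusted_p < alpha and log2_fc <= -min_abs_log2_fc:
        return "Down"
    return "NS"


def volcano_y(p_value):
    """Return -log10(p), the quantity plotted on the volcano plot's Y-axis."""
    if p_value <= 0:
        raise ValueError("p_value must be positive")
    return -math.log10(p_value)
```

      Because the transform is logarithmic, a handful of genes with extremely small p-values is enough to stretch the Y-axis while most genes stay near the bottom, as described for Figure 5A.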

      Figure 6F: Based on the western blot, only Zc3h11a appears different.

      Thank you for your careful evaluation of the Western blot data in Figure 6F. We fully understand your concerns regarding the visual differences in PI3K and p-AKT/AKT bands and appreciate the opportunity to clarify the quantitative methodology and biological significance of these findings. Below we provide a detailed explanation of the experimental design and data analysis.

      First, the data for each group were derived from retinal samples of three independent mice, with all experiments performed in parallel to control for technical variability. Image analysis was conducted using ImageJ software with standardized settings for grayscale quantification. Zc3h11a and PI3K levels were normalized to GAPDH as an internal reference, while p-AKT levels were calculated as a ratio to total AKT. The results showed that Zc3h11a protein levels were significantly reduced (p < 0.01, Figures 6F and H), consistent with the expected effects of heterozygous knockout, with good agreement between visual and statistical results. For PI3K and p-AKT/AKT, the bands appeared visually similar for two reasons: the nonlinear nature of Western blot chemiluminescence signals in the saturation range, which compresses subtle quantitative differences in the images; and the fact that p-AKT represents only 5-15% of the total AKT pool, making small proportional changes difficult to discern visually. However, it is important to note that both PI3K and p-AKT/AKT showed statistically significant differences between groups (p < 0.001 and p < 0.01, respectively; Figures 6G and I). Furthermore, signal transduction pathways exhibit cascade amplification effects: in the PI3K-AKT pathway, even small changes in upstream proteins can produce significant downstream effects (e.g., NF-κB activation) through kinase cascades (Figure 6J). Additionally, our RNA-Seq results revealed activation of the PI3K-AKT signaling pathway in Zc3h11a Het-KO mice (Figure 5D), and the qRT-PCR results were consistent with the western blot results (Figure 6A-C). Your expert comments have prompted us to present these data differences with greater biological rigor. Although the visual differences are subtle, based on statistical significance, pathway characteristics, RNA sequencing, and qRT-PCR data, we believe these changes have biological relevance. We sincerely appreciate your commitment to data rigor and respectfully request your recognition of both the experimental results and the scientific logic of this study.
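
      The normalization described above (target band over GAPDH, and p-AKT over total AKT) can be sketched as a per-sample ratio followed by scaling to the control mean. This is an illustration under stated assumptions, not the authors' ImageJ workflow; the function names are hypothetical and the input lists stand for per-mouse grayscale densities.

```python
from statistics import mean


def normalize_to_reference(target_densities, reference_densities):
    """Divide each target band density by the matching reference band density."""
    if len(target_densities) != len(reference_densities):
        raise ValueError("target and reference must have one value per sample")
    return [t / r for t, r in zip(target_densities, reference_densities)]


def relative_to_control(sample_ratios, control_ratios):
    """Scale normalized ratios so that the control group mean equals 1."""
    control_mean = mean(control_ratios)
    return [ratio / control_mean for ratio in sample_ratios]
```

      The same two functions cover both cases: passing GAPDH densities as the reference gives the Zc3h11a or PI3K ratio, and passing total AKT densities gives the p-AKT/AKT ratio.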

      Figure 8: What is the role of ZC3H11A in this figure? Are the authors proposing that ZC3H11A regulates the translation of IκBα? They have not shown any evidence of that.

      Thank you for your insightful exploration of the role of ZC3H11A in Figure 8. We appreciate your critical review and hope to elucidate the mechanistic framework behind our findings. In Figure 8, Zc3h11a is depicted as a regulator of IκBα mRNA nucleocytoplasmic transport, a mechanism originally elucidated by Darweesh et al. Their studies demonstrated that Zc3h11a binds to IκBα mRNA and promotes its nuclear export. Loss of Zc3h11a results in nuclear retention of IκBα mRNA, leading to reduced cytoplasmic IκBα protein levels and subsequent hyperactivation of the NF-κB pathway. While the specific nuclear export mechanism has been elucidated by Darweesh et al., our study demonstrates that Zc3h11a haploinsufficiency results in decreased IκBα mRNA and protein levels in the retina (Figure 7), linking Zc3h11a haploinsufficiency to NF-κB pathway dysregulation in myopia and highlighting that these molecular insights can be extended to a new pathological context (myopia). Your critical comments have enhanced the clarity of our mechanistic concepts and we hope that these descriptions will demonstrate the importance of ZC3H11A as a new candidate gene for myopia.

    1. Author Response

      The following is the authors’ response to the previous reviews.

      eLife assessment

      This study presents valuable findings about synaptic connectivity among subsets of unipolar brush cells (UBCs), a specialized interneuron primarily located in the vestibular lobules of the cerebellar cortex. The evidence supporting the claims is interesting although incomplete in some areas. The work will be of interest to cerebellar neuroscientists as well as those focussed on synaptic properties and mechanisms. Although several compelling pieces of data were presented, substantial work remains to be conducted in order to confirm how the hypothesis and predictions of the manuscript play out in the actual brain circuit and how this would impact the processing of feedback or feedforward activity that would be required to promote behavior.

      Public Reviews:

      Reviewer #1 (Public Review):

      The manuscript by Hariani et al. presents experiments designed to improve our understanding of the connectivity and computational role of Unipolar Brush Cells (UBCs) within the cerebellar cortex, primarily lobes IX and X. The authors develop and cross several genetic lines of mice that express distinct fluorophores in subsets of UBCs, combined with immunocytochemistry that also distinguishes subtypes of UBCs, and they use confocal microscopy and electrophysiology to characterize the electrical and synaptic properties of subsets of so-labelled cells, and their synaptic connectivity within the cerebellar cortex. The authors then generate a computer model to test possible computational functions of such interconnected UBCs.

      Using these approaches, the authors report that:

      1. GRP-driven TDtomato is expressed exclusively in a subset (20%) of ON-UBCs, defined electrophysiologically (excited by mossy fiber afferent stimulation via activation of UBC AMPA and mGluR1 receptors) and immunocytochemically by their expression of mGluR1.

      2. mCitrine, which tags UBCs in Brainbow mouse line P079, is expressed in a similar minority subset of OFF-UBCs defined electrophysiologically (inhibited by mossy fiber afferent stimulation via activation of UBC mGluR2 receptors) and immunocytochemically by their expression of Calretinin. However, such mCitrine expression was also detected in some mGluR1 positive UBCs, which may not have shown up electrophysiologically because of the weaker fluorophore expression without antibody amplification.

      3. Confocal analysis of crossed lines of mice (GRP X P079) stained with antibodies to mGluR1 and calretinin documented the existence of all possible permutations of interconnectivity between cells (ON-ON, ON-OFF, OFF-OFF, OFF-ON), but their overall abundance was low, and neither their absolute nor their relative abundance was quantified.

      4. A computational model (NEURON) indicated that the presence of an intermediary UBC (in a polysynaptic circuit from MF to UBC to UBC) could prolong bursts (MF-ON-ON), prolong pauses (MF-ON-OFF), cause a delayed burst (MF-OFF-OFF), or cause a delayed pause (MF-OFF-ON), relative to solely MF to UBC synapses, which would simply exhibit long bursts (MF-ON) or long pauses (MF-OFF).

      The authors thus conclude that the pattern of interconnected UBCs provides an extended and more nuanced pattern of firing within the cerebellar cortex that could mediate longer lasting sensorimotor responses.

      The cerebellum's long known role in motor skills and reflexes, and associated disorders, combined with our nascent understanding of its role in cognitive, emotional, and appetitive processing, makes understanding its circuitry and processing functions of broad interest to the neuroscience and biomedical community. The focus on UBCs, which are largely restricted to vestibular lobes of the cerebellum reduces the breadth of likely interest somewhat. The overall design of specific experiments is rigorous and the use of fluorophore expressing mouse lines is creative. The data that is presented and the writing are clear. However, despite some additional analysis in response to the initial review, the overall experimental design still has issues that reduce overall interpretation (please see specific issues for details), which combined with a lack of thorough analysis of the experimental outcomes undermines the value of the NEURON model results and the advance in our understanding of cerebellar processing in situ (again, please see specific issues for details).

      Specific issues:

      1. All data gathered with inhibition blocked. All of the UBC response data (Fig. 1) was gathered in the presence of GABAAR and Glycine R blockers. While such an approach is appropriate generally for isolating glutamatergic synaptic currents, and specifically for examining and characterizing monosynaptic responses to single stimuli, it becomes problematic in the context of assaying synaptic and action potential response durations for long lasting responses, and in particular for trains of stimuli, when feed-forward and feed-back inhibition modulates responses to afferent stimulation. I.e. even for single MF stimuli, given the >500ms duration of UBC synaptic currents, there is plenty of time for feedback inhibition from Golgi cells (or feedforward, from MF to Golgi cell excitation) to interrupt AP firing driven by the direct glutamatergic synaptic excitation. This issue is compounded further for all of the experiments examining trains of MF stimuli. Beyond the impact of feedback inhibition on the AP firing of any given UBC, it would also obviously reduce/alter/interrupt that UBC's synaptic drive of downstream UBCs. This issue fundamentally undermines our ability to interpret the simulation data of Vm and AP firing of both the modeled intermediate and downstream UBC, in terms of applying it to possible cerebellar cortical processing in situ.

      The goal of Figure 1 was to determine the cell types of labeled UBCs in transgenic mouse lines, which is determined entirely by their synaptic responses to glutamate (Borges-Merjane and Trussell, 2015). Thus, blocking inhibition was essential to produce clear results in the characterization of GRP and P079 UBCs. While GABAergic/glycinergic feedforward and feedback inhibition is certainly important in the intact circuit, it was not our intention, nor was it possible, to study its contribution in the present study. Leaving inhibition unblocked does not lead to a physiologically realistic stimulation pattern in acute brain slices, because electrical stimulation produces synchronous excitation and inhibition by directly exciting Golgi cells, rather than their synaptic inputs. The main inhibition that UBCs receive that is crucial to determining burst or pause durations is not via GABA/glycine, but instead through mGluR2, which lasts for 100-1000s of milliseconds. The main excitation that drives UBC firing is mGluR1 and AMPA, which both last 100-1000s of milliseconds. Thus, these large conductances are unlikely to be significantly shaped by 1-10 ms IPSCs from feedforward and feedback GABA/glycine inhibition. Recent studies that examined the duration of bursting or pausing in UBCs had inhibition blocked in their experiments, presumably for the reasons outlined above (Guo et al., 2021; Huson et al., 2023).

      Below is an example showing the synaptic currents and firing patterns in an ON UBC before and after blocking inhibition. The GABA/glycinergic inhibition is fast, occurs soon after the stimuli and has little to no effect on the slow inward current that develops after the end of stimulation, which is what drives firing for 100s of milliseconds.

      Author response image 1.

      Example showing small effect of GABAergic and glycinergic inhibition on excitatory currents and burst duration. A) Excitatory postsynaptic currents in response to train of 10 presynaptic stimuli at 50 Hz before (black) and after (Grey) blocking GABA and glycine receptors. The slow inward current that occurs at the end of stimulation is little affected. B) Expanded view of the synaptic currents evoked during the train of stimuli. GABA/glycine receptors mediate the fast outward currents that occur immediately after the first couple stimuli. C) Three examples of the bursts caused by the 50 Hz stimulation in the same cell without blocking GABA and glycine receptors. D) Three examples in the same cell after blocking GABA and glycine receptors.

      The authors' response to the initial concern is (to paraphrase), "it's not possible to do and it's not important", neither of which is soundly justified.

      As stated in the original review, it is fully understandable and appropriate to use GABAAR/GlycineR antagonists to isolate glutamatergic currents, to characterize their conductance kinetics. That was not the issue raised. The issue raised was that then using only such information to generate a model of in situ behavior becomes problematic, given that feedback and lateral inhibition will sculpt action potential output, which of course will then fundamentally shape their synaptic drive of secondary UBCs, which will be further sculpted by their own inhibitory inputs. This issue undermines interpretation of the NEURON model.

      The argument that taking inhibition into account is not possible because of assumed or possible direct electrical excitation of Golgi cells is confusing for two interacting reasons. First, one can certainly stimulate the mossy fiber bundle to get afferent excitation of UBCs (and polysynaptic feedback/lateral inhibitory inputs) without directly stimulating the Golgi cells that innervate any recorded UBC. Yes, one might be stimulating some Golgi cells near the stimulating electrode, but one can position the stimulating electrode far enough down the white matter track (away from the recorded UBC), such that mossy fiber inputs to the recorded UBC can be stimulated without affecting Golgi cells near or synaptically connected to the recorded UBC. Moreover, if the argument were true, then presumably the stimulation protocol would be just as likely to directly stimulate neighboring UBCs, which then drove the recorded UBC's responses. Thus, it is both doable and should be ensured that stimulation of the white matter is distant enough to not be directly activating relevant, connected neurons within the granule cell layer.

      Finally, the authors present three examples of UBC recordings with and without inhibitory inputs blocked, and state "Thus, these large conductances are unlikely to be significantly shaped by 1-10 ms IPSCs from feedforward and feedback GABA/glycine inhibition" and "GABA/glycinergic inhibition...has little to no effect on the slow inward current that develops after the end of stimulation". This response reflects on original concerns about lack of quantification or consideration of important parameters. In particular, while the traces with and without inhibition are qualitatively similar, quantitative considerations indicate otherwise. First, unquantified examples are not adequate to drive conclusions. Regardless, the main issue (how inhibition affects actual responses in situ) is actually highlighted by the authors' current clamp recordings of UBC responses, before and after blocking inhibition. The output response is dramatically different, both at early and late time points, when inhibition is blocked. Again, a lack of quantification (of adequate n's) makes it hard to know exactly how important, but quick "eyeball" estimates of impact include: 1) a switch from only low frequency APs initially (without inhibition blocked) to immediate burst of high frequency APs (high enough to not discern individual APs with given figure resolution) when inhibition is blocked, 2) Slow rising to a peak EPSP, followed by symmetrical return to baseline (without inhibition blocked) versus immediate rise to peak, followed by prolonged decay to baseline (with inhibition blocked), 3) substantially shorter duration (~34% shorter) secondary high frequency burst (individual APs not discernible) of APs (with inhibition blocked versus without inhibition blocked), and 4) substantial reduction in number of long delayed APs (with inhibition blocked versus without inhibition blocked). Thus, clearly, feedback/lateral inhibition is actually sculpting AP output at all phases of the UBC response to trains of afferent stimulations. Importantly, the single voltage clamp trace showing little impact of transient IPSCs on the slow EPSC does not take into account likely IPSC influences on voltage-activated conductances that would not occur in voltage-clamp recordings but would be free to manifest in current clamp, and thereby influence AP output, as observed.

      So again, our ability to understand how interconnected UBCs behave in the intact system is undermined by the lack of consideration and quantification of the impact of inhibition, and it not being incorporated into the model. At the very least a strong proviso about lack of inclusion of such information, given the authors' data showing its importance in the few examples shown, should be added to the discussion.

      Thank you for this substantive explanation. Your points are well described and we agree that the single experiment shown is not strong evidence for a lack of importance of Golgi cell inhibition, especially on the temporal dynamics of spiking. Previous work has clearly shown that Golgi cells have several important roles in shaping the activity of the granular layer, including affecting the temporal dynamics of granule cell spikes. However, the work presented here focuses on the feedforward circuitry of UBCs and the large inward and large outward glutamatergic currents that drive spiking or pausing for 100s of milliseconds. Our model does not focus on the aspects that are most sensitive to Golgi cell inhibition, including timing of the first spikes in the UBC’s response. Nor does our model focus on short term plasticity, which we thought was reasonable because the slow currents in UBCs are quite insensitive to the temporal characteristics of glutamate release (See the example in the previous rebuttal). Our model does not include long term plasticity, which is also affected by Golgi cells. For these reasons we agree that the model presented does not explain how feedforward UBC circuits might “play out in the actual brain circuit and how it would impact the processing of feedback or feedforward activity that would be required to promote behavior.” We have included a new paragraph in the discussion clarifying the limitations of this study and the model, reproduced below.

      "Limitations of the model

      Here we addressed how feedforward glutamatergic excitation and inhibition is transformed from one UBC to the next depending on their subtype. The model focuses on AMPA receptor mediated excitation and mGluR2 mediated inhibition. One limitation of the model is that it does not consider feedforward and lateral inhibition from Golgi cells, which shape the spiking of UBCs in response to afferent stimulation. Golgi cells receive mossy fiber input and inhibit UBCs through their corelease of GABA and glycine (Dugue et al., 2005; Rousseau et al., 2012). Golgi cells control the temporal dynamics of the firing of granule cells as well as their gain (Rossi et al., 2003; Kanichay and Silver, 2008) and are critical to larger scale dynamics of the cerebellar cortical network (D'Angelo, 2008). Purkinje cells provide additional inhibition to ON UBCs that could influence how UBC circuits transform signals (Guo et al., 2016). A more complex model that implements Golgi cells and other critical circuit elements will be needed to investigate the role of feedforward UBC circuits in cerebellar network dynamics and motor behaviors in vivo."

      2. No consideration for involvement of polysynaptic UBCs driving UBC responses to MF stimulation in electrophysiology experiments. Given the established existence (in this manuscript and Dino et al. 2000 Neurosci, Dino et al. 2000 ProgBrainRes, Nunzi and Mugnaini 2000 JCompNeurol, Nunzi et al. 2001 JCompNeurol) of polysynaptic connections from MFs to UBCs to UBCs, the MF evoked UBC responses established in this manuscript, especially responses to trains of stimuli, could be mediated by direct MF inputs, or by polysynaptic UBC inputs, or possibly both (to my awareness not established either way). Thus the response durations could already include extension of duration by polysynaptic inputs, and so would overestimate the duration of monosynaptic inputs, and thus polysynaptic amplification/modulation, observed in the NEURON model.

      We are confident that the synaptic responses shown are monosynaptic for several reasons. UBCs receive a single mossy fiber input on their dendritic brush, and thus if our stimulation produces a reliable, short-latency response consistent with a monosynaptic input, then there is not likely to be a disynaptic input, because the main input is accounted for by the monosynaptic response. In all cells included in our data set, the fast AMPA receptor-mediated currents always occurred with short latency (1.24 ± 0.29 ms; mean ± SD; n = 13), high reliability (no failures to produce an EPSC in any of the 13 GRP UBCs in this data set), and low jitter (SD of latency; 0.074 ± 0.046 ms; mean ± SD; n = 13). These measurements have been added to the results section.
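
      The three criteria above (latency, jitter, reliability) can be computed per cell from trial-by-trial EPSC onset times. The following is a minimal sketch, assuming onset latencies have already been extracted for each trial and that a failure is marked as `None`; the function name and the sample values are hypothetical and are not taken from the dataset:

```python
from statistics import mean, stdev

def monosynaptic_metrics(latencies_ms):
    """Summarize one cell's trial-by-trial EPSC onset latencies.

    latencies_ms: list of floats (ms from stimulus to EPSC onset),
    with None for trials in which no EPSC was evoked (a failure).
    Needs at least two evoked trials to compute the jitter.
    """
    evoked = [t for t in latencies_ms if t is not None]
    return {
        "latency_ms": mean(evoked),   # mean onset latency
        "jitter_ms": stdev(evoked),   # SD of latency across trials
        "reliability": len(evoked) / len(latencies_ms),  # fraction of trials with an EPSC
    }

# Hypothetical trials for a single cell (illustrative values only).
trials = [1.20, 1.25, 1.22, 1.30, 1.18]
metrics = monosynaptic_metrics(trials)
```

      A short, consistent latency with low jitter and no failures is the pattern expected of a monosynaptic input, whereas a disynaptic input would tend to show longer and more variable latencies.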

      In some rare cases, we did observe disynaptic currents, which were easily distinguishable because a single electrical stimulation produced a burst of EPSCs at variable latencies. Please see example below. These cases of disynaptic input, which have been reported by others (Diño et al., 2000; Nunzi and Mugnaini, 2000; van Dorp and De Zeeuw, 2015) support the conclusion that UBCs receive input from other UBCs.

      Author response image 2.

      Example of GRP UBC with disynaptic input. Three examples of the effect of a single presynaptic stimulus (triangle) in a GRP UBC with presumed disynaptic input. Note the variable latency of the first evoked EPSC, bursts of EPSCs, and spontaneous EPSCs.

      Author response: "UBCs receive a single mossy fiber input on their dendritic brush, and thus if our stimulation produces a reliable, short-latency response consistent with a monosynaptic input, then there is not likely to be a disynaptic input."

      This statement is not congruent with the literature, with early work by Mugnaini and colleagues (Mugnaini et al. 1994 Synapse; Mugnaini and Flores 1994 J. Comp. Neurol.) indicating that UBCs are innervated by 1-2 mossy fibers, which are as likely other UBC terminals as MFs. This leaves open the possibility that so called monosynaptic responses do, as originally suggested, already include polysynaptic feedforward amplification of duration. While the authors also indicate that isolated disynaptic currents can be observed when they occur in isolation, a careful examination and objective documentation of "monosynaptic" responses would address this issue. Presumably, if potential disynaptic UBC inputs occur during a monosynaptic MF response, it would be detected as an abrupt biphasic inward/outward current, due to additional AMPA receptor activation but further desensitization of those already active (as observed by Kinney et al. 1997 J. Neurophysiol: "The delivery of a second MF stimulus at the peak of the slow EPSC evoked a fast EPSC of reduced amplitude followed by an undershoot of the subsequent slow current"). If such polysynaptic inputs are truly absent and are "rare" in isolation, some estimation of how common or not such synaptic amplification is, would improve our understanding of the overall significance of these inputs.

      We are confident that these currents are monosynaptic, because, as suggested, we carefully analyzed the latency, jitter and reliability, which was added to the previous revision. The latency and jitter are strong (quantitative) evidence that the first EPSC evoked was monosynaptic. While some UBCs have been reported to have multiple brushes, or brushes that branch and may contact multiple mossy fibers, or receive synaptic input onto their somas, these cases are rare in our experience in this age of mouse and there is no evidence for them in this dataset. For every trace we made a careful examination and documented that no delayed EPSCs were present. The presence of delayed EPSCs (or ‘abrupt biphasic inward/outward currents’ as described in Kinney et al. 1997) would indeed suggest the presence of disynaptic activity or multiple inputs to the UBC, but these would be easily identified, even during a stimulation train. For these reasons we feel that we have established that polysynaptic feedforward amplification of duration is not present.

      We agree that the monosynaptic responses could be due to the stimulation of UBC axons. However, the absence of delayed EPSCs again suggests that if stimulation of a presynaptic UBC axon was producing the currents in the recorded UBC, then the axon was severed from the soma and AIS, because this region is necessary for the cell to produce more than a single spike per stimulation. We added a sentence describing the potential for the monosynaptic EPSCs to be due to the stimulation of presynaptic UBC axons.

      Your point is well taken that a discussion of how common or rare these UBC to UBC connections are is necessary to more clearly explain how we interpret their significance, and we have expanded the paragraph in the discussion that does so. Thank you for this suggestion.

      3. Lack of quantification of subtypes of UBC interconnectivity. Given that it is already established that UBCs synapse onto other UBCs (see refs above), the main potential advance of this manuscript in terms of connectivity is the establishment and quantification of ON-ON, ON-OFF, OFF-ON, and OFF-OFF subtypes of UBC interconnections. However, the authors only establish that each type exists, showing specific examples; no quantification of the absolute or relative density was provided, and the authors' unquantified wording explicitly or implicitly states that they are not common. This lack of quantification and likely small number makes it difficult to know how important or what impact such synapses have on cerebellar processing, in the model and in situ.

      As noted by the reviewer, the connections between UBCs were rare to observe. We decided against attempting to quantify the absolute or relative density of connections for several reasons. A major reason for rare observations of anatomical connections between UBCs is likely the sparse labeling. First, the GRP mouse line only labels 20% of ON UBCs and we are unable to test whether postsynaptic connectivity of GRP ON UBCs is the same as that of the rest of the population of ON UBCs that are not labeled in the GRP mouse line. Second, the Brainbow reporter mouse only labels a small population of Cre expressing cells for unknown reasons. Third, the Brainbow reporter expression was so low that antibody amplification was necessary, which then limited the labeled cells to those close to the surface of the brain slices, because of known antibody penetration difficulties. Therefore, we refrained from estimating the density of these connections, because each of these variables reduced the labeling to unknown degrees and we reasoned that extrapolating our rare observations to the total population would be inaccurate.

      A paper that investigated UBC connectivity using organotypic slice cultures from P8 mice suggests that 2/3 of the UBC population receives UBC input, based on the observation that 2/3 of the mossy fibers did not degenerate as would be expected after 2 days in vitro if they were severed from a distant cell body (Nunzi and Mugnaini, 2000). It remains to be seen if this high proportion is due to the young age of these mice or is also the case in adult mice. Even if these connections are indeed rare, they are expected to have profound effects on the circuit, as each UBC has multiple mossy fiber terminals (Berthie and Axelrad, 1994), and mossy fiber terminals are estimated to contact 40 granule cells each (Jakab and Hamori, 1988). We have added a comment regarding this point to the discussion.

      To address this issue, the authors added the following text to the discussion section: "We did not estimate the density of these UBC to UBC connections, because the sparseness of labeling using these approaches made an accurate calculation impossible. Previous work using organotypic slice cultures from P8 mice estimated that 2/3 of the UBC population receives input from other UBCs (Nunzi & Mugnaini, 2000), although it is unclear whether this is the case in older mice."

      While accurate, the addition doesn't really address the situation, which is that apparently the reported connections are rare. Adding the information about 2/3 of UBCs having UBC inputs in culture, implies the opposite might be true (i.e. that they might be quite common), which is in contrast to the authors' data, so should be reworded for clarity, which should also incorporate the considerations covered in point #2 above. I.e. if the authors do establish that none of their recordings have polysynaptic inputs, and if they determine that the number of cells that showed isolated di-synaptic inputs is indeed rare, then it suggests that these specific polysynaptic connections are in fact rare.

      Thank you for pointing this out. We agree that adding this information is somewhat contradictory to our results and we have added more to this section in the discussion, provided below.

      Anatomically identifiable connections between UBCs were not present in all brain slices and finding them required a careful search. UBC labeling was sparse due to the highly specific genetic labeling techniques and further sparsification by the Brainbow reporter, which made it impossible to estimate the density of these UBC to UBC connections. Electrophysiological evidence suggests that UBC to UBC connections are not common, because spontaneous EPSCs that would indicate a spontaneously firing presynaptic UBC are only rarely observed in UBCs recorded in acute brain slices. In an analysis of feedforward excitation of granule layer neurons, only 4 out of 140 UBCs had this indirect evidence of a firing presynaptic UBC (van Dorp and De Zeeuw, 2015), which suggests that UBC to UBC connections may be rare. On the other hand, previous work using organotypic slice cultures from P8 mice estimated that 2/3 of the UBC population receives input from other UBCs (Nunzi & Mugnaini, 2000). This suggests a much higher density of UBC to UBC connections, but could be due to the young age of the brains used, which is before UBCs have matured (Morin et al., 2001), and also due to increased collateral sprouting that can occur in culture (Jaeger et al., 1988). Another study imaged 2-4 week old rat cerebellar slices at an electron microscopic level and found that 4 out of 14 UBC axon terminals contacted UBC brushes (Diño et al., 2000). Future work is necessary to accurately estimate the density and impact of these feedforward UBC circuits.

      4. Lack of critical parameters in NEURON model.

      A) The model uses # of molecules of glutamate released as the presumed quantal content, and this factor is constant.

      However, no consideration of changes in # of vesicles released from single versus trains of APs from MFs or UBCs is included. At most simple synapses, two sequential APs alters release probability, either up or down, and release probability changes dynamically with trains of APs. It is therefore reasonable to imagine UBC axon release probability is at least as complicated, and given the large surface area of contact between two UBCs, the number of vesicles released for any given AP is also likely more complex.

      B) The model does not include desensitization of AMPA receptors, which in the case of UBCs can paradoxically reduce response magnitude as vesicle release and consequent glutamate concentration in the cleft increases (Kinney et al. 1997 JNeurophysiol, Lu et al. 2017 Neuron, Balmer et al. 2021 eLife), as would occur with trains of stimuli at MF to ON-UBCs.

      A) The model produces synaptic AMPA and mGluR2 currents that reproduce those we recorded in vitro. We did not find it necessary to implement changes in glutamate release during a train as the model was fit to UBC data with the assumption that the glutamate transient did not change during the train. If there is a change in neurotransmitter release during a train, it is therefore built into the model, which has the advantage of reducing its complexity. UBCs are a special case where the postsynaptic currents are mediated mostly by the total amount of transmitter released. Most of the evoked current occurs tens to hundreds of milliseconds after neurotransmitter release and is therefore much more sensitive to total release and less sensitive to how it is released during the train. The figure below shows the effect of reducing the amount of glutamate released by 10% on each stimulus in the model. Despite a significant change in the pattern of neurotransmitter release, as well as a reduction in the total amount of glutamate, the slow EPSC still decays over the course of hundreds of milliseconds.

      B) The detailed kinetic AMPA receptor model used here accurately reproduces desensitization, which in fact mediates the slow ON UBC current. This AMPA receptor is a 13-state model, including 4 open states with 1-4 glutamates bound, 4 closed states with 1-4 glutamates bound, 4 desensitized states with 1-4 glutamates bound, and 5 closed states with 0-4 glutamates bound. The forward and reverse rates between different states in the model were fit to AMPA receptor currents recorded from dissociated UBCs and they accurately reproduced the ON UBC currents evoked by synaptic stimulation in our previous work (Balmer et al., 2021).

      Author response image 3.

      Effect of short-term depression of neurotransmitter release. A) The top trace shows the glutamate transient that drives the AMPA receptor model used in our study. No change in release is implemented, although the slow tail of the transient summates during the train. The bottom trace shows the modeled AMPA receptor mediated current. B) In this model the amount of glutamate released on each stimulus is reduced by 10%. The duration of the slow AMPA current is similar, despite a profound change in the pattern of neurotransmitter exposure.
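
      To give a sense of scale for panel B, the total glutamate delivered under a 10% per-stimulus reduction can be worked out directly. This sketch assumes that the reduction compounds (each stimulus releases 90% of the previous one) and assumes a 10-stimulus train; neither detail is stated for the model, so the numbers are illustrative only:

```python
def relative_release(n_stimuli, reduction=0.10):
    """Release on each stimulus relative to the first, assuming compounding depression."""
    return [(1.0 - reduction) ** k for k in range(n_stimuli)]

n = 10  # assumed train length, not taken from the model
depressed = relative_release(n)
total_fraction = sum(depressed) / n   # total release relative to a constant-release train
last_fraction = depressed[-1]         # release on the final stimulus relative to the first
print(f"total: {total_fraction:.2f}, last stimulus: {last_fraction:.2f}")
```

      Under these assumptions the depressed train delivers roughly 65% of the glutamate of a constant-release train, and the final stimulus releases about 39% of the first.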

      While the authors have not added the suggested additional parameters, their clarifications regarding the implications of existing parameters, and demonstration of reasonable fits to experimental data, and lack of substantial effect of simulating reduced vesicle release probability,

      5. Lack of quantification of various electrophysiological responses. UBCs are defined (ON or OFF) based on inward or outward synaptic response, but no information is provided about the range of the key parameter of duration across cells, which seems most critical to the current considerations. There is a similar lack of quantification across cells of AP duration in response to stimulation or current injections, or during baseline. The latter lack is particularly problematic because in agreement with previous publications, the raw data in Fig. 1 shows ON UBCs as quiescent until MF stimulation and OFF UBCs firing spontaneously until MF stimulation, but, for example, at least one ON UBC in the NEURON model is firing spontaneously until synaptically activated by an OFF UBC (Fig. 11A), and an OFF UBC is silent until stimulated by a presynaptic OFF UBC (Fig. 11C). This may be expected/explainable theoretically, but then such cells should be observed in the raw data.

      To address this reasonable concern of a general lack of quantification of electrophysiological responses we have added data characterizing the slow inward and outward currents evoked by synaptic stimulation in GRP and P079 UBCs in the results section and in new panels in Figure 1. We report the action potential pause lengths in P079 UBCs and burst lengths in ON UBCs in the results section. However, we favor the duration of the currents over the length of burst and pause, because the currents do not depend on a stable resting membrane potential, which is itself difficult to determine in intracellular recordings of these small cells. In a series of recent publications that focused on UBC firing, the authors argue that cell-attached recordings are necessary to determine accurately the burst and pause lengths, as well as spontaneous firing rates (Guo et al., 2021; Huson et al., 2023). (The trade-off of these extracellular recordings is that the monosynaptic nature of the input is nearly impossible to confirm.) Spontaneous firing rates were variable within both GRP and P079 UBCs from silent to firing regularly or in bursts, as previously reported (Kim et al., 2012; van Dorp and De Zeeuw, 2015). For clarity, we chose to model the GRP UBCs as silent unless receiving synaptic input and P079 UBCs as active unless receiving synaptic input. As the reviewer suggests, we have observed UBCs firing in the patterns similar to those shown in the model UBCs having input from spontaneous presynaptic UBCs. Below are some examples of spontaneous EPSCs and IPSCs in UBCs that suggest the presence of a presynaptic UBC.

      Author response image 4.

      Examples of UBCs that receive spontaneous input. A) Three ON UBCs that had spontaneous EPSCs, suggesting the presence of an active presynaptic UBC. B) Two OFF UBCs that had spontaneous outward currents.

      The authors have added additional analysis and discussion, which adequately addresses this concern.

      Reviewer #2 (Public Review):

      In this paper, the authors presented a compelling rationale for investigating the role of UBCs in prolonging and diversifying signals. Based on the two types of UBCs known as ON and OFF UBC subtypes, they have highlighted the existing gaps in understanding UBC connectivity and the need to investigate whether UBCs target UBCs of the same subtype, different subtypes, or both. This knowledge is important for understanding how sensory signals are extended and diversified in the granule cell layer.

      The authors designed very interesting approaches to study UBC connectivity by utilizing transgenic mice expressing GFP and RFP in UBCs, the Brainbow approach, immunohistochemical and electrophysiological analysis, and computational models to understand how the feed-forward circuits of interconnected UBCs transform their inputs.

      This study provided evidence for the existence of distinct ON and OFF UBC subtypes based on their electrophysiological properties, anatomical characteristics, and expression patterns of mGluR1 and calretinin in the cerebellum. The findings support the classification of GRP UBCs as ON UBCs and P079 UBCs as OFF UBCs and suggest the presence of synaptic connections between the ON and OFF UBC subtypes. In addition, they found that GRP and P079 UBCs form parallel and convergent pathways and have different membrane capacitance and excitability. Furthermore, they showed that UBCs of the same subtype provide input to one another and modify the input to granule cells, which could provide a circuit mechanism to diversify and extend the pattern of spiking produced by mossy fiber input. Accordingly, they suggested that these transformations could provide a circuit mechanism for maintaining a sensory representation of movement for seconds.

      Overall, the article is well written in a sound detailed format, very interesting with excellent discovery and suggested model.

      I believe the authors have provided appropriate responses and have consequently revised the manuscript in a convincing manner. Although I am not an expert in physiology, I find the explanations and clarifications to be acceptable.

    1. Author response:

      The following is the authors’ response to the previous reviews

      Reviewer #1 (Recommendations for the authors):

      Major comments

      (1) Line 201: The threshold of 0.25 was maintained to select enriched genes, which minimize the value of the GO term enrichment analyses. It may notably explain why the term phagosome is enriched in cluster 7, while experimental data indicate that cluster 7 is not phagocytic. In addition, the authors mentioned in the 1st response to reviewer that they would include DotPlot to illustrate the specificity of the genes corresponding to the main GO terms. This should notably include the ribosomal genes found enriched in cluster 4, which constitute the basis used by the authors to call cluster 4 the progenitor cluster.

      We appreciate the reviewer’s concern regarding our chosen log2FC threshold (0.25) for GO term enrichment. To assess the robustness of our approach, we tested more stringent thresholds (e.g., 0.5) and verified that our overall interpretations remain consistent. However, we acknowledge that certain GO terms, such as phagosome, may appear in clusters that are not primarily phagocytic. This is likely due to the fact that genes involved in vesicle trafficking, endo-lysosomal compartments and intracellular degradation processes overlap with those classically associated with phagocytosis.

      Therefore, the KEGG-based enrichment of phagosome in cluster 7 does not necessarily imply active phagocytosis but could instead reflect these alternative vesicular processes. As we show, cluster 7 corresponds to vesicular cells, which we named for the very high content of vesicular structures observed in cytology. As functional annotation based solely on transcriptomic data can sometimes lead to overinterpretation, we emphasize the importance of biological validation, which we have partially addressed through functional assays in this study.

      Regarding the specificity of ribosomal gene expression in cluster 4, we analyzed the distribution of ribosomal genes expressed across all clusters, as shown in Supplementary Figure S1-J. This analysis demonstrates that cluster 4 is specifically enriched in ribosome-related genes, reinforcing its characterization as a transcriptionally active population. Given that ribosomal gene expression is a key feature often associated with proliferative or metabolically active cells, these findings support our initial interpretation that cluster 4 may represent an undifferentiated or progenitor-like population.

      We acknowledge the reviewer’s suggestion to include a DotPlot to further illustrate the specificity of these genes in cluster 4. However, we believe that Supplementary Figure S1-J already effectively demonstrates this enrichment by presenting the percentage of ribosomal genes per cluster. A DotPlot representation would primarily convey the same information in a different format, but without providing additional insight into the specificity of ribosomal gene expression within cluster 4.

      (2) The lineage analysis is highly speculative and based on weak evidence. Initiating the hemocyte lineage at C4 is based on rRNA expression levels. C6 would constitute a better candidate, notably with the expression of PU-1, ELF2 and GATA3 that regulate progenitor differentiation in mammals (doi: 10.3389/fimmu.2019.00228, doi:10.1128/microbiolspec.mchd-0024-2, doi: 10.1098/rsob.180152) while C4 does not display any specific transcription factors (Figure 7I). In addition, the representation and interpretation of the transcriptome dynamics in the different lineages are erroneous. There are major inconsistencies between the data shown in the heatmaps Fig7C-H, Fig S10 and the dotplot in Fig7I. For example, Gata3 (G31054) and CgTFEB (G30997) illustrate the inconsistency. Fig S10C shows GATA3 going down from cluster 4 to cluster 6 while Fig 7I shows an increased level of expression in 6 compared to 4. CgTFEB (G30997) decreases from C4 to VC in Fig 7F while it increases according to Fig 7I. Lastly, Figure 7D: the UMAP shows a transition from C4 to C5 while the heatmap mentions C4 to C6 (I believe there is a mix-up with Figure 7E).

      We sincerely apologize for the inconsistencies noted between the different panels of Figure 7. These discrepancies resulted from using an incorrect matrix dataset during the initial representation. To address this issue, we have fully reprocessed the data and now provide a corrected and improved depiction of gene expression dynamics along the pseudotime trajectory. We are grateful to the reviewer for helping us correct these mistakes.

      In the revised version, we offer a comprehensive and consistent representation of expression level variations for key genes identified by the Monocle3 algorithm. Supplementary Figure S10 now presents the average expression variation of these significant genes as a function of pseudotime. Based on this dataset, we carefully selected representative genes to construct panels C to H of Figure 7, ensuring coherence across all figures. These updated panels show both average expression levels and the percentage of expressing cells along the pseudotime trajectory, providing a clearer interpretation of transcriptomic dynamics.

      We appreciate the reviewer’s helpful feedback regarding our lineage analysis and the suggestion that cluster 6 might be a more appropriate progenitor based on the expression of mammalian-like transcription factors such as PU-1, ELF2, and GATA3. Below, we clarify our rationale for choosing cluster 4 as the root of the pseudotime and discuss the functional implications of the identified transcription factors.

      We can hypothesize that clusters 4, 5, or 6 could each potentially represent early progenitor-like states, as these three clusters are transcriptionally close (Lines 539-541). These clusters have not yet been conclusively identified in terms of classical hemocyte morphology, and they appear to arise from ABL- or BBL-type cells. Our decision to root the pseudotime at cluster 4 was motivated by its strong expression of core transcription and translation genes, suggesting a particular stage of translation activity that was not observed for cluster 5 or cluster 6. Clusters 5 and 6 may correspond to a similar population of cells, most probably Blast-Like cells at different stages of the cell cycle or of differentiation engagement.

      Although cluster 6 expresses PU-1, ELF2, and GATA3, which are known regulators of haematopoietic progenitor differentiation in vertebrates, it is essential to highlight that structural homology does not necessarily imply functional equivalence. Moreover, the expression of PU-1, ELF2, and GATA3 does not strictly characterize a population as “undifferentiated” or progenitor-like. Studies such as those by Buenrostro et al. (Cell, 2018) have demonstrated that these transcription factors can remain active in or reemerge during more lineage-committed stages. For instance, PU-1 is essential for myeloid and B-cell differentiation, GATA3 is involved in T-lymphocyte lineage commitment (though transiently expressed in early progenitors), and ELF2 participates in lineage-specific pathways. Thus, their presence does not imply a primitive state but rather highlights their broader functional roles in guiding and refining lineage decisions. Functional annotation of these transcription factors in invertebrate systems remains speculative, particularly as morphological or molecular markers specific to these early hemocyte lineages are not yet fully established. Further functional assays (e.g., knockdown/overexpression or lineage tracing using cells (ABL and BBL) from clusters 4, 5 and 6) will be necessary to determine which hemocyte population harbors progenitor properties and differentiation potential.

      To further address the reviewer’s concern, we performed complementary pseudotime analyses by initiating Monocle 3 trajectories from clusters 4, 5, and 6 individually, as well as collectively (4/5/6). These analyses (see attached figure) confirm that the overall differentiation topology remains unchanged regardless of the selected root, consistently revealing two main pathways: one leading to hyalinocytes and the other to the granular lineage (ML, SGC, and VC). This consistency strongly suggests that clusters 4, 5, and 6 represent related pools of progenitor-like cells. Therefore, choosing cluster 4 based on its transcription/translation readiness does not alter the inferred branching architecture of hemocyte differentiation.

      We appreciate the reviewer’s suggestions, which have helped us improve our manuscript and clarify our rationale.

      Author response image 1.

      Representation of the trajectories obtained from Monocle3 analysis using different pseudotime origins, showing that changing the rooting did not alter the overall differentiation topology. (A) Pathways identified with cluster 4, (B) cluster 5, (C) cluster 6, and (D) cluster 4/5/6 origins.

      (3) Concerning the AMP expression analysis in Figure 6: the qPCR data show that Cg-BPI and Cg-Defh are expressed broadly in all fractions including 6 and 7, which is in conflict with the statement Line 473 indicating that SGC (fractions 6 and 7) is not expressing AMP. In addition, this analysis should be combined with the expression profile of all AMP in the scRNAseq data (list available in 10.1016/j.fsi.2015.02.040).

      We thank the reviewer for highlighting this point. We acknowledge that the qPCR data show expression of Cg-BPI and Cg-Defh across all fractions, including fractions 6 and 7 corresponding to SGC. However, our conclusion that SGCs do not express antimicrobial peptides (AMPs) was based on a correlation analysis rather than direct detection of AMPs in granular cells. Specifically, the qPCR experiments were designed to measure AMP expression levels in fractionated hemocyte populations relative to a control sample of whole hemolymph. We then performed a correlation analysis between AMP expression levels and the proportion of each hemocyte type in the fractions. This approach allowed us to infer a lower expression of AMP in granular cells, as reflected in the heatmap presented in Figure 6.
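      The logic of this correlation approach can be illustrated with a minimal sketch. It is not the analysis code used in the study: the seven fraction values below are hypothetical numbers chosen only to show how a transcript can be detected in every fraction while its relative level still falls as the share of one cell type rises. The helper `pearson_r` and both variable names are assumptions introduced for illustration.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Hypothetical values for seven hemocyte fractions (illustration only).
# Share of small granule cells (SGC) counted in each fraction:
sgc_proportion = [0.02, 0.05, 0.08, 0.15, 0.30, 0.60, 0.75]
# AMP expression in each fraction relative to whole hemolymph (qPCR-style ratio):
amp_rel_expr = [1.8, 1.6, 1.5, 1.1, 0.9, 0.5, 0.4]

# Expression is non-zero in every fraction, yet it decreases as SGCs dominate,
# so the correlation coefficient is negative.
r = pearson_r(sgc_proportion, amp_rel_expr)
print(f"Pearson r between SGC share and AMP level: {r:.2f}")
```

      A negative coefficient in such a sketch would be read as lower AMP expression in fractions dominated by that cell type, not as a complete absence of the transcript.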

      Regarding the suggestion to integrate AMP expression profiles from scRNA-seq data, we noted in the manuscript that the limited sequencing depth of our scRNA-seq analysis was insufficient to accurately detect AMP expression (Lines 472-473: “However, due to the limited sequencing depth, the scRNA-seq analysis was not sensitive enough to reveal AMP expression.”). Additionally, many of the known AMPs of Crassostrea gigas are not annotated in the genome, further complicating their identification within the scRNA-seq dataset. As a result, we were unable to perform the requested integration of AMP expression profiles from scRNA-seq data.

      (4) The transcription factor expression analysis is descriptive and the interpretation too partial. These data should be compared with other systems. Most transcription factors show functional conservation, notably in the inflammatory pathways, which can provide valuable information to understand the function of the clusters 5 and 6 for which limited data are available.

      We appreciate the reviewer’s suggestion to compare the identified transcription factors with other systems. However, since we did not perform a detailed phylogenetic analysis of the transcription factors identified in our dataset, we refrain from making assumptions about their functional conservation across species. Our analysis aims to provide a descriptive overview of transcription factor expression patterns in hemocyte clusters, which serves as a foundation for future functional studies. While transcription factor profiles may provide insights into the potential roles of clusters 5 and 6, assigning precise functions based solely on bioinformatic predictions remains speculative. Further experimental validation, including functional assays and evolutionary analyses, would be necessary to confirm the roles of these transcription factors, which is beyond the scope of the present study.

      Minor comments

      Line 212-213: the text should be reformulated. In the result part, it is more important to mention that the reannotation is based on conserved proteins functions than to mention the tool Orson.

      We have reworded this section to emphasize that the updated annotation is function-based, using Orson primarily as the bioinformatics tool for improved GO annotation. We now place the emphasis on the conserved protein functions underlying the reannotation. Lines 212-215: “Using the Orson pipeline (see Materials and Methods), these files were used to extract and process the longest CDSs for GO-term annotation, and we then reannotated each predicted protein by sequence homology, assigning putative functions and improving downstream GO-term analyses.”

      Figure 2: I would recommend homogenizing the two Dotplot representation with the same color gradient and representing the gene numbers in both case.

      We appreciate the reviewer’s suggestion to improve the clarity and consistency of Figure 2. In response, we have homogenized the color gradients across the two DotPlot representations and have included gene numbers in both cases to ensure a more uniform and informative visualization.

      Table 2: pct1 and pct2 should be presented individually like in table 1

      We now present these columns separately (pct1, pct2) as in Table 1, so readers can compare the fraction of expressing cells in each cluster more transparently.

      Line 403-414: how many cells were quantified for the phagocytic experiments ?

      We have added the exact number of cells that were counted to determine phagocytic indices and the number of technical/biological replicates. The text at line 411 was modified: “Macrophage-like cells and small granule cells showed a phagocytic activity of 49 % and 55 %, respectively, and a phagocytosis index of 3.5 and 5.2 particles per cell respectively (Fig. 5B and Supp. Fig. 7B), as confirmed in 3 independent experiments examining a total of 2,807 cells.”

      Line 458: for copper staining, how many cells and how many replicates were done for the quantification ?

      We have specified the number of hemocytes and number of independent replicates used when quantifying rhodanine-stained (copper-accumulating) cells. At line 458, the following text was added: “and a total of 1,562 cells were examined across three independent experiments.”

      Line 461: what are the authors referring to when mentioning the link between copper homeostasis and scRNAseq?

      Single-cell RNA sequencing (scRNA-seq) analysis revealed an upregulation of several copper transport-related genes, including G4790 (a copper transporter) with a 2.7 log2FC and a pct ratio of 42, as well as the divalent cation transporters G5864 (zinc transporter ZIP10) and G4920 (zinc transporter 8), specifically in cluster 3 cells identified as small granule cells. These findings reinforce a potential role for this cluster in metal homeostasis.

      We modified lines 462-467 as follows: “These results provide functional evidence that small granule cells (SGCs) are specialized in metal homeostasis in addition to phagocytosis, as suggested by the scRNA-seq data identifying cluster 3. Specifically, single-cell RNA sequencing revealed an upregulation of copper transport-related genes, including G4790 (a copper transporter) with a 2.7 log2FC and a pct ratio of 42, reinforcing the role of SGCs in copper homeostasis (see Supp. File S1).”

      Line 611: it would be nice to display the enrichment of the phagocytic receptor in cluster 3 (dotplot or feature plot) to illustrate the comment.

      We appreciate the reviewer’s insightful suggestion regarding a more comprehensive analysis of phagocytic receptors. While a full inventory is beyond the scope of this study, we acknowledge the value of such an approach and hope that our findings will serve as a foundation for future investigations in this direction.

      Although we have highlighted certain phagocytic receptors (e.g., a scavenger receptor domain-containing gene) in our scRNA-seq dataset, it is beyond the scope of the current study to inventory all phagocytosis-related receptors in the C. gigas genome, which itself would be a substantial undertaking. Moreover, single-cell RNA sequencing captures only about 15–20% of each cell’s mRNA, so we inherently lose a significant portion of the transcriptome, further limiting our ability to pinpoint all relevant phagocytic receptor genes. Adding more figures to cover every candidate receptor would risk overloading this paper, thus we focus on the most prominent examples. A promising approach for more exhaustive analysis would involve efficiently isolating granulocytes (e.g., via Percoll gradient) and performing targeted RNA-seq on this cell population to thoroughly explore genes involved in phagocytosis.

      Line 640-644: the authors mentioned that ML may be able to perform ETosis based on the oxidative burst. This hypothesis requires further evidence. Are other markers of ETosis expressed in this cell type?

      We agree that additional experimental evidence (e.g., detection of histone citrullination, extracellular DNA networks) is necessary to confirm ETosis in molluscan immune cells. We present ML-mediated ETosis only as a speculative possibility based on oxidative burst capacity, because previous work has shown that ETosis is inhibited by NADPH inhibitors (Poirier et al. 2014). Nevertheless, the expression of histones in the macrophage-like cluster (cluster 1) reinforces this possibility, as histone modifications play a key role in chromatin decondensation during ETosis.

      Reviewer #2 (Recommendations for the authors):

      Figure 1: In Figure 1B, the cell clusters are named 1 to 7, whereas in Figure 1C they are displayed as clusters 0 to 6. There is a mismatch between the identification of the clusters.

      We thank the reviewer for identifying this inconsistency. The cluster numbering has been corrected to ensure consistency between Figures 1B and 1C.

      Figure 2B: the font size could be increased for greater clarity.

      We thank the reviewer for this suggestion. The font size in Figure 2B has been increased to improve clarity and readability.

      Line 221: "Figures 2B, C and D" appears to refer to Figure S2 rather than the main Figure 2.

      The text has been corrected to properly reference the figure.

      Line 754: "Anopheles gambiae" should be italicised

      We thank the reviewer for pointing this out. "Anopheles gambiae" has been italicized accordingly.

      Bibliography

      Integrated Single-Cell Analysis Maps the Continuous Regulatory Landscape of Human Hematopoietic Differentiation. Buenrostro, Jason D. et al. Cell, Volume 173, Issue 6, 1535 - 1548.e16

      Antimicrobial Histones and DNA Traps in Invertebrate Immunity

      Poirier, Aurore C. et al. Journal of Biological Chemistry, Volume 289, Issue 36, 24821 - 24831

    1. Author response:

      The following is the authors’ response to the previous reviews.

      Public Reviews:

      Reviewer #1 (Public review):

      Summary:

      Shen et al. conducted three experiments to study the cortical tracking of the natural rhythms involved in biological motion (BM), and whether these involve audiovisual integration (AVI). They presented participants with visual (dot) motion and/or the sound of a walking person. They found that EEG activity tracks the step rhythm, as well as the gait (2-step cycle) rhythm. The gait rhythm specifically is tracked superadditively (power for A+V condition is higher than the sum of the A-only and V-only condition, Experiments 1a/b), which is independent of the specific step frequency (Experiment 1b). Furthermore, audiovisual integration during tracking of gait was specific to BM, as it was absent (that is, the audiovisual congruency effect) when the walking dot motion was vertically inverted (Experiment 2). Finally, the study shows that an individual's autistic traits are negatively correlated with the BM-AVI congruency effect.

      Strengths:

      The three experiments are well designed and the various conditions are well controlled. The rationale of the study is clear, and the manuscript is pleasant to read. The analysis choices are easy to follow, and mostly appropriate.

      Weaknesses:

      There is a concern of double-dipping in one of the tests (Experiment 2, Figure 3: interaction of Upright/Inverted X Congruent/Incongruent). I raised this concern on the original submission, and it has not been resolved properly. The follow-up statistical test (after channel selection using the interaction contrast permutation test) still is geared towards that same contrast, even though the latter is now being tested differently. (Perhaps not explicitly testing the interaction, but in essence still testing the same.) A very simple solution would be to remove the post-hoc statistical tests and simply acknowledge that you're comparing simple means, while the statistical assessment was already taken care of using the permutation test. (In other words: the data appear compelling because of the cluster test, but NOT because of the subsequent t-tests.)

      We are sorry that we did not explain this issue clearly before, which might have caused some misunderstanding. When performing the cluster-based permutation test, we only tested whether the audiovisual congruency effect (congruent vs. incongruent) between the upright and inverted conditions was significantly different [i.e., (UprCon – UprInc) vs. (InvCon – InvInc)], without conducting extra statistical analyses on whether the congruency effect was significant in each orientation condition. Such an analysis yielded a cluster with a significant interaction between audiovisual integration and BM orientation for the cortical tracking effect at 1Hz (but not at 2Hz). However, this does not provide valid information about whether the audiovisual congruency effect at this cluster is significant in each orientation condition, given that a significant interaction effect may result from various patterns of data across conditions: such as significant congruency effects in both orientation conditions (Author response image 1a), a significant congruency effect in the upright condition and a non-significant effect in the inverted condition (Author response image 1b), or even non-significant yet opposite effects in the two conditions (Author response image 1c). Here, our results conform to the second pattern, indicating that cortical tracking of the high-order gait cycles involves a domain-specific process exclusively engaged in the AVI of BM. In a similar vein, the non-significant interaction found at 2Hz does not necessarily indicate that the congruency effect is non-significant in each orientation condition (Author response image 1f&e). Indeed, the congruency effect was significant in both the upright and inverted conditions at 2Hz in our study despite the non-significant interaction, suggesting that neural tracking of the lower-order step cycles is associated with a domain-general AVI process mostly driven by temporal correspondence in physical stimuli.

      Therefore, we need to perform subsequent t-tests to examine the significance of the simple effects in the two orientation conditions, which do not duplicate the cluster-based permutation test (for interaction only) and cause no double-dipping. Results from interaction and simple effects, put together, provide solid evidence that the cortical tracking of higher-order and lower-order rhythms involves BM-specific and domain-general audiovisual processing, respectively.
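      The distinction between the interaction contrast and the simple effects can be made concrete with a small sketch. It is an illustration under stated assumptions rather than the analysis pipeline of the study: the amplitudes are hypothetical values for eight participants at a single cluster, and a plain paired t statistic stands in for the cluster-based permutation test. The function `paired_t` and all variable names are introduced here for illustration only.

```python
from math import sqrt
from statistics import mean, stdev


def paired_t(a, b):
    """Paired-samples t statistic for the difference a - b (df = n - 1)."""
    diffs = [x - y for x, y in zip(a, b)]
    return mean(diffs) / (stdev(diffs) / sqrt(len(diffs)))


# Hypothetical per-participant amplitudes (arbitrary units, illustration only).
upr_con = [1.30, 1.10, 1.45, 1.20, 1.35, 1.15, 1.40, 1.25]  # upright, congruent
upr_inc = [1.00, 0.95, 1.05, 1.00, 1.10, 0.90, 1.05, 1.00]  # upright, incongruent
inv_con = [1.02, 0.98, 1.05, 0.97, 1.08, 0.93, 1.04, 1.01]  # inverted, congruent
inv_inc = [1.00, 1.00, 1.02, 1.00, 1.05, 0.95, 1.06, 0.99]  # inverted, incongruent

# Congruency effect within each orientation.
upr_effect = [c - i for c, i in zip(upr_con, upr_inc)]
inv_effect = [c - i for c, i in zip(inv_con, inv_inc)]

# Interaction contrast: (UprCon - UprInc) vs. (InvCon - InvInc).
t_interaction = paired_t(upr_effect, inv_effect)

# Simple effects: congruent vs. incongruent within each orientation.
t_upright = paired_t(upr_con, upr_inc)
t_inverted = paired_t(inv_con, inv_inc)

print(f"interaction t = {t_interaction:.2f}")
print(f"upright simple effect t = {t_upright:.2f}")
print(f"inverted simple effect t = {t_inverted:.2f}")
```

      In this toy data set the interaction contrast and the upright simple effect are both large while the inverted simple effect is close to zero, which corresponds to the second pattern described above. Other data sets could give the same interaction with a different pair of simple effects.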

      To avoid ambiguity, we have removed the sentence “We calculated the audiovisual congruency effect for the upright and the inverted conditions” (line 194, which referred to the calculation of the indices rather than any statistical tests) from the manuscript. We have also clarified the meanings of the findings based on the interaction and simple effects together at the two temporal scales, respectively (Lines 205-207; Lines 213-215).

      Author response image 1.

      Examples of different patterns of data yielding a significant or nonsignificant interaction effect.

      Reviewer #2 (Public review):

      Summary:

      The authors evaluate spectral changes in electroencephalography (EEG) data as a function of the congruency of audio and visual information associated with biological motion (BM) or non-biological motion. The results show supra-additive power gains in the neural response to gait dynamics, with trials in which audio and visual information was presented simultaneously producing higher average amplitude than the combined average power for auditory and visual conditions alone. Further analyses suggest that such supra-additivity is specific to BM and emerges from temporoparietal areas. The authors also find that the BM-specific supra-additivity is negatively correlated with autism traits.

      Strengths:

      The manuscript is well-written, with a concise and clear writing style. The visual presentation is largely clear. The study involves multiple experiments with different participant groups. Each experiment involves specific considered changes to the experimental paradigm that both replicate the previous experiment's finding yet extend it in a relevant manner.

      Weaknesses:

      In the revised version of the paper, the manuscript better relays the results and anticipates analyses, and this version adequately resolves some concerns I had about analysis details. Still, it is my view that the findings of the study are basic neural correlate results that do not provide insights into neural mechanisms or the causal relevance of neural effects towards behavior and cognition. The presence of an inversion effect suggests that the supra-additivity is related to cognition, but that leaves open whether any detected neural pattern is actually consequential for multi-sensory integration (i.e., correlation is not causation). In other words, the fact that frequency-specific neural responses to the [audio & visual] condition are stronger than those to [audio] and [visual] combined does not mean this has implications for behavioral performance. While the correlation to autism traits could suggest some relation to behavior and is interesting in its own right, this correlation is a highly indirect way of assessing behavioral relevance. It would be helpful to test the relevance of supra-additive cortical tracking on a behavioral task directly related to the processing of biological motion to justify the claim that inputs are being integrated in the service of behavior. Under either framework, cortical tracking or entrainment, the causal relevance of neural findings toward cognition is lacking.

      Overall, I believe this study finds neural correlates of biological motion, and it is possible that such neural correlates relate to behaviorally relevant neural mechanisms, but based on the current task and associated analyses this has not been shown.

      Thank you for providing these thoughtful comments regarding the theoretical implications of our neural findings. Previous behavioral evidence highlights the specificity of the audiovisual integration (AVI) of biological motion (BM) and reveals the impairment of such ability in individuals with autism spectrum disorder. However, the neural implementation underlying the AVI of BM, its specificity, and its association with autistic traits remain largely unknown. The current study aimed to address these issues.

      It is noteworthy that the operation of multisensory integration does not always depend on specific tasks, as our brains tend to integrate signals from different sensory modalities even when there is no explicit task. Hence, many studies have investigated multisensory integration at the neural level without examining its correlation with behavioral performance. For example, the widely known super-additivity mode for multisensory integration proposed by Perrault and colleagues was based on single-cell recording findings without behavioral tasks (Perrault et al., 2003, 2005). As we mentioned in the manuscript, the super-additive and sub-additive modes indicate non-linear interaction processing, either with potentiated neural activation to facilitate the perception or detection of near-threshold signals (super-additive) or a deactivation mechanism to minimize the processing of redundant information cross-modally (sub-additive) (Laurienti et al., 2005; Metzger et al., 2020; Stanford et al., 2005; Wright et al., 2003). Meanwhile, the additive integration mode represents a linear combination between two modalities. Distinguishing among these integration modes helps elucidate the neural mechanism underlying AVI in specific contexts, even though sometimes, the neural-level AVI effects do not directly correspond to a significant behavioral-level AVI effect (Ahmed et al., 2023; Metzger et al., 2020). In the current study, we unveiled the dissociation of multisensory integration modes between neural responses at two temporal scales (Exps. 1a & 1b), which may involve the cooperation of a domain-specific and a domain-general AVI process (Exp. 2). While these findings were not expected to be captured by a single behavioral index, they revealed the multifaceted mechanism whereby hierarchical cortical activity supports audiovisual BM integration. They also advance our understanding of the emerging view that multi-timescale neural dynamics coordinate multisensory integration (Senkowski & Engel, 2024), especially from the perspective of natural stimuli processing.
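      The three integration modes discussed above can be summarised in a short sketch that compares the audiovisual response with the linear sum of the unimodal responses. This is a simplified illustration, not the statistical procedure used in the study: the function name `integration_mode`, the `tolerance` margin, and the example amplitudes are all hypothetical, and in practice the comparison would rest on a statistical test across participants rather than on a fixed margin.

```python
def integration_mode(av, a, v, tolerance=0.05):
    """Classify an audiovisual response relative to the sum of unimodal responses.

    av, a, v: response amplitudes for the audiovisual, auditory-only and
    visual-only conditions (same units).
    tolerance: hypothetical relative margin around the linear sum within which
    the response is treated as additive.
    """
    linear_sum = a + v
    if av > linear_sum * (1 + tolerance):
        return "super-additive"  # AV exceeds A + V (non-linear enhancement)
    if av < linear_sum * (1 - tolerance):
        return "sub-additive"  # AV falls below A + V (non-linear reduction)
    return "additive"  # AV is approximately A + V (linear combination)


# Hypothetical amplitudes, for illustration only.
for av in (1.0, 0.7, 0.5):
    print(av, integration_mode(av, a=0.3, v=0.4))
```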

      Meanwhile, our finding that the cortical tracking of higher-order rhythmic structure in audiovisual BM specifically correlated with individual autistic traits extends previous behavioral evidence that ASD children exhibited reduced orienting to audiovisual synchrony in BM (Falck-Ytter et al., 2018), offering new evidence that individual differences in audiovisual BM processing are present at the neural level and associated with autistic traits. This finding opens the possibility of utilizing the cortical tracking of BM as a potential neural marker to assist the diagnosis of autism spectrum disorder (see more details in our Discussion Lines 334-346).

      However, despite the main objective of the current study focusing on the neural processing of BM, we agree with the reviewer that it would be helpful to test the relevance of supra-additive cortical tracking on a behavioral task directly related to BM perception, for further justifying that inputs are being integrated in the service of behavior. In the current study, we adopted a color-change detection task entirely unrelated to audiovisual correspondence but only for maintaining participants’ attention. The advantage of this design is that it allows us to investigate whether and how the human brain integrates audiovisual BM information under task-irrelevant settings, as people in daily life can integrate such information even without a relevant task. However, this advantage is accompanied by a limitation: the task does not facilitate the direct examination of the correlation between neural responses and behavioral performance, since the task performance was generally high (mean accuracy >98% in all experiments). Future research could investigate this issue by introducing behavioral tasks more relevant to BM perception (e.g., Shen et al., 2023). They could also apply advanced neuromodulation techniques to elucidate the causal relevance of the cortical tracking effect to behavior (e.g., Kösem et al., 2018, 2020).

      We have discussed the abovementioned points as a separate paragraph in the revised manuscript (Lines 322-333). In addition, since the scope of the current study does not involve a causal link to behavioral performance, we have removed or modified the descriptions related to "functional relevance" in the manuscript (Abstract; Introduction, lines 101-103; Results, line 239; Discussion, line 336; Supplementary Information, lines 794 and 803). Moreover, we have strengthened the descriptions of the theoretical implications of the current findings in the abstract.

      We hope these changes adequately address your concern.

      References

      Ahmed, F., Nidiffer, A. R., O’Sullivan, A. E., Zuk, N. J., & Lalor, E. C. (2023). The integration of continuous audio and visual speech in a cocktail-party environment depends on attention. Neuroimage, 274, 120143. https://doi.org/10.1016/j.neuroimage.2023.120143

      Falck-Ytter, T., Nyström, P., Gredebäck, G., Gliga, T., Bölte, S., & the EASE team. (2018). Reduced orienting to audiovisual synchrony in infancy predicts autism diagnosis at 3 years of age. Journal of Child Psychology and Psychiatry, 59(8), 872–880. https://doi.org/10.1111/jcpp.12863

      Kösem, A., Bosker, H., Jensen, O., Hagoort, P., & Riecke, L. (2020). Biasing the Perception of Spoken Words with Transcranial Alternating Current Stimulation. Journal of Cognitive Neuroscience, 32, 1–10. https://doi.org/10.1162/jocn_a_01579

      Kösem, A., Bosker, H. R., Takashima, A., Meyer, A., Jensen, O., & Hagoort, P. (2018). Neural Entrainment Determines the Words We Hear. Current Biology, 28(18), 2867-2875.e3. https://doi.org/10.1016/j.cub.2018.07.023

      Laurienti, P. J., Perrault, T. J., Stanford, T. R., Wallace, M. T., & Stein, B. E. (2005). On the use of superadditivity as a metric for characterizing multisensory integration in functional neuroimaging studies. Experimental Brain Research, 166(3), 289–297. https://doi.org/10.1007/s00221-005-2370-2

      Metzger, B. A., Magnotti, J. F., Wang, Z., Nesbitt, E., Karas, P. J., Yoshor, D., & Beauchamp, M. S. (2020). Responses to Visual Speech in Human Posterior Superior Temporal Gyrus Examined with iEEG Deconvolution. The Journal of Neuroscience: The Official Journal of the Society for Neuroscience, 40(36), 6938–6948. https://doi.org/10.1523/JNEUROSCI.0279-20.2020

      Perrault, T. J., Vaughan, J. W., Stein, B. E., & Wallace, M. T. (2003). Neuron-Specific Response Characteristics Predict the Magnitude of Multisensory Integration. Journal of Neurophysiology, 90(6), 4022–4026. https://doi.org/10.1152/jn.00494.2003

      Perrault, T. J., Vaughan, J. W., Stein, B. E., & Wallace, M. T. (2005). Superior Colliculus Neurons Use Distinct Operational Modes in the Integration of Multisensory Stimuli. Journal of Neurophysiology, 93(5), 2575–2586. https://doi.org/10.1152/jn.00926.2004

      Senkowski, D., & Engel, A. K. (2024). Multi-timescale neural dynamics for multisensory integration. Nature Reviews Neuroscience, 25(9), 625–642. https://doi.org/10.1038/s41583-024-00845-7

      Shen, L., Lu, X., Wang, Y., & Jiang, Y. (2023). Audiovisual correspondence facilitates the visual search for biological motion. Psychonomic Bulletin & Review, 30(6), 2272–2281. https://doi.org/10.3758/s13423-023-02308-z

      Stanford, T. R., Quessy, S., & Stein, B. E. (2005). Evaluating the Operations Underlying Multisensory Integration in the Cat Superior Colliculus. Journal of Neuroscience, 25(28), 6499–6508. https://doi.org/10.1523/JNEUROSCI.5095-04.2005

      Wright, T. M., Pelphrey, K. A., Allison, T., McKeown, M. J., & McCarthy, G. (2003). Polysensory Interactions along Lateral Temporal Regions Evoked by Audiovisual Speech. Cerebral Cortex, 13(10), 1034–1043. https://doi.org/10.1093/cercor/13.10.1034

    1. Author response:

      The following is the authors’ response to the current reviews.

      Reviewer #1 (Public review):

      Summary:

      In this work, Qiu and colleagues examined the effects of preovulatory (i.e., proestrous or late follicular phase) levels of circulating estradiol on multiple calcium and potassium channel conductances in arcuate nucleus kisspeptin neurons. Although these cells are strongly linked to a role as the "GnRH pulse generator," the goal here was to examine the physiological properties of these cells in a hormonal milieu mimicking late proestrus, the time of the preovulatory GnRH-LH surge. Computational modeling is used to manipulate multiple conductances simultaneously and support a role for certain calcium channels in facilitating a switch in firing mode from tonic to bursting. CRISPR knockdown of the TRPC5 channel reduced overall excitability, but this was only examined in cells from ovariectomized mice without estradiol treatment.

      Comments to address most recent author response:

      The concern regarding the CRISPR experiments being confined to OVX mice is that the results can only suggest that CRISPR-mediated knockdown of TRPC5 can, at best, phenocopy the OVX+E condition. A reciprocal experiment in the opposite direction (for example, that returning TRPC5 to OVX levels in OVX+E mice prevents the changes in firing activity and pattern typical of the OVX+E2 condition) would strengthen the indication that E2-sensitive changes in TRPC5 expression and function are critically important to surge function. Acknowledging this as a limitation of the studies would help to better contextualize the value of the CRISPR experiments to an understanding of surge mechanisms when done only in OVX conditions.

      We have noted in the manuscript that “It would be of interest in future experiments to do the reciprocal experiment to see if overexpressing Trpc5 channels in Kiss1ARH neurons from OVX + E2 females restores the RMP and “rescues” the synchronization phenotype.”

      The nature of the confusion regarding the consideration of OVX+E2 conditions in the computational model primarily arises from the methods description in the supplemental file: "The effect of E2 on ionic currents is modelled as a change in the maximum conductance parameter. For currents IM,IT, ICa and ITRPC5 this change is inferred from the qPCR data assuming that the conductance is directly proportional to the mRNA expression." If these were instead based on the whole-cell recordings as the authors now indicate in their response, then this description needs to be edited and clarified accordingly. Furthermore, the section states, "For ISK, IBK, Ileak, the OVX and OVX+E2 conductances are obtained from current-voltage relationships recorded from Kiss1ARH neurons in the absence/presence of iberiotoxin (BK blocker) and apamin (SK blocker). All other currents were assumed to be unaffected by E2." This section thus does not directly indicate that the recordings in the stated figures were used in the model, and moreover suggests that currents besides ISK, IBK, and Ileak were not different in OVX+E2 conditions.

      The prior evidence stated for correlation of mRNA and channel conductance is not explicitly cited in the manuscript. It is well known that post-translational modifications, physiological modulation of individual channel biophysical properties, and many other factors can influence the end output of a membrane conductance. Therefore, the authors should, at minimum, provide a literature citation supporting the assumption used here.

      We have re-written the paragraph on “Modelling the effects of E2” in the Supplemental Information (now Appendix 1) to clarify that the modeling was based on a combination of electrophysiological recordings and the qPCR data presented in this and previous publications. The statement that “all other currents were assumed to be unaffected by E2” was a misstatement and has been deleted. As per the reviewer’s request, we have listed seven publications that document the correlation between the mRNA expression and channel conductance for the various channels. We thank the reviewer for the suggestion.
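
      The proportional-scaling assumption discussed in this exchange can be sketched as follows. This is a minimal illustration and not the authors' model code: the function name `scale_conductance` and all numerical values are hypothetical placeholders. Only the rule itself (the maximum conductance in OVX+E2 is the OVX value scaled in direct proportion to relative mRNA expression) is taken from the methods description quoted by the reviewer.

```python
def scale_conductance(g_ovx, mrna_ovx, mrna_e2):
    """Return the OVX+E2 maximum conductance, assuming it is directly
    proportional to mRNA expression (the assumption under discussion)."""
    if mrna_ovx <= 0:
        raise ValueError("baseline mRNA expression must be positive")
    return g_ovx * (mrna_e2 / mrna_ovx)


# Hypothetical values in arbitrary units, for illustration only.
g_trpc5_ovx = 2.0
g_trpc5_e2 = scale_conductance(g_trpc5_ovx, mrna_ovx=1.0, mrna_e2=0.25)
print(g_trpc5_e2)
```

      Under this rule, any post-translational modulation of the channel is ignored, which is why the reviewer asks for supporting citations and for confirmation against recordings made in OVX+E2 conditions.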

      Reviewer #2 (Public review):

      Summary:

      Kisspeptin neurons of the arcuate nucleus (ARC) are thought to be responsible for the pulsatile GnRH secretory pattern and to mediate feedback regulation of GnRH secretion by estradiol (E2). Evidence in the literature, including the work of the authors, indicates that ARC kisspeptin coordinate their activity through reciprocal synaptic interactions and the release of glutamate and of neuropeptide neurokinin B (NKB), which they co-express. The authors show here that E2 regulates the expression of genes encoding different voltage-dependent calcium channels, calcium-dependent potassium channels and canonical transient receptor potential (TRPC5) channels and of the corresponding ionic currents in ARC kisspeptin neurons. Using computer simulations of the electrical activity of ARC kisspeptin neurons, the authors also provide evidence of what these changes translate into in terms of these cells' firing patterns. The experiments reveal that E2 upregulates various voltage-gated calcium currents as well as 2 subtypes of calcium-dependent potassium currents, while decreasing TRPC5 expression (an ion channel downstream of NKB receptor activation), the slow excitatory synaptic potentials (slow EPSP) elicited in ARC kisspeptin neurons by NKB release and expression of the G protein-associated inward-rectifying potassium channel (GIRK). Based on these results, and on those of computer simulations, the authors propose that E2 promotes a functional transition of ARC kisspeptin neurons from neuropeptide-mediated sustained firing that supports coordinated activity for pulsatile GnRH secretion to a less intense burst-like firing pattern that could favor glutamate release from ARC kisspeptin. The authors suggest that the latter might be important for the generation of the preovulatory surge in females.

      Strengths:

      The authors combined multiple approaches in vitro and in silico to gain insights into the impact of E2 on the electrical activity of ARC kisspeptin neurons. These include patch-clamp electrophysiology combined with selective optogenetic stimulation of ARC kisspeptin neurons, reverse transcriptase quantitative PCR, pharmacology and CRISPR-Cas9-mediated knockdown of the Trpc5 gene. The addition of computer simulations for understanding the impact of E2 on the electrical activity of ARC kisspeptin cells is also a strength.

      The authors add interesting information on the complement of ionic currents in ARC kisspeptin neurons and on their regulation by E2 to what was already known in the literature. Pharmacological and electrophysiological experiments appear of the highest standards and robust statistical analyses are provided throughout. The impact of E2 replacement on calcium and potassium currents is compelling. Likewise, the results of Trpc5 gene knockdown do provide good evidence that the TRPC5 channel plays a key role in mediating the NKB-mediated slow EPSP. Surprisingly, this also revealed an unsuspected role for this channel in regulating the membrane potential and excitability of ARC kisspeptin neurons.

      Weaknesses:

      The manuscript also has weaknesses that obscure some of the conclusions drawn by the authors.

      One is that the authors compare here two conditions, OVX versus OVX replaced with high E2, that may not reflect the physiological conditions under which the proposed transition between neuropeptide-dependent sustained firing and less intense burst firing might take place (i.e. the diestrous [low E2] and proestrous [high E2] stages of the estrous cycle). This is an important caveat to keep in mind when interpreting the authors' findings. Indeed, that E2 alters certain ionic currents when added back to OVX females, does not mean that the magnitude of all of these ionic currents will vary during the estrous cycle.

      We do know that the slow EPSP, which is generated by TRPC5 channels, tracks beautifully with the steroid state of female mice. Using our E2 treatment paradigm that generates an LH surge in OVX females (left panel in Author response image 1), there is no difference in the amplitude of the slow EPSP in proestrous versus OVX + E2 females (right panel in Author response image 1).

      Author response image 1.

      In addition, although the computational modeling indicates a role of the various E2-modulated conductances in causing a transition in ARC kisspeptin neuron firing pattern, their role is not directly tested in physiological recordings, weakening the link between these changes and the shift in firing patterns.

      In future experiments we will test directly the physiological contribution of the other E2-modulated conductances in causing the transition in the firing pattern of arcuate Kiss1 neurons using CRISPR/SaCas9 technology as we have documented for the TRPC5 channel (e.g., Figures 11 and 12).

      Overall, the manuscript provides interesting information about the effects of E2 on specific ionic currents in ARC kisspeptin neurons and some insights into the functional impact of these changes. However, some of the conclusions of the work, with regard, in particular, to the role of these changes in ion channels and to their implications for the LH surge, are not fully supported by the findings.

      ---------

      The following is the authors’ response to the previous reviews.

      Public Reviews:

      Reviewer #1 (Public Review):

      Summary:

      In this work, Qiu and colleagues examined the effects of preovulatory (i.e., proestrous or late follicular phase) levels of circulating estradiol on multiple calcium and potassium channel conductances in arcuate nucleus kisspeptin neurons. Although these cells are strongly linked to a role as the "GnRH pulse generator," the goal here was to examine the physiological properties of these cells in a hormonal milieu mimicking late proestrus, the time of the preovulatory GnRH-LH surge. Computational modeling is used to manipulate multiple conductances simultaneously and support a role for certain calcium channels in facilitating a switch in firing mode from tonic to bursting. CRISPR knockdown of the TRPC5 channel reduced overall excitability, but this was only examined in cells from ovariectomized mice without estradiol treatment. The manuscript has been substantially improved from the initial version by the addition of new experiments and clarification of important figures. Importantly, the overlap of data with previous reports from the same group has been corrected.

      Strengths:

      (1) Examination of multiple types of calcium and potassium currents, both through electrophysiology and molecular biology.

      (2) Focus on arcuate kisspeptin neurons during the surge is relatively conceptually novel as the anteroventral periventricular nucleus (AVPV) kisspeptin neurons have received much more attention as the "surge generator" population.

      (3) The modeling studies allow for direct examination of manipulation of single and multiple conductances, whereas the electrophysiology studies necessarily require examination of each current in isolation. Construction of an arcuate kisspeptin neuron model promises to be of value to the reproductive neuroendocrinology field.

      Weaknesses:

      A remaining weakness in this revised version of the manuscript is that the relevance of the CRISPR experiments is still rather tenuous given that the goal is to understand what happens in the estrogen-treatment condition, and these experiments were performed only in OVX mice. Similar concerns reflect that the computational model examining the effect of E2 infers multiple conductances based on qPCR data and an assumption that the conductances are directly proportional to the level of gene expression, and then tunes these to the current recordings obtained from OVX mice, without a direct confirmation in OVX+E2 conditions that the model parameters accurately reflect the properties of these currents in the presence of estrogen.

      We are still puzzled by the reviewer’s concerns about performing CRISPR mutagenesis of Trpc5 in OVX+E2 females. Trpc5 channel expression is significantly reduced by E2 treatment (Figure 10E), which we know translates into a minimal slow EPSP (Figure 2, Qiu eLife 2016) that is essentially equivalent in amplitude to the slow EPSP following Trpc5 mutagenesis in ovariectomized females (Figure 12). TRPC5 channel conductance is already at “rock bottom.” The modeling informs us that such a low TRPC5 conductance will not support a long-lasting slow EPSP and sustained firing (Figure 13A).

      Also, we respectfully point out that we have published a score of papers over the past 20 years showing that the channel conductance does correlate with the mRNA expression (e.g., Qiu et al., eLife 2018). Secondly, the model does take into consideration the OVX + E2 conditions (Figure 13B,C), which are based on the extensive whole-cell recordings presented in Figures 4, 5, 6, 7, 8 and 9.

      Reviewer #2 (Public Review):

      Summary:

      Kisspeptin neurons of the arcuate nucleus (ARC) are thought to be responsible for the pulsatile GnRH secretory pattern and to mediate feedback regulation of GnRH secretion by estradiol (E2). Evidence in the literature, including the work of the authors, indicates that ARC kisspeptin coordinate their activity through reciprocal synaptic interactions and the release of glutamate and of neuropeptide neurokinin B (NKB), which they co-express. The authors show here that E2 regulates the expression of genes encoding different voltage-dependent calcium channels, calcium-dependent potassium channels and canonical transient receptor potential (TRPC5) channels and of the corresponding ionic currents in ARC kisspeptin neurons. Using computer simulations of the electrical activity of ARC kisspeptin neurons, the authors also provide evidence of what these changes translate into in terms of these cells' firing patterns. The experiments reveal that E2 upregulates various voltage-gated calcium currents as well as 2 subtypes of calcium-dependent potassium currents while decreasing TRPC5 expression (an ion channel downstream of NKB receptor activation), the slow excitatory synaptic potentials (slow EPSP) elicited in ARC kisspeptin neurons by NKB release and expression of the G protein-associated inward-rectifying potassium channel (GIRK). Based on these results, and on those of computer simulations, the authors propose that E2 promotes a functional transition of ARC kisspeptin neurons from neuropeptide-mediated sustained firing that supports coordinated activity for pulsatile GnRH secretion to a less intense burst-like firing pattern that could favor glutamate release from ARC kisspeptin. The authors suggest that the latter might be important for the generation of the preovulatory surge in females.

      Strengths:

      The authors combined multiple approaches in vitro and in silico to gain insights into the impact of E2 on the electrical activity of ARC kisspeptin neurons. These include patch-clamp electrophysiology combined with selective optogenetic stimulation of ARC kisspeptin neurons, reverse transcriptase quantitative PCR, pharmacology and CRISPR-Cas9-mediated knockdown of the Trpc5 gene. The addition of computer simulations for understanding the impact of E2 on the electrical activity of ARC kisspeptin cells is also a strength.

      The authors add interesting information on the complement of ionic currents in ARC kisspeptin neurons and on their regulation by E2 to what was already known in the literature. Pharmacological and electrophysiological experiments appear of the highest standards and robust statistical analyses are provided throughout. The impact of E2 replacement on calcium and potassium currents is compelling. Likewise, the results of Trpc5 gene knockdown do provide good evidence that the TRPC5 channel plays a key role in mediating the NKB-mediated slow EPSP. Surprisingly, this also revealed an unsuspected role for this channel in regulating the membrane potential and excitability of ARC kisspeptin neurons.

      Weaknesses:

      The manuscript also has weaknesses that obscure some of the conclusions drawn by the authors.

      One is that the authors compare here two conditions, OVX versus OVX replaced with high E2, that may not reflect the physiological conditions under which the proposed transition between neuropeptide-dependent sustained firing and less intense burst firing might take place (i.e. the diestrous [low E2] and proestrous [high E2] stages of the estrous cycle). This is an important caveat to keep in mind when interpreting the authors' findings. Indeed, that E2 alters certain ionic currents when added back to OVX females, does not mean that the magnitude of all of these ionic currents will vary during the estrous cycle.

      Unfortunately, mice are a poor reproductive model since female mice do not have a clear follicular (estradiol-driven) phase distinct from the luteal (progesterone-driven) phase. Had we utilized a “proestrous” female, we could not with certainty distinguish between the effects of estradiol versus progesterone on the expression of the calcium and potassium channels that were the focus of this study. Therefore, using our physiological model we can state with confidence that “estradiol elicits distinct firing patterns in arcuate nucleus kisspeptin neurons….”

      Overall, the manuscript provides interesting information about the effects of E2 on specific ionic currents in ARC kisspeptin neurons and some insights into the functional impact of these changes. However, some of the conclusions of the work, with regard, in particular, to the role of these changes in ion channels and their implications for the LH surge, are not fully supported by the findings.

      As we pointed out in the Discussion, the O’Byrne lab has clearly shown the relevance of Kiss1ARH neuronal burst firing and the release of glutamate to its effects on the LH surge:

      “Rather, we postulate that glutamate neurotransmission is more important for excitation of Kiss1AVPV/PeN neurons and facilitating the GnRH (LH) surge with high circulating levels of E2 when peptide neurotransmitters are at a nadir and glutamate levels are high in female Kiss1ARH neurons. Indeed, low frequency (5 Hz) optogenetic stimulation of Kiss1ARH neurons, which only releases glutamate in E2-treated, ovariectomized females (Qiu J. et al., 2016), generates a surge-like increase in LH release during periods of optical stimulation (Lin et al., 2021; Voliotis et al., 2021).  In a subsequent study optical stimulation of Kiss1ARH neuron terminals in the AVPV at 20 Hz, a frequency commonly used for terminal stimulation in vivo, generated a similar surge of LH (Shen et al., 2022).  Additionally, intra-AVPV infusion of glutamate antagonists, AP5+CNQX, completely blocked the LH surge induced by Kiss1ARH terminal photostimulation in the AVPV (Shen et al., 2022).”

      Recommendations for the authors:

      Reviewer #2 (Recommendations for The Authors):

      The reviewer noted the following in the revised manuscript:

      - page 6, the authors may consider adding that presynaptic effects of blocking calcium channels on the slow EPSP cannot be fully ruled out. Indeed, the added experiments do indicate that some of the effects can be explained by impaired regulation of TRPC5 channels by calcium influx through calcium channels; however, the senktide-induced current is not fully blocked by the broad-spectrum calcium channel inhibitor cadmium, suggesting that the effect of blocking these channels on the slow EPSP may involve other mechanisms, such as presynaptic effects.

      Optogenetic stimulation of all Kiss1ARH neurons induces the release of NKB at “physiological” concentrations, which in turn generates a slow EPSP in the recorded Kiss1ARH neuron. Blocking voltage-gated calcium channels can inhibit the NKB release from presynaptic Kiss1ARH neurons, thereby reducing the amplitude of the slow EPSP. However, in whole-cell recordings of synaptically isolated Kiss1ARH neurons, senktide directly induces a large inward current (Figure 3F), which is generated by the opening of TRPC5 channels (Qiu et al. J. Neurosci 2021). Voltage-gated calcium channels are coupled to the activation of TRPC5 channels (Blair, Kaczmarek and Clapham, J. Gen Physiol 2009), so by blocking voltage-gated calcium channels, cadmium effectively abrogates the facilitating effects of these channels on TRPC5 channel activation and significantly reduces but does not abolish the inward (excitatory) current (Figures 3F-H). We have clarified in the Results (page 6) that the Kiss1ARH neurons were synaptically isolated as depicted in Figures 3F,G.

      - page 8, bottom, the mean value given for the apamin-sensitive current amplitude in E2 treated females does not match that plotted on the I/V graph in Figure 7F.

      Thank you for pointing out this typographical error, which we have corrected.

    1. Author Response

      The following is the authors’ response to the original reviews.

      Reviewer #1 (Public Review):

      This is an interesting and somewhat unusual paper supporting the idea that creatine is a neurotransmitter in the central nervous system of vertebrates. The idea is not entirely new, and the authors carefully weigh the evidence, both past and newly acquired, to make their case. The strength of the paper lies in the importance of the potential discovery - as the authors point out, creatine ticks more boxes on criteria of neurotransmitters than some of the ones listed in textbooks - and the list of known transmitters (currently 16) certainly is textbook material. A further strength of the manuscript is the careful consideration of a list of criteria for transmitters and newly acquired evidence for four of these criteria: 1. evidence that creatine is stored in synaptic vesicles, 2. mutants for creatine synthesis and a vesicular transporter show reduced storage and release of creatine, 3. functional measurement that creatine release has an excitatory or inhibitory (here inhibitory) effect in vivo, and 4. ATP-dependence. The key weakness of the paper is that there is no single clear 'smoking gun', like a postsynaptic creatine receptor, that would really demonstrate the function as a transmitter. Instead, the evidence is of a cumulative nature, and not all bits of evidence are equally strong. On balance, I found the path to discovery and the evidence assembled in this manuscript to establish a clear possibility, positive evidence, and to provide a foundation for further work in this direction.

      It is notable that, historically, no neurotransmitter has ever been established in a single paper. While creatine will not be an exception, the data presented in this paper go further than any previous paper in demonstrating the possibility of a new neurotransmitter. However, we added an entire paragraph in the Discussion about differences between Cr and classic neurotransmitters such as Glu, beginning with the absence of a molecularly defined receptor at this point and the Ca2+-independent component of Cr release induced by extracellular K+.

      We thank the reviewer for noting that the evidence we obtained now supports the conclusion that creatine satisfies all 4 criteria of transmitters.

      We respectfully disagree with the point about a smoking gun: any of these four is a smoking gun, while the satisfaction of all 4 is quite strong, more than a smoking gun.

      We disagree that a receptor “would really demonstrate the function as a transmitter”. Textbook criteria for a transmitter usually require postsynaptic responses, not a molecularly defined receptor. Molecularly defining the receptors for many of the known transmitters required many years of work, and those transmitters were accepted as such before their receptors were finally molecularly defined. As long as there is a postsynaptic response, there is of course a receptor, though its molecular properties should be further studied. For example, responses to choline were discovered in 1900 (Hunt, Am J Physiol 3, xviii-xix, 1900), those to acetylcholine in 1906 (Hunt and Taveau, Br Med J 2:1788-1789, 1906), and those to suprarenal glands before 1894 (Oliver and Schäfer, J Physiol 18:230-276, 1895). Henry Dale was awarded a Nobel prize in 1936 partly for his work on acetylcholine. Receptors for acetylcholine and noradrenaline were not molecularly defined until the 1970s and 1980s. Before then, they were known only through the responses they mediated to natural transmitters and synthesized chemicals.

      There were two previous reports that creatine could be taken into brain slices (Almeida et al., 2006) or synaptosomes (Peral, Vázquez-Carretero and Ilundain, 2010). These were used by the reviewer to argue that the idea of creatine as a neurotransmitter “is not entirely new”. However, no one has followed up these studies for 10 years, thus they would not be considered as good smoking guns. While we have reproduced the synaptosome uptake result (together with our new finding that this uptake was dependent on SLC6A8), it should be noted that uptake of molecules into synaptosomes is not absolutely required for a neurotransmitter because degradation of a transmitter is equally valid. Furthermore, molecules required synaptically but not as a transmitter can also be transported into the synaptic terminal.

      Our detection of Cr in the synaptic vesicles provides much stronger evidence supporting its importance. If a smoking gun is important, the detection of creatine in the SVs is the best smoking gun; indeed, this discovery was what led us to study its release and postsynaptic responses and to repeat the uptake experiment with genetic mutants.

      Reviewer #2 (Public Review):

      Summary:

      Bian et al studied creatine (Cr) in the context of central nervous system (CNS) function. They detected Cr in synaptic vesicles purified from mouse brains with anti-Synaptophysin using capillary electrophoresis-mass spectrometry. Cr levels in the synaptic vesicle fraction were reduced in mice lacking the Cr synthetase AGAT, or the Cr transporter SLC6A8. They provide evidence for Cr release within several minutes after treating brain slices with KCl. This KCl-induced Cr release was partially calcium-dependent and was attenuated in slices obtained from AGAT and SLC6A8 mutant mice. Cr application also decreased the excitability of cortical pyramidal cells in one third of the cells tested. Finally, they provide evidence for SLC6A8-dependent Cr uptake into synaptosomes, and ATP-dependent Cr loading into synaptic vesicles. Based on these data, the authors propose that Cr may act as a neurotransmitter in the CNS.

      Strengths:

      1) A major strength of the paper is the broad spectrum of tools used to investigate Cr.

      2) The study provides strong evidence that Cr is present in/loaded into synaptic vesicles.

      Weaknesses:

      (in sequential order)

      1) Are Cr levels indeed reduced in Agat-/-? The decrease in Cr IgG in Agat-/- (and Agat+/-) is similar to the corresponding decrease in Syp (Fig. 3B). What is the explanation for this? Is the decrease in Cr in Agat-/- significant when considering the drop in IgG? The data should be normalized to the respective IgG control.

      We measured the Cr concentration in whole-brain lysates using a Creatine Assay Kit (Sigma, MAK079). Cr levels in the brain were reduced in Agat-/- mice: the Cr concentration in AGAT-/- mice was about 1/10 of that in AGAT+/+ and AGAT+/- mice (Author response image 1).

      Author response image 1.

      Cr concentration in brain from AGAT+/+, AGAT+/- and AGAT-/- mice (n=5 male mice for each group). *, p<0.05; ***, p<0.001, one-way ANOVA with Tukey’s correction.

      As pointed out by the reviewer, the decrease in Cr IgG in Agat-/- seems similar to the corresponding decrease in Syp (Fig. 3B in the paper). Cr pulled down by IgG was 0.46 ± 0.04, 0.37 ± 0.06 and 0.17 ± 0.03 pmol/μg anti-Syp antibody for Agat+/+, Agat+/-, and Agat-/- mice, respectively. There was a trend toward reduced Cr pulled down by IgG in Agat-/- mice; however, there were no statistically significant differences between Agat-/- and Agat+/+, or between Agat-/- and Agat+/-, as determined by one-way ANOVA (Fig. 3B in the paper). Because the Cr concentration in the brain is reduced in Agat-/- mice, we speculate that the apparent drop in Cr pulled down by IgG may have partially resulted from the overall reduction of Cr content in the brain.

      The absolute content of Cr pulled down by Syp in Agat-/- mice was reduced to 21.6% of Agat+/+ mice and 23.6% of Agat+/- mice (Fig. 3B in the paper). As suggested by the reviewer, we normalized the Cr pulled down by Syp to the respective IgG control (Author response image 2). The normalized Cr content in AGAT-/- mice has a tendency to decrease, but not statistically significant, as compared to Agat+/+ and Agat+/- mice (n=10 for each group, one-way ANOVA).

      Author response image 2.

      Normalized Cr content in brain from AGAT+/+, AGAT+/- and AGAT-/- mice (n=10 for each group). Cr pulled down by anti-Syp antibody was normalized to that of IgG.
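
      As a rough cross-check of this normalization, the group means quoted above can be combined directly. The sketch below is only an illustration: it works with ratios of group means (0.46 and 0.17 pmol/μg for IgG; 21.6% for Syp), not with the per-sample normalization plotted in Author response image 2, and the function and variable names are hypothetical.

```python
# Group means of Cr pulled down by control IgG, as quoted in the response
# (pmol Cr per ug anti-Syp antibody).
igg_mean = {"Agat+/+": 0.46, "Agat+/-": 0.37, "Agat-/-": 0.17}

# Cr pulled down by anti-Syp in Agat-/- relative to Agat+/+ (21.6%, Fig. 3B).
syp_ko_over_wt = 0.216


def normalized_fold_change(syp_ratio, igg_ko, igg_wt):
    """Return (Syp/IgG in KO) / (Syp/IgG in WT), using group means only."""
    igg_ratio = igg_ko / igg_wt
    return syp_ratio / igg_ratio


igg_ratio = igg_mean["Agat-/-"] / igg_mean["Agat+/+"]
fold = normalized_fold_change(
    syp_ko_over_wt, igg_mean["Agat-/-"], igg_mean["Agat+/+"]
)
print(f"IgG KO/WT = {igg_ratio:.2f}; normalized Syp/IgG KO/WT = {fold:.2f}")
```

      On these means the IgG signal in Agat-/- is roughly 37% of that in Agat+/+, so the normalized ratio comes out near 0.58 rather than 0.216. A ratio of means is not the same as a mean of per-sample ratios, so this figure is indicative only.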

      2) The data supporting that depolarization-induced Cr release is SLC6A8 dependent is not convincing because the relative increase in KCl-induced Cr release is similar between SLC6A8-/Y and SLC6A8+/Y (Fig. 5D). The data should be also normalized to the respective controls.

      As suggested by the reviewer, we normalized the Cr release during KCl stimulation to the baseline (Author response image 3). The ratio of Cr release evoked by high KCl stimulation to the baseline was similar in WT and Slc6a8 knockouts. This suggests that Cr is not released through SLC6A8 transporter.

      Author response image 3.

      Normalized Cr release from slices from Slc6a8+/Y and Slc6a8-/Y mice (n=7 slices for each group). Cr release evoked by high KCl stimulation was normalized to baseline.

      However, without Slc6a8, KCl-induced release of Cr was significantly reduced (Figure 5D in the paper). This is because Slc6a8 is a transporter for Cr uptake into synaptic terminals (Figure 5D and 8C in the paper). Therefore, the reduced Cr content in SVs (Figure 2C in the paper) indirectly reduced Cr release.

      3) The majority (almost 3/4) of depolarization-induced Cr release is Ca2+ independent (Fig. 5G). Furthermore, KCl-induced, Ca2+-independent release persists in SLC6A8-/Y (Fig. 5G). What is the model for Ca2+-independent Cr release? Why is there Ca2+-independent Cr release from SLC6A8 KO neurons? How does this relate to the prominent decrease in Ca2+-dependent Cr release in SLC6A8-/Y (Fig. 5G)? They show a prominent decrease in Cr control levels in SLC6A8-/Y in Fig. 5D. Were the data shown in Fig. 5D obtained in the presence or absence of Ca2+? Could the decrease in Ca2+-dependent Cr release in SLC6A8-/Y (Fig. 5G) be due to decreased Cr baseline levels in the presence of Ca2+ (Fig. 5D)?

      These are interesting questions that, at this point, can only be answered by reference to the literature. For example, one possibility is that Ca2+-independent Cr release occurs in glia since, as pointed out by the reviewer in Point 6, high GAMT levels were reported for astrocytes and oligodendrocytes (Schmidt et al. 2004; Rosko et al. 2023). As reported, other neuromodulators such as taurine can be released from astrocytes (Philibert, Rogers, and Dutton 1989) or slices (Saransaari and Oja 2006) in a Ca2+-independent manner. In addition, in the absence of potassium stimulation, Ca2+ depletion led to increased release of taurine in cultured astrocytes (Takuma et al. 1996) or in striatum in vivo (Molchanova, Oja, and Saransaari 2005). Similarly, in SLC6A8 KO slices, Ca2+ depletion (Figure 5G) also increased creatine baseline levels as compared to that in normal ACSF (Figure 5D). Another possibility is that Ca2+-independent Cr release occurs in neurons lacking SLC6A8 expression.

      As mentioned in the paper, the data shown in Figure 5D were obtained in the presence of Ca2+. The reduction of Ca2+-dependent Cr release evoked by potassium in SLC6A8-/Y (Figure 5G) may be due to decreased Cr baseline levels in the presence of Ca2+ and reduced Cr in synaptic vesicles (Figure 5D).

      4) Cr levels are strongly reduced in Agat-/- (Figure 6B). However, KCl-induced Cr release persists after loss of AGAT (Figure 6B). These data do not support that Cr release is Agat dependent.

      Although KCl-induced Cr release persisted in AGAT-/- mutants, it dropped to 11.6% of that in WT mice (Figure 6B). AGAT is not directly involved in the release, but is required for providing sufficient Cr.

      5) The authors show that Cr application decreases excitability in ~1/3 of the tested neurons (Figure 7). How were responders and non-responders defined? What justifies this classification? The data for all Cr-treated cells should be pooled. Are there indeed two distributions (responders/non-responders)? Running statistics on pre-selected groups (Figure 7H-J) is meaningless. Given that the effects could be seen 2-8 minutes after Cr application - at what time points were the data shown in Figure 7E-J collected? Is the Cr group shown in Figure 7F significantly different from the control group/wash?

      The responders were defined by three criteria: (1) When Cr was applied, the rheobase was increased as compared to both control and wash conditions. (2) The number of total evoked spikes was lower during Cr application than in both control and wash. (3) The number of total evoked spikes was decreased by at least 10% relative to control or wash.
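
      The three criteria can be written as a simple decision rule. The following is a minimal sketch, not the analysis code used for the paper: the `Cell` record, its field names and the `is_responder` function are hypothetical, and criterion (3) is read here as "a drop of at least 10% relative to either control or wash", which is an assumption about the intended meaning.

```python
from dataclasses import dataclass


@dataclass
class Cell:
    """Per-neuron measurements in the three conditions (hypothetical layout)."""
    rheobase_control: float  # pA
    rheobase_cr: float       # pA, during Cr application
    rheobase_wash: float     # pA
    spikes_control: int      # total evoked spikes
    spikes_cr: int
    spikes_wash: int


def is_responder(c: Cell, min_drop: float = 0.10) -> bool:
    # (1) Rheobase is higher during Cr than in both control and wash.
    rheobase_up = (c.rheobase_cr > c.rheobase_control
                   and c.rheobase_cr > c.rheobase_wash)
    # (2) Fewer evoked spikes during Cr than in both control and wash.
    spikes_down = (c.spikes_cr < c.spikes_control
                   and c.spikes_cr < c.spikes_wash)
    if not (rheobase_up and spikes_down):
        return False
    # (3) Spike count drops by at least min_drop relative to control or wash.
    # Assumption: meeting the threshold against either reference is enough.
    # Both denominators are positive here because criterion (2) holds.
    drop_vs_control = (c.spikes_control - c.spikes_cr) / c.spikes_control
    drop_vs_wash = (c.spikes_wash - c.spikes_cr) / c.spikes_wash
    return max(drop_vs_control, drop_vs_wash) >= min_drop
```

      For instance, a cell with rheobase 100/150/100 pA and 50/40/48 spikes (control/Cr/wash) would be classed as a responder under this reading, whereas a cell whose rheobase does not change would not.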

      For all the individual responders, when Cr was applied, the rheobase was increased (Figure 7E and 7F). In individual non-responders, by contrast, the rheobase following Cr application was either identical to both control and wash (n=19/35), identical to either control or wash (n=11/35), between control and wash (n=2/35) or smaller than both control and wash (n=3/35). Thus, the responders and non-responders were separable. When the rheobase data were pooled, many points overlapped, so we did not pool the data here.

      As suggested, we pooled the ratio of spike changes in response to 100 μM Cr application for all neurons (Author response image 4). Evoked spikes of non-responders typically (34/35) changed within the range of -10% to 10%.

      Author response image 4.

      Relative changes of total evoked spikes in response to 100 μM Cr. Responders are represented by red dots and non-responders by black dots. Dashed black line indicates 10%. Relative change = (Cr - (Control + Wash)/2) / ((Control + Wash)/2) × 100%.
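
      The relative-change formula in this legend can be expressed in a few lines. This is an illustrative sketch only; the function name and the example spike counts are made up for demonstration and are not data from the study.

```python
def relative_change_percent(control: float, cr: float, wash: float) -> float:
    """Percent change of the Cr condition relative to the mean of control and wash."""
    baseline = (control + wash) / 2.0
    return (cr - baseline) / baseline * 100.0


# Hypothetical spike counts (control, Cr, wash) for two illustrative cells.
strong_effect = relative_change_percent(50, 40, 50)  # 20% fewer spikes than baseline
weak_effect = relative_change_percent(50, 48, 50)    # stays within the +/-10% band
print(f"{strong_effect:.1f}%  {weak_effect:.1f}%")
```

      Using the mean of control and wash as the baseline guards against slow drift across the recording, since a change that does not reverse on washout moves the baseline as well.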

      In Figure 7E-J, we collected data at time points when the maximal response was reached. The Cr group shown in Figure 7F was indeed significantly different from the control group/wash (p<0.05, paired t test, for data points collected under 75-500 pA current injection).

      6) Indirect effects: The phenotypes could be partially caused by indirect effects of perturbing the Cr/PCr/CK system, which is known to play essential roles in ATP regeneration, Ca2+ homeostasis, neurotransmission, intracellular signaling systems, axonal and dendritic transport... Similarly, high GAMT levels were reported for astrocytes (e.g., Schmidt et al. 2004; doi: 10.1093/hmg/ddh112), and changes in astrocytic Cr may underlie the phenotypes. Cr has been also reported to be an osmolyte: a hyperosmotic shock of astrocytes induced an increase in Cr uptake, suggesting that Cr can work as a compensatory osmolyte (Alfieri et al. 2006; doi: 10.1113/jphysiol.2006.115006). Potential indirect effects are also consistent with a trend towards decreased KCl-induced GABA (and Glutamate) release in SLC6A8-/Y (Figure 5C). These indirect effects may in part explain the phenotypes seen after perturbing Agat, SLC6A8, and should be thoroughly discussed.

      We discussed the possibility that creatine/phosphocreatine act as non-transmitters in the Discussion, and we added the possibility of a role for astrocytic Cr there. The trend toward decreased KCl-induced GABA (and glutamate) release in SLC6A8-/Y (Figure 5C) was not statistically significant.

      7) As stated by the authors, there is some evidence that Cr may act as a co-transmitter for GABAA receptors (although only at high concentrations). Would a GABAA blocker decrease the fraction of cells with decreased excitability after Cr exposure?

      We performed another experiment in hippocampal CA1 pyramidal neurons showing that Cr at 100 μM did not change GABAergic neurotransmission (n=8, Author response image 5). Inhibitory postsynaptic currents (IPSCs) recorded in the presence of glutamate receptor blockers (10 μM APV and 10 μM CNQX) were not changed by 100 μM creatine, in either frequency or amplitude (Author response image 5B and 5C, averaged over 1 min). These results do not support Cr activation of GABAA receptors.

      Author response image 5.

      IPSCs recorded in hippocampal CA1 pyramidal neurons. (A) Representative raw traces before (Control), during (Creatine) and after (Wash) the application of 100 μM creatine. (B&C) Group data of IPSC frequency (B) and amplitude (C), averaged over 1 min.

      8) The statement "Our results have also satisfied the criteria of Purves et al. 67,68, because the presence of postsynaptic receptors can be inferred by postsynaptic responses." (l.568) is not supported by the data and should be removed.

      We have deleted this sentence, though what could mediate postsynaptic responses other than receptors?

      Reviewer #3 (Public Review):

      SUMMARY:

      The manuscript by Bian et al. promotes the idea that creatine is a new neurotransmitter. The authors conduct an impressive combination of mass spectrometry (Fig. 1), genetics (Figs. 2, 3, 6), biochemistry (Figs. 2, 3, 8), immunostaining (Fig. 4), electrophysiology (Figs. 5, 6, 7), and EM (Fig. 8) in order to offer support for the hypothesis that creatine is a CNS neurotransmitter.

      We thank the reviewer for the summary.

      STRENGTHS:

      There are many strengths to this study.

      • The combinatorial approach is a strength. There is no shortage of data in this study.

      • The careful consideration of specific criteria that creatine would need to meet in order to be considered a neurotransmitter is a strength.

      • The comparison studies that the authors have done in parallel with classical neurotransmitters are helpful.

      • Demonstration that creatine has inhibitory effects is another strength.

      • The new genetic mutations for Slc6a8 and AGAT are strengths and potentially incredibly helpful for downstream work.

      WEAKNESSES:

      • Some data are indirect. Even though Slc6a8 and AGAT are helpful sentinels for the presence of creatine, they are not creatine themselves. Therefore, the conclusions that are drawn should be circumspect.

      SLC6A8 and AGAT mutants are not essential for Cr’s role as a neurotransmitter.

      • Regarding Slc6a8, it seems to work only as a reuptake transporter - not as a transporter into SVs. Therefore, we do not know what the transporter is.

      Indeed, SLC6A8 is only a transporter on the plasma membrane, not a transporter on synaptic vesicles. We have shown biochemical evidence here, and we have unpublished data showing other SLCs on SVs, which did not include SLC6A8.

      • Puzzlingly, Slc6a8 and AGAT are in different cells, setting up the complicated model that creatine is created in one cell type and then processed as a neurotransmitter in another.

      • No candidate receptor for creatine has been identified postsynaptically.

      • Because no candidate receptor has been identified, is it possible that creatine is exerting its effects indirectly through other inhibitory receptors (e.g., GABAergic Rs)?

      As shown in our response to Question 7 of Reviewer 2, Cr did not exert its effects through inhibitory GABAA receptors.

      • More broadly, what are the other possibilities for roles of creatine that would explain these observations other than it being a neurotransmitter? Could it simply be a modifier that exists in the SVs (lots of molecules exist in SVs)?

      We discussed the possibility of a non-transmitter role for creatine/phosphocreatine in the Discussion.

      • The biochemical studies are helpful in terms of comparing relevant molecules (e.g., Figs. 8 and S1), but the images of the westerns are all so fuzzy that there are questions about processing and the accuracy of the quantification.

      Multiple members (>4) of our group have carried out SV purifications repeatedly over the last decade, and we are highly confident of the SV purifications presented in Figs. 8 and S1.

      There are several criteria that define a neurotransmitter. The authors nicely delineated many criteria in their discussion, but it is worth it for readers to do the same with their own understanding of the data.

      By this reviewer's understanding (and the Purves' textbook definition) a neurotransmitter: 1) must be present within the presynaptic neuron and stored in vesicles; 2) must be released by depolarization of the presynaptic terminal; 3) must require Ca2+ influx upon depolarization prior to release; 4) must bind specific receptors present on the postsynaptic cell; 5) exogenous transmitter can mimic presynaptic release; 6) there exists a mechanism of removal of the neurotransmitter from the synaptic cleft.

      The six criteria appear to be required only by the reviewer. As discussed in our Discussion, Purves’ textbook did not list six criteria but only three, “the substance must be present within the presynaptic neuron; the substance must be released in response to presynaptic depolarization, and the release must be Ca2+ dependent; specific receptors for the substance be present on the postsynaptic cell” (Purves et al., 2001, 2016).

      Kandel et al. (2013, 2021) listed 4 criteria for a neurotransmitter: “it is synthesized in the presynaptic neuron; it is present within vesicles and is released in amounts sufficient to exert a defined action on the postsynaptic neuron or effector organ; when administered exogenously in reasonable concentrations it mimics the action of the endogenous transmitter; a specific mechanism usually exists for removing the substance from the synaptic cleft”.

      While we agree that any neuroscientist can have his/her own criteria, it is more reasonable to accept the textbooks that have been widely read for decades.

      For a paper to claim that the work has identified a new neurotransmitter, several of these criteria would be met - and the paper would acknowledge in the discussion which ones have not been met. For this particular paper, this reviewer finds that condition 1 is clearly met.

      Conditions 2 and 3 seem to be met by electrophysiology, but there are caveats here. High KCl stimulation is a blunt instrument that will depolarize absolutely everything in the prep all at once and could result in any number of non-specific biological reactions as a result of K+ rushing into all neurons in the prep. Moreover, the results in 0 Ca2+ are puzzling. For creatine (and for the other neurotransmitters), why is there such a massive uptick in release, even when the extracellular saline is devoid of calcium?

      To avoid the disadvantage of high KCl stimulation, we performed optogenetic experiments recently, with encouraging preliminary data. We do not know the source of Ca2+-independent release of Cr and neurotransmitters, though astrocytes are a possibility.

      Condition 4 is not discussed in detail at all. In the discussion, the authors elide the criterion of receptors specified by Purves by inferring that the existence of postsynaptic responses implies the existence of receptors. True, but does it specifically imply the existence of creatinergic receptors? This reviewer does not think that is necessarily the case. The authors should be appropriately circumspect and consider other modes of inhibition that are induced by activation or potentiation of other receptors (e.g., GABAergic or glycinergic).

      Our results did not support Cr stimulation of inhibitory GABAA receptors (see our answer to Point 7 of Reviewer 2).

      Condition 5 may be met, because the authors applied exogenous creatine and observed inhibition (Fig. 7). However, this is tough to know without understanding the effects of endogenous release of creatine. If they were to test if the absence of creatine caused excess excitation (at putative creatinergic synapses), then that would be supportive of the same.

      After the submission of our manuscript, we found a recent paper showing that slc6a8 knockout led to increased excitation in pyramidal neurons in the prefrontal cortex (PFC), with increased firing frequency (Ghirardini et al., 2023). Because we have shown that slc6a8 knockout causes a decrease of Cr in SVs (Figure 2 in our paper), this result provides the evidence described under Condition 5 by this reviewer: that a decrease of Cr in SVs leads to excess excitation.

      For condition 6, the authors made a great effort with Slc6a8. This is a very tough criterion to understand for many synapses and neurotransmitters.

      In terms of fundamental neuroscience, the story would be impactful if proven correct. There are certainly more neurotransmitters out there than currently identified.

      The impact as framed by the authors in the abstract and introduction for intellectual disability is uncertain (forming a "new basis for ID pathogenesis") and it seems quite speculative beyond the data in this paper.

      We deleted this sentence.

      Reviewer #1 (Recommendations For The Authors):

      To strengthen the manuscript, I suggest the following considerations:

      1) The key missing evidence to my mind is a receptor - but this is clearly outside the scope of this paper. Yet, I am surprised that in the list of criteria for neurotransmitters in general there is no mention of a receptor. Furthermore, many receptors have been identified through receptor agonists or antagonists, like neurotoxins or drugs. The authors do not talk about putative receptors except for a sentence in the discussion where they speculate on a GPCR. There are numerous GPCR agonists and antagonists, which may be a long-shot, or something even a bit more designed based on knowledge about creatine? I do not think the publication of this manuscript should have been made dependent on finding an agonist or antagonist of this specific unknown receptor (if it exists), but it would be good to have at least some leads on this from the authors what has been tried or what could be done? How about a manipulation of G-protein-coupled signal transduction to support the idea that there IS such a GPCR? There may be a real opportunity here to test existing compounds in wild type, the slc6a8 and agat mutants.

      We will keep trying, but accept the reality that Rome was not built in a single day and that no transmitter was proven by one single paper.

      A key new puzzle piece of evidence is the identification of creatine in synaptic vesicles. The experiment relies heavily on the purity of the SV fraction using the anti-synaptophysin antibody. I am quite sure that these preparations contain many other compartments - and of course a big mix of synaptic (and other) vesicles. Would it be possible to purify with an anti slc6a8 antibody?

      Slc6a8 is expressed on the plasma membrane of neurons (refs. 7-9), rather than on synaptic vesicles. Consistent with this, we could not detect an obvious Slc6a8-HA signal in the starting material (Lane S in Author response image 6) that was used for SV purification. We have tried to purify SVs with an anti-HA antibody in Slc6a8-HA mice, and SV markers could not be detected.

      Author response image 6.

      Lack of Slc6a8-HA in our starting material. In Slc6a8-HA knock-in mice, the HA signal was present in whole brain homogenate (H), but not obvious in supernatants (S) following centrifugation at 35000 × g. In contrast, the SV marker Syp was present in supernatants.

      The K stimulation protocol in slices is relatively crude, as all neurons in the slice get simultaneously overactivated - and some of the effects on Ca-dependent release are not very strong (e.g. the 35 neurons that were not responsive to creatine at all). A primary neuronal culture of neurons that respond to creatine would strengthen this section.

      To avoid the disadvantage of K stimulation, we also performed optogenetic experiments recently and obtained encouraging preliminary results.

      Reviewer #2 (Recommendations For The Authors):

      1) The different sections of the manuscript are not separated by headers.

      2) The beginning of the results section either does not reference the underlying literature or refers to unpublished data.

      We have kept a bit of background at the beginning of the Results section.

      3) The text contains many opinions and historical information that are not required (e.g., "It has never been easy to discover a new neurotransmitter, especially one in the central nervous system (CNS). We have been searching for new neurotransmitters for 12 years."; l. 17).

      This is a field that has been dormant for decades and such background introductions are helpful for at least some readers.

      4) Almeida et al. (2008; doi: 10.1002/syn.20280) provided evidence for electrical activity-, and Ca2+-dependent Cr release from rat brain slices. This paper should be introduced in the introduction.

      Those were stand-alone papers which have not been reproduced or paid attention to. Our introduction part did not mention them because our research did not begin with those papers. We had no idea that those papers existed when we began. We started with SV purification and only read those papers afterwards. Thus, they were not necessary background to our paper but can be discussed after we discovered Cr in SVs.

      5) Fig. 7: A Y-scale for the stimulation protocol is missing.

      Revised.

      Reviewer #3 (Recommendations For The Authors):

      The main suggestion by this reviewer (beyond the details in the public review) is to consider the full spectrum of biology that is consistent with these results. By my reading, creatine could be a neurotransmitter, but other possibilities also exist, and the authors need to highlight those too.

      We have discussed a non-transmitter role in the Discussion.

      References

      Ghirardini, E. et al. Cell-specific vulnerability to metabolic failure: the crucial role of parvalbumin expressing neurons in creatine transporter deficiency. Acta Neuropathol Commun 11, 34 (2023). https://doi.org/10.1186/s40478-023-01533-w

      Lowe, M. T., Faull, R. L., Christie, D. L. & Waldvogel, H. J. Distribution of the creatine transporter throughout the human brain reveals a spectrum of creatine transporter immunoreactivity. J Comp Neurol 523, 699-725 (2015). https://doi.org/10.1002/cne.23667

      Mak, C. S. et al. Immunohistochemical localisation of the creatine transporter in the rat brain. Neuroscience 163, 571-585 (2009). https://doi.org/10.1016/j.neuroscience.2009.06.065

      Molchanova, S. M., Oja, S. S. & Saransaari, P. Mechanisms of enhanced taurine release under Ca2+ depletion. Neurochem Int 47, 343-349 (2005). https://doi.org/10.1016/j.neuint.2005.04.027

      Philibert, R. A., Rogers, K. L. & Dutton, G. R. K+-evoked taurine efflux from cerebellar astrocytes: on the roles of Ca2+ and Na+. Neurochem Res 14, 43-48 (1989). https://doi.org/10.1007/BF00969756

      Rosko, L. M. et al. Cerebral Creatine Deficiency Affects the Timing of Oligodendrocyte Myelination. J Neurosci 43, 1143-1153 (2023). https://doi.org/10.1523/JNEUROSCI.2120-21.2022

      Saransaari, P. & Oja, S. S. Characteristics of taurine release in slices from adult and developing mouse brain stem. Amino Acids 31, 35-43 (2006). https://doi.org/10.1007/s00726-006-0290-5

      Schmidt, A. et al. Severely altered guanidino compound levels, disturbed body weight homeostasis and impaired fertility in a mouse model of guanidinoacetate N-methyltransferase (GAMT) deficiency. Hum Mol Genet 13, 905-921 (2004). https://doi.org/10.1093/hmg/ddh112

      Speer, O. et al. Creatine transporters: a reappraisal. Mol Cell Biochem 256-257, 407-424 (2004). https://doi.org/10.1023/b:mcbi.0000009886.98508.e7

      Takuma, K. et al. Ca2+ depletion facilitates taurine release in cultured rat astrocytes. Jpn J Pharmacol 72, 75-78 (1996). https://doi.org/10.1254/jjp.72.75

    1. Author Response

      The following is the authors’ response to the previous reviews.

      eLife assessment

      This valuable paper examines gene expression differences between male and female individuals over the course of flower development in the dioecious angiosperm Trichosanthes pilosa. The authors show that male-biased genes evolve faster than female-biased and unbiased genes. This is frequently observed in animals, but this is the first report of such a pattern in plants. In spite of the limited sample size, the evidence is mostly solid and the methods appropriate for a non-model organism. The resources produced will be used by researchers working in the Cucurbitaceae, and the results obtained advance our understanding of the mechanisms of plant sexual reproduction and its evolutionary implications: as such they will broadly appeal to evolutionary biologists and plant biologists.

      Public Reviews:

      Reviewer #1 (Public Review):

      The evolution of dioecy in angiosperms has significant implications for plant reproductive efficiency, adaptation, evolutionary potential, and resilience to environmental changes. Dioecy allows for the specialization and division of labor between male and female plants, where each sex can focus on specific aspects of reproduction and allocate resources accordingly. This division of labor creates an opportunity for sexual selection to act and can drive the evolution of sexual dimorphism.

      In the present study, the authors investigate sex-biased gene expression patterns in juvenile and mature dioecious flowers to gain insights into the molecular basis of sexual dimorphism. They find that a large proportion of the plant transcriptome is differentially regulated between males and females with the number of sex-biased genes in floral buds being approximately 15 times higher than in mature flowers. The functional analysis of sex-biased genes reveals that chemical defense pathways against herbivores are up-regulated in the female buds along with genes involved in the acquisition of resources such as carbon for fruit and seed production, whereas male buds are enriched in genes related to signaling, inflorescence development and senescence of male flowers. Furthermore, the authors implement sophisticated maximum likelihood methods to understand the forces driving the evolution of sex-biased genes. They highlight the influence of positive and relaxed purifying selection on the evolution of male-biased genes, which show significantly higher rates of non-synonymous to synonymous substitutions than female or unbiased genes. This is the first report (to my knowledge) highlighting the occurrence of this pattern in plants. Overall, this study provides important insights into the genetic basis of sexual dimorphism and the evolution of reproductive genes in Cucurbitaceae.

      Reviewer #2 (Public Review):

      Summary:

      This study uses transcriptome sequence from a dioecious plant to compare evolutionary rates between genes with male- and female-biased expression and distinguish between relaxed selection and positive selection as causes for more rapid evolution. These questions have been explored in animals and algae, but few studies have investigated this in dioecious angiosperms, and none have so far identified faster rates of evolution in male-biased genes (though see Hough et al. 2014 https://doi.org/10.1073/pnas.1319227111).

      Strengths:

      The methods are appropriate to the questions asked. Both the sample size and the depth of sequencing are sufficient, and the methods used to estimate evolutionary rates and the strength of selection are appropriate. The data presented are consistent with faster evolution of genes with male-biased expression, due to both positive and relaxed selection.

      This is a useful contribution to understanding the effect of sex-biased expression in genetic evolution in plants. It demonstrates the range of variation in evolutionary rates and selective mechanisms, and provides further context to connect these patterns to potential explanatory factors in plant diversity such as the age of sex chromosomes and the developmental trajectories of male and female flowers.

      Weaknesses:

      The presence of sex chromosomes is a potential confounding factor, since there are different evolutionary expectations for X-linked, Y-linked, and autosomal genes. Attempting to distinguish transcripts on the sex chromosomes from autosomal transcripts could provide additional insight into the relative contributions of positive and relaxed selection.

      Reviewer #3 (Public Review):

      The potential for sexual selection and the extent of sexual dimorphism in gene expression have been studied in great detail in animals, but hardly examined in plants so far. In this context, the study by Zhao, Zhou et al. represents a welcome addition to the literature.

      Relative to the previous studies in Angiosperms, the dataset is interesting in that it focuses on reproductive rather than somatic tissues (which makes sense to investigate sexual selection), and includes more than a single developmental stage (buds + mature flowers). Some aspects of the presentation have been improved in this new version of the manuscript.

      Specifically:

      • the link between sex-biased and tissue-biased genes is now slightly clearer,

      • the limitation related to the de novo assembled transcriptome is now formally acknowledged,

      • the interpretation of functional categories of the genes identified is more precise,

      • the legends of supplementary figures have been improved,

      • a large number of typos have been fixed.

      in response to this first round of reviews. As I detail below, many of the relevant and constructive suggestions by the previous reviewers were not taken into account in this revision.

      For instance:

      • Reviewer 2 made precise suggestions for trying to take into account the potential confounding factor of sex-chromosomes. This suggestion was not followed.

      For the question of reviewer 2:

      The presence of sex chromosomes is a potential confounding factor, since there are different evolutionary expectations for X-linked, Y-linked, and autosomal genes. Attempting to distinguish transcripts on the sex chromosomes from autosomal transcripts could provide additional insight into the relative contributions of positive and relaxed selection.

      Empirically, the analyses could be expanded by an attempt to distinguish between genes on the autosomes and the sex chromosomes. Genotypic patterns can be used to provisionally assign transcripts to XY or XX-like behavior when all males are heterozygous and all females are homozygous (fixed X-Y SNPs) and when all females are heterozygous and males are homozygous (lost or silenced Y genes). Comparing such genes to autosomal genes with sex-biased expression would sharpen the results because there are different expectations for the efficacy of selection on sex chromosomes. See this paper (Hough et al. 2014; https://www.pnas.org/doi/abs/10.1073/pnas.1319227111), which should be cited and does in fact identify faster substitution rates in Y-linked genes.
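
      The genotype-based assignment described here can be sketched as follows. This is an illustrative outline under simplifying assumptions, not the pipeline of Hough et al. (2014): each sample is assumed to have already been called as heterozygous (`"het"`) or homozygous (`"hom"`) at a given SNP, and the function name and category labels are hypothetical.

```python
from typing import Dict, List


def classify_snp(genotypes: Dict[str, str],
                 males: List[str],
                 females: List[str]) -> str:
    """Provisionally label one SNP from per-sample 'het'/'hom' calls."""
    male_calls = {genotypes[s] for s in males}
    female_calls = {genotypes[s] for s in females}
    # All males heterozygous, all females homozygous: fixed X-Y difference.
    if male_calls == {"het"} and female_calls == {"hom"}:
        return "xy_like"
    # All females heterozygous, all males homozygous: lost or silenced Y copy.
    if male_calls == {"hom"} and female_calls == {"het"}:
        return "y_lost_like"
    # Anything else is treated as autosomal-like or unresolved.
    return "unresolved"


males = ["m1", "m2", "m3"]
females = ["f1", "f2", "f3"]
calls = {"m1": "het", "m2": "het", "m3": "het",
         "f1": "hom", "f2": "hom", "f3": "hom"}
print(classify_snp(calls, males, females))
```

      With few individuals per sex, a strict "all males / all females" rule will occasionally be satisfied by autosomal SNPs by chance, so assignments of this kind are best treated as provisional.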

      Authors’ response: We have cited Hough et al. (2014) and Sandler et al. (2018) in the revised manuscript. We agree that the presence of sex chromosomes is potentially a confounding factor. By adopting the methods of Hough et al. (2014) and Sandler et al. (2018), we tried to distinguish transcripts on sex chromosomes from those on autosomes. Of a total of 2,378 unbiased genes, 36 were putatively sex-chromosomal: 20 were exclusively heterozygous in males and exclusively homozygous in females, while the other 16 showed the opposite genotyping pattern. Of the 343 male-biased genes, only three exhibited a potentially sex-linked pattern. Of the 1,145 female-biased genes, we identified 19 that might be located on the sex chromosomes: five were exclusively heterozygous in males and exclusively homozygous in females, while the other 14 showed the reversed genotyping pattern. Thus, sex-linked genes may contribute relatively little to the rapid evolution of male-biased genes. An alternative explanation is that these results are unreliable due to small sample sizes. We therefore did not describe them in the Results section. We will investigate the issue when whole-genome sequences and population datasets become available in the near future.
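
      To put these counts on a common scale, the proportion of putatively sex-linked genes in each expression category can be computed from the numbers given in this response. The snippet is a simple arithmetic sketch; the variable names are hypothetical.

```python
# (putatively sex-linked genes, total genes) per category, as stated in the response.
counts = {
    "unbiased": (36, 2378),
    "male-biased": (3, 343),
    "female-biased": (19, 1145),
}

percent = {
    category: 100.0 * sex_linked / total
    for category, (sex_linked, total) in counts.items()
}

for category, value in percent.items():
    sex_linked, total = counts[category]
    print(f"{category}: {sex_linked}/{total} = {value:.2f}%")
```

      All three categories fall below 2%, with the male-biased set lowest at under 1%, although with only three genes in that set the estimate is necessarily imprecise.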

      • Reviewer 1 & 3 indicated that results were mentioned in the discussion section without having been described before. This was not fixed in this new version.

      For the question of reviewer 1:

      2) Paragraph (407-416) describes the analysis of duplicated genes under relaxed selection but there is no mention of this in the results.

      Authors’ response: Following this suggestion, in the Results section, we have added a sentence, “We also found that most of them were members of different gene families generated by gene duplication (Table S13)” on line 310-311 in the revised manuscript (Rapid_evolution_of_malebiased_genes_Trichosanthes_pilosa_Tracked_change_2023_11_06.docx).

      For the question of reviewer 1:

      38- line 417-424. The discussion should not contain new results.

      Authors’ response: Thank you for pointing this out. In the Results section, we have added a few sentences as follows: “Similarly, given that dN/dS values of sex-biased genes were higher due to codon usage bias, lower dS rates would be expected in sex-biased genes relative to unbiased genes (Ellegren & Parsch, 2007; Parvathy et al., 2022). However, in our results, the median of dS values in male-biased genes were much higher than those in female-biased and unbiased genes in the results of ‘free-ratio’ (Fig. S4A, female-biased versus male-biased genes, P = 6.444e-12 and male-biased versus unbiased genes, P = 4.564e-13) and ‘two-ratio’ branch model (Fig. S4B, female-biased versus male-biased genes, P = 2.2e-16 and male-biased versus unbiased genes, P = 9.421e-08, respectively).” on lines 323-331, and consequently removed the following phrases, “female-biased vs male-biased genes, P = 6.444e-12 and male-biased vs unbiased genes, P = 4.564e-13” and “female-biased versus male-biased genes, P = 2.2e-16 and male-biased versus unbiased genes, P = 9.421e-08, respectively”, from the Discussion section.

      • Reviewer 1 asked for a comparison between the number of de novo assembled unigenes in this transcriptome and the number of genes in other Cucurbitaceae species. I could not see this comparison reported.

      Authors’ response: In the first revision, we described only percentages. We have now added the number of genes. We modify this part as follows: “The majority of unigenes were annotated by homologs in species of Cucurbitaceae (61.6%, 36,375), including Momordica charantia (16.3%, 9,625), Cucumis melo (11.9%, 7,027), Cucurbita pepo (11.9%, 7,027), Cucurbita moschata (11.5%, 6,791), Cucurbita maxima (10.1%, 5,964) and other species (38.4%, 22,676) (Fig. S1C).”.

      • Reviewer 1 pointed out that permutation tests were more appropriate, but no change was made to the manuscript.

      Authors’ response: Thank you for your suggestion. In the first revision, we responded to this issue only indirectly. The Wilcoxon rank sum test is commonly used in the literature for comparisons between sex-biased and unbiased genes. In addition, we tested the datasets using permutation t-tests, which gave results consistent with those of the Wilcoxon rank sum test. For example, we found significant differences in ω values only in floral buds, in the results of both the ‘free-ratio’ model (female-biased versus male-biased genes, P = 0.04282 and male-biased versus unbiased genes, P = 0.01114) and the ‘two-ratio’ model (female-biased versus male-biased genes, P = 0.01992 and male-biased versus unbiased genes, P = 0.02127, respectively). We also described these results in the Results section accordingly (lines 278-284).
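
      For readers unfamiliar with the approach, a two-sample permutation test can be sketched as below. This is a simplified illustration, not the authors' procedure: it uses the absolute difference in means as the test statistic (a permutation t-test would use the t statistic instead), and the function name, number of permutations, and seed are assumptions made for the example.

```python
import random
from statistics import mean
from typing import Sequence


def permutation_p_value(x: Sequence[float], y: Sequence[float],
                        n_perm: int = 10000, seed: int = 0) -> float:
    """Two-sided permutation test on the absolute difference in group means."""
    rng = random.Random(seed)
    observed = abs(mean(x) - mean(y))
    pooled = list(x) + list(y)
    n_x = len(x)

    extreme = 0
    for _ in range(n_perm):
        # Randomly reassign the group labels by shuffling the pooled values.
        rng.shuffle(pooled)
        diff = abs(mean(pooled[:n_x]) - mean(pooled[n_x:]))
        if diff >= observed:
            extreme += 1

    # Add-one correction so the estimated P value is never exactly zero.
    return (extreme + 1) / (n_perm + 1)
```

      In this setting, `x` and `y` would hold the ω values of two gene categories (for example male-biased and unbiased genes). Because the null distribution is built from the data themselves, no distributional assumption is needed, which is the usual argument for preferring it when gene sets are small.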

      • Reviewer 3 pointed out the small sample size (both for the RNA-seq and the phylogenetic analysis), but again this limitation is not acknowledged very clearly.

      Authors’ response: We apologize for the omission and acknowledge that our sample size is relatively small. In the revised version, we have added the following sentence to the Discussion section: “Additionally, our sample size is relatively small, and may provide low power to detect differential expression.”

      • Reviewer 1 & 3 pointed out that Fig 3 was hard to understand and asked for clarifications that I did not see in the text and the figure is unchanged.

      Authors’ response: Thank you for your suggestions. We have revised the manuscript to clarify the meaning of the acronym (F1TGs, F2TGs, M1TGs, M2TGs, F1BGs, F2BGs, M1BGs and M2BGs) and presented the number of genes. We have added two labels, indicating that panels A and B correspond to males and C and D to females in Fig. 3.

      • Reviewer 3 suggested to combine all genes with sex-bias expression when evaluating the evolutionary rate, in addition to the analyses already done. This suggestion was not followed.

      For the question of reviewer 3:line 196 and following: In these analyses, I could not understand the rationale for keeping buds vs mature flowers as separate analyses throughout. Why not combine both and use the full set of genes showing sex-bias in any tissue? This would increase the power and make the presentation of the results a lot more straightforward.

      Authors’ response: Thank you for your suggestions. In the first revision, we tried to respond to these issues. First, we observed strong sexual dimorphism in floral buds, such as racemose versus solitary, early-flowering versus late-flowering. Second, as you pointed out earlier, “the dataset is interesting in that it focuses on reproductive rather than somatic tissues (which makes sense to investigate sexual selection), and includes more than a single developmental stage (buds + mature flowers)”, and we fully agree with you on this point. Third, following your suggestion, we combined all genes with sex-biased expression to evaluate the evolutionary rates. We found significant differences (please see the figure below) in ω values in the results of the ‘free-ratio’ model (female-biased versus male-biased genes, P = 0.005622 and male-biased versus unbiased genes, P = 0.001961) and the ‘two-ratio’ model (female-biased versus male-biased genes, P = 0.008546 and male-biased versus unbiased genes, P = 0.009831, respectively) using the Wilcoxon rank sum test. However, the significance is lower than in the previous results for floral buds because the sex-biased genes of mature flowers were included, especially for the ‘free-ratio’ model. We also tested the combined set of sex-biased genes using permutation t-tests. Unfortunately, there were no significant differences in ω values except for male-biased versus unbiased genes in the results of the ‘free-ratio’ model (P = 0.03034) and the ‘two-ratio’ model (P = 0.0376). To a certain extent, combining all genes with sex-biased expression may mask the signal of rapid evolution of sex-biased genes in floral buds. Therefore, these results are not described in our manuscript. In the near future, we would like to investigate further, using more developmental stages of flowers and new technologies (e.g. single-cell methods; see Murat et al., 2023) in each sex, to consolidate the conclusion, and we hope to find more meaningful results.

      Author response image 1.

      • Reviewer 3 pointed out that hand-picking specific categories of genes was not statistically valid, and in fact not necessary in the present context. This was not changed.

      For the question of reviewer3: removing genes on a post-hoc basis seems statistically suspicious to me. I don't think your analysis has enough power to hand-pick specific categories of genes, and it is not clear what this brings here. I suggest simply removing these analyses and paragraphs.

      Authors’ response: Thank you for your suggestions. We have changed them accordingly. We removed a part of the following paragraph, “To confirm the contributions of positive selection and relaxed selection to rapid rates of male-biased genes in floral buds, we generated three datasets of OGs by excluding different sets of genes. Specifically, we excluded 18 relaxed selective male-biased genes (5.23%), 98 positively selected male-biased genes (28.57%), and 112 male-biased genes (32.65%) under positive and relaxed selection from 343 OGs (Fig. S4). We observed that after excluding male-biased genes under relaxed purifying selection, the median (0.264) decreased by 0.34% compared to the median (0.265) of all OGs (Fig. S4A-B). However, after excluding positively selected male-biased genes, the median (0.236) was reduced by 11% (Fig. S4A, C) in the results of ‘free-ratio’ branch model. This pattern was consistent with the results of ‘two-ratio’ branch model as well (Fig. S4E-G).” on line 290 to 300.

      However, we kept the following paragraph, “We also analyzed female-biased and unbiased genes that underwent positive and relaxed selection in floral buds (Tables S6-S10). We identified 216 (18.86%) positively selected, and 69 (6.03%) relaxed selective female-biased genes from 1,145 OGs, respectively. Similarly, we found 436 (18.33%) positively selected, and 43 (1.81%) unbiased genes under relaxed selection from 2,378 OGs, respectively. Notably, male-biased genes have a higher proportion (10%) of positively selected genes compared to female-biased and unbiased genes. However, relaxed selective male-biased genes have a higher proportion (3.24%) than unbiased genes, but about 0.8% lower than that of female-biased genes.”. In this way, we can compare, in the Discussion section, the proportions of female-biased, unbiased and male-biased genes in floral buds that have undergone positive selection and relaxed selection.

      • Reviewer 1 asked for all data to be public, but I could not find in the manuscript where the link to the data on ResearchGate was provided.

      Authors’ response: We have added a link in the Data Availability section.

      • Reviewers 1 & 3 pointed out that since only two tissues were compared, the claims on pleiotropy should have been toned down, but no change was made to the text.

      Authors’ response: Thank you for your suggestions. We revised “due to low pleiotropic constraints” to “due to low evolutionary constraints” and revised “low pleiotropy” to “low constraints”.

      • Reviewer 1 asked for a clarification on which genes are plotted on the heatmap of Fig3C and an explanation of the color scale. No change was made.

      Authors’ response: Sorry for the confusion. Reviewer 1 actually asked, “Fig. 2C, which genes are plotted on the heatmap and what is the color scale corresponding to?” We addressed this in the previous revision (see Fig. 2, Sex-biased gene expression for floral buds and flowers at anthesis in males and females of Trichosanthes pilosa). The heatmap shows the sex-biased genes (the union of sex-biased genes in F1, M1, F2 and M2). The color gradient represents gene expression from high (red) to low (green).

      • Reviewer 1 asked for panel B in Fig S5 and S6 to be removed. They are still there. They asked for abbreviations to be explained in the legend of Fig S8. This was not done. They asked for details about column headers. Such details were not added. They asked for more recent references on line 53-56: this was not done.

      Authors’ response: We have removed panel B in Fig. S5 and S6. We explained abbreviations in text and Fig. S8. We added more details about the column headers in Supplementary Table S4, S5, S6, S7, S8, S9 and S10. We also added more recent references on line 53-56.

      Recommendations for the authors:

      Reviewer #3 (Recommendations For The Authors):

      Authors’ response: Thank you for your suggestions. We have revised/fixed these issues following your concerns and suggestions.

      Line 46-48 would be clearer as « Sexual dimorphism is the condition where sexes of the same species exhibit different morphological, ecological and physiological traits in gonochoristic animals and dioecious plants, despite male and female individuals sharing the same genome except for sex chromosomes or sex-determining loci »

      Authors’ response: Thanks. We have revised it accordingly.

      Line 50: replace «in both » by «between the two »

      Authors’ response: We have revised it.

      Line 51: « genes exclusively » -> « genes expressed exclusively »

      Authors’ response: We have revised it.

      Line 58: « in many animals » -> « in several animal species »

      Authors’ response: We have revised it to “in some animal species”.

      Line 58: « to which » -> « of this bias »

      Authors’ response: We have revised it.

      Line 64: « Most dioecious plants possess homomorphic sex-chromosomes that are roughly similar in size when viewed by light microscopy. » : a reference is missing

      Authors’ response: We have added the reference.

      Line 67: remove « that »

      Authors’ response: We have revised it.

      line 96: change to: « only the five above-mentioned studies »

      Authors’ response: We have revised it.

      Line 97: remove « the »

      Authors’ response: We have revised it.

      Line 111: « Drosophia » -> Drosophila

      Authors’ response: We have revised it.

      Line 114: exhibiting -> « exhibited »

      Authors’ response: We have revised it.

      Line 115: suggest -> « suggesting »

      Authors’ response: We have revised it.

      Line 117: « studies in plants have rarely reported elevated rates of sex-biased genes » : is it « rarely » or « never » ?

      Authors’ response: We have revised it to “never”.

      Line 143: « It’s » -> « Its »

      Authors’ response: We have revised it.

      Line 143-146: say whether the male parts (e.g. anthers) are still present in females flowers, and the female parts (pistil+ ovaries) in the male flowers, or whether these respective organs are fully aborted.

      Authors’ response: We have added the following sentence, “The male parts (e.g., anthers) of female flowers, and the female parts (e.g., pistil and ovaries) of male flowers are fully aborted” in lines 148-150 of the Introduction section.

      Line 158: this is now clearer, but please specify whether you are talking about 12 floral buds in total, or 12 per individual (i.e. 72 buds in total).

      Authors’ response: We have revised it to “Using whole transcriptome shotgun sequencing, we sequenced floral buds and flowers at anthesis from female and male of dioecious T. pilosa. We set up three biological replicates from three female and three male plants, including 12 samples in total (six floral buds and six flowers at anthesis)”.

      Line 194-198: These sentences are unclear and hard to link to the figure. Consider changing for « In male plants, the number of tissue-biased genes in flowers at anthesis (M2TGs: n = 2795) was higher than that in floral buds (M1TGs: n = 1755, Fig. 3A and 3B). » Figure 3 is also very hard to read. Adding a label on the side to indicate that panels A and B correspond to male-biased genes and C and D to female-biased genes could be useful.

      Authors’ response: Thank you for your suggestions. We have revised the text to clarify the meaning of the acronym (F1TGs, F2TGs, M1TGs, M2TGs, F1BGs, F2BGs, M1BGs and M2BGs) and presented the number of genes. We have added two labels, indicating that panels A and B correspond to males and C and D to females in Figure 3.

      Line 208: explain the approach: e.g. « We then compared rates of protein evolution among male-biased, female-biased and unbiased genes. To do this, we sequenced floral bud transcriptomes from the closely related T. anguina, as well as two more distant outgroups, T. kirilowii and Luffa cylindrica. T. kirilowii is a dioecious species like T. pilosa, and the other two are monoecious. We identified one-to-one orthologous groups (OGs) for 1,145 female-biased, 343 male-biased, and 2,378 unbiased genes. »

      Authors’ response: We have revised this paragraph to the following, “We compared rates of protein evolution among male-biased, female-biased and unbiased genes in four species with phylogenetic relationships (((T. anguina, T. pilosa), T. kirilowii), Luffa cylindrica), including dioecious T. pilosa, dioecious T. kirilowii, monoecious T. anguina in Trichosanthes, together with monoecious Luffa cylindrica. To do this, we sequenced transcriptomes of T. pilosa. We also collected transcriptomes of T. kirilowii, as well as genomes of T. anguina and Luffa cylindrica.”

      Line 220: « the same ω value was in all branches » -> « all branches are constrained to have the same ω value ».

      Authors’ response: We have revised it.

      Line 221: « results of the 'two-ratio' branch model ... »

      Authors’ response: We have revised it.

      Line 235: add a few words to explain why the effect size is bigger than for buds, but still is not significant: e.g. «possibly because of limited statistical power due to the low number of sex-biased genes in flowers at anthesis »

      Authors’ response: We have revised this to “However, there is no statistically significant difference in the distribution of ω values using Wilcoxon rank sum tests for female-biased versus male-biased genes (P = 0.0556), female-biased versus unbiased genes (P = 0.0796), and male-biased versus unbiased genes (P = 0.3296) possibly because of limited statistical power due to the low number of sex-biased genes in flowers at anthesis.” in line 260-261.

      Line 255: explain in plain English what the « A model » is. This was already requested in the previous version.

      Authors’ response: We have revised “A model” to “classical branch-site model A”.

      Line 258: explain in plain English what the « foreground 2b ω value » corresponds to

      Authors’ response: We have revised “foreground 2b ω value” to “foreground ω > 1”. We also added the sentence “The classical branch-site model assumes four site classes (0, 1, 2a, 2b), with different ω values for the foreground and background branches. In site classes 2a and 2b, the foreground branch undergoes positive selection when there is ω > 1.” in lines 624-627.
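
      The four site classes mentioned in this response can be written out explicitly. The sketch below follows the parameterisation of branch-site model A as it is commonly described for PAML's codeml (free parameters p0, p1, ω0 < 1 and ω2 ≥ 1); the function name and the dictionary layout are assumptions made for illustration, and the numerical values a user passes in are placeholders rather than estimates from the manuscript.

```python
from typing import Dict


def branch_site_model_a(p0: float, p1: float,
                        omega0: float, omega2: float) -> Dict[str, Dict[str, float]]:
    """Return proportion and background/foreground omega for each site class."""
    if not 0.0 < omega0 < 1.0:
        raise ValueError("omega0 must lie in (0, 1): purifying selection")
    if omega2 < 1.0:
        raise ValueError("omega2 must be >= 1: foreground-only class")
    if p0 < 0.0 or p1 < 0.0 or p0 + p1 <= 0.0 or p0 + p1 > 1.0:
        raise ValueError("p0 and p1 must be non-negative with 0 < p0 + p1 <= 1")

    remainder = 1.0 - p0 - p1  # total proportion of classes 2a and 2b
    return {
        # Class 0: purifying selection on every branch.
        "0": {"proportion": p0, "background": omega0, "foreground": omega0},
        # Class 1: neutral evolution on every branch.
        "1": {"proportion": p1, "background": 1.0, "foreground": 1.0},
        # Class 2a: purifying on background, omega2 on the foreground branch.
        "2a": {"proportion": remainder * p0 / (p0 + p1),
               "background": omega0, "foreground": omega2},
        # Class 2b: neutral on background, omega2 on the foreground branch.
        "2b": {"proportion": remainder * p1 / (p0 + p1),
               "background": 1.0, "foreground": omega2},
    }
```

      Classes 2a and 2b share the same foreground ω2 and differ only in the background branches, which is why the revised wording “foreground ω > 1” covers both.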

      Line 259: explain how these different approaches complement each other rather than being redundant. This was also already requested in the previous version.

      Authors’ response: Sorry. We have now revised it as follows, “As a complementary approach, we utilized the aBSREL and BUSTED methods that are implemented in HyPhy v.2.5 software, which avoids false positive results by classical branch-site models due to the presence of rate variation in background branches, and detected significant evidence of positive selection.” in line 292-295.

      Line 270: remove « dramatically », and also remove « or eliminated at both gene-wide and genomewide levels », as well as « relative to positive selection »

      Authors’ response: Thank you for your suggestions. We have revised it.

      Line 290-309: remove this section - this was already pointed out in the previous reviews as a « ad hoc » procedure, and this point has already been made clear with the RELAX analysis.

      Authors’ response: Thank you for your suggestions. We revised this section accordingly. We removed the following paragraph, “To confirm the contributions of positive selection and relaxed selection to rapid rates of male-biased genes in floral buds, we generated three datasets of OGs by excluding different sets of genes. Specifically, we excluded 18 relaxed selective male-biased genes (5.23%), 98 positively selected male-biased genes (28.57%), and 112 male-biased genes (32.65%) under positive and relaxed selection from 343 OGs (Fig. S4). We observed that after excluding male-biased genes under relaxed purifying selection, the median (0.264) decreased by 0.34% compared to the median (0.265) of all OGs (Fig. S4A-B). However, after excluding positively selected male-biased genes, the median (0.236) was reduced by 11% (Fig. S4A, C) in the results of ‘free-ratio’ branch model. This pattern was consistent with the results of ‘two-ratio’ branch model as well (Fig. S4E-G).” on lines 334-344.

      However, we kept the other parts, “We also analyzed female-biased and unbiased genes that underwent positive and relaxed selection in floral buds (Tables S6-S10). We identified 216 (18.86%) positively selected, and 69 (6.03%) relaxed selective female-biased genes from 1,145 OGs, respectively. Similarly, we found 436 (18.33%) positively selected, and 43 (1.81%) unbiased genes under relaxed selection from 2,378 OGs, respectively. Notably, male-biased genes have a higher proportion (10%) of positively selected genes compared to female-biased and unbiased genes. However, relaxed selective male-biased genes have a higher proportion (3.24%) than unbiased genes, but about 0.8% lower than that of female-biased genes.”. In this way, we can compare, in the Discussion section, the proportions of female-biased, unbiased and male-biased genes in floral buds that have undergone positive selection and relaxed selection.

      Line 348: Here you talk about « Numerous studies », but then only report three studies. Please clarify.

      Authors’ response: Thank you for your suggestions. We have revised it to “Several studies”.

      Line 352: Cut the sentence: « In contrast, the wind-pollinated dioecious plant Populus balsamifera ... »

      Authors’ response: Thank you for your suggestions. We have revised it.

      Line 357: « In contrast to the above studies... »: If I understand correctly, this is not in contrast to the observation in Populus balsamifera. Please clarify.

      Authors’ response: Thank you for your suggestions. We have revised it to “Similar to the above study of Populus balsamifera.”.

      Line 420: « our results » -> « we »; « that underwent » -> « undergoing »

      Authors’ response: Thank you for your suggestions. We have revised it.

      Figure 3 is very hard to read and poorly labeled (see my comments on line 194 above). It is also hard to link to the text, since the numbers reported in the text are actually not present in the figure unless the readers makes some calculations themselves. This should be improved. Also, the use of acronyms (e.g. M1BG, F2TG etc.) contributes to making the text very difficult to read. The acronyms should at least be explained very clearly in the text when they are used.

      Authors’ response: Thank you for your suggestions. We have revised the text to clarify the meaning of the acronym (F1TGs, F2TGs, M1TGs, M2TGs, F1BGs, F2BGs, M1BGs and M2BGs) and give the number of genes. We have added two labels, indicating that panels A and B correspond to males and C and D to females in Figure 3.

    1. Author response:

      The following is the authors’ response to the original reviews.

      Public Review:

      Reviewer #2 (Public Review): 

      Regarding the public review of reviewer #2, we update our answers here with the new analyses and modifications made to the manuscript.

      This manuscript is missing a direct phenotypic comparison of control cells to complement that of cells expressing RhoGEF2-DHPH at "low levels" (the cells that would respond to optogenetic stimulation by retracting); and cells expressing RhoGEF2-DHPH at "high levels" (the cells that would respond to optogenetic stimulation by protruding). In other words, the authors should examine cell area, the distribution of actin and myosin, etc in all three groups of cells (akin to the time zero data from figures 3 and 5, with a negative control). For example, does the basal expression meaningfully affect the PRG low-expressing cells before activation e.g. ectopic stress fibers? This need not be an optogenetic experiment, the authors could express RhoGEF2-DHPH without SspB (as in Fig 4G).

      Updated answer: We thank reviewer #2 for this suggestion. PRG-DHPH overexpression is known to affect the phenotype of the cell, as shown in Valon et al., 2017. In our experiments, we could not identify any evidence of a particular phenotype before optogenetic activation, apart from the area and spontaneous membrane speed already reported in our manuscript (Fig 2E and SuppFig 2). Regarding the distribution of actin and myosin, we did not observe an obvious pattern that would predict the protruding/retracting phenotype. To be more quantitative, we classified (by eye, blind to both the expression level of PRG and the future phenotype) the presence of stress fibers, the amount of cortical actin, the strength of focal adhesions, and the circularity of cells. As shown below, when these classes are binned by level of PRG expression (two levels below the threshold and two above), there is no clear determinant. Thus, we concluded that the main driver of the phenotype was the PRG basal expression rather than any particular feature of the actin cytoskeleton or cell shape.

      Author response image 1.

      Author response image 2.

      Relatedly, the authors seem to assume ("recruitment of the same DH-PH domain of PRG at the membrane, in the same cell line, which means in the same biochemical environment." supplement) that the only difference between the high and low expressors are the level of expression. Given the chronic overexpression and the fact that the capacity for this phenotypic shift is not recruitment-dependent, this is not necessarily a safe assumption. The expression of this GEF could well induce e.g. gene expression changes.

      Updated answer: We agree with reviewer #2 that there could be changes in gene expression. In the next point of this supplementary note, we had specified this by saying « that overexpression has an influence on cell state, defined as protein basal activity or concentration before activation. » We are sorry if it was not clear, and we changed this sentence in the revised manuscript (in red in the supp note).

      One of the interests of the model is that it does not require any change in absolute concentrations, besides that of the GEF. The model is intended to be minimal; it fits and explains the data with very few parameters. We do not show that there is no change in concentration, but we show that it is not necessary to invoke one. We revised a sentence in the new version of the manuscript to include this point.

      Additional answer: During the revision process, we looked for an experimental demonstration that the phenotypic switch is independent of any change in global gene expression caused by the chronic overexpression of PRG. Our idea was to start from a condition of high PRG overexpression, in which cells protrude upon optogenetic activation, and then acutely deplete PRG to see whether cells then retracted. To deplete PRG on a timescale that prevents any change in gene expression, we considered the recently developed CATCHFIRE chemical dimerizer (PMID: 37640938). We designed an experiment in which the PRG DH-PH domain was expressed in fusion with a FIRE-tag, together with the FIRE-mate fused to TOM20 and the optoPRG tool. Upon incubation with the MATCH small molecule, we should be able to recruit the overexpressed PRG to the mitochondria within minutes, thereby preventing it from forming a complex with active RhoA in the vicinity of the plasma membrane. Unfortunately, despite numerous trials, we never achieved the required conditions: we could not obtain cells with high enough expression of PRG-FIRE-tag (for the protrusive response) and low enough expression of optoPRG (for retraction upon PRG-FIRE-tag depletion). We still think this would be a nice experiment to perform, but it will require establishing a stable cell line with finely tuned expression levels of the CATCHFIRE system, which goes beyond the timeline of our present work.

      Concerning the overall model summarizing the authors' observations, they "hypothesized that the activity of RhoA was in competition with the activity of Cdc42"; "At low concentration of the GEF, both RhoA and Cdc42 are activated by optogenetic recruitment of optoPRG, but RhoA takes over. At high GEF concentration, recruitment of optoPRG lead to both activation of Cdc42 and inhibition of already present activated RhoA, which pushes the balance towards Cdc42."

      These descriptions are not precise. What is the nature of the competition between RhoA and Cdc42? Is this competition for activation by the GEFs? Is it a competition between the phenotypic output resulting from the effectors of the GEFs? Is it competition from the optogenetic probe and Rho effectors and the Rho biosensors? In all likelihood, all of these effects are involved, but the authors should more precisely explain the underlying nature of this phenotypic switch. Some of these points are clarified in the supplement, but should also be explicit in the main text. 

      Updated answer: We consider the competition between RhoA and Cdc42 as a competition between retraction due to the protein network triggered by RhoA (through ROCK-Myosin and mDia-bundled actin) and the protrusion triggered by Cdc42 (through PAK-Rac-ARP2/3-branched Actin). We made this point explicit in the main text.  

      Recommendations for the authors:

      Reviewer #1 (Recommendations For The Authors):  

      Major 

      - why this is only possible for such few cells. Can the authors comment on this in the discussion? Does the model provide any hints? 

      As said in our answer to the public comment of reviewer #1, we think that the low number of cells able to switch can be explained by two different reasons:

      (1) First, we were looking for clear inversions of the phenotype, where we could see clear ruffles in the case of the protrusion, and clear retractions in the other case. Thus, we discarded cells that showed in-between phenotypes, because we had no quantitative parameter to compare how protrusive or retractile they were. This reduced the number of switching cells.

      (2) Second, we were limited by the dynamics of the optogenetic dimer used here. Indeed, the control of the frequency was limited by the unbinding dynamics of the optogenetic dimer. The timescale of recruitment (~20s) is comparable to that of the deactivation of RhoA and Cdc42. Thus, the differences in frequency are smoothed, and we could not vary the frequency enough to increase the number of switches. Thanks to the model, we can predict that increasing the unbinding rate of the optogenetic tool (shorter dimer lifetime) should allow us to increase the number of switching cells.

      We have added a sentence in the discussion to make this second point explicit.

      - I would encourage the authors to discuss this molecular signaling switch in the context of general design principles of switches. How generalizable is this network/mechanism? Is it exclusive to activating signaling proteins or would it work with inhibiting mechanisms? Is the competition for the same binding site between activators and effectors a common mechanism in other switches? 

      The most common design principle for molecular switches is the bistable switch, which relies on a nonlinear activation (for example through cooperativity) combined with a linear deactivation. Such a design allows the switch between low and high levels. In our case, there is no need for a non-linearity, since the core mechanism is a competition between the activator and the effectors for the same binding site on active RhoA. Thus, the design principle would be closer to the notion of a minimal “paradoxical component” (PMID: 23352242) that both activates and limits signal propagation, which in our case can be thought of as a self-limiting mechanism to prevent uncontrolled RhoA activation by the positive feedback. Yet, as we show in our work, this core mechanism is not enough for the phenotypic switch to happen, since the dual activation of RhoA and Cdc42 is ultimately required for the protrusion phenotype to take over the retracting one. Given the particularity of the switch we observed here, we do not feel comfortable speculating on any general design principles in the main text, but we thank reviewer #1 for the suggestion.

      - Supplementary figures - there is a discrepancy between the figures called in the text and the supplementary files, which only include SF1-4. 

      We apologize for this error and have made the correction.

      - In the text, the authors use Supp Figure 7 to show that the phenotype could not be switched by varying the fold increase of recruitment through changing the intensity/duration of the light pulse. Aside from providing the figure, could you give an explanation or speculation of why? Does the model give any prediction as to why this could be difficult to achieve experimentally (is the range of experimentally feasible fold change of 1.1-3 too small?)? Also, could you clarify why the range is different than the 3 to 10-fold mentioned at the beginning of the results section?

      We thank the reviewer for this question, and this difference between frequency and intensity can be indeed understood in a simple manner through the model. 

      All the reactions in our model were modeled as linear reactions. Thus, at any timepoint, changing the intensity of the pulse will only change proportionally the amount of the different components (amount of active RhoA, amount of sequestered RhoA, and amount of active Cdc42). This explains why we cannot change the balance between RhoA activity and Cdc42 activity only through the pulse strength. We observed the same experimentally: when we changed the intensity of the pulses, the phenotype would be smaller/stronger, but would never switch, supporting our hypothesis on the linearity of all biochemical reactions. 
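
      The proportionality argument above can be illustrated with a minimal sketch. This is not the authors' model: it is a toy system of two independent first-order linear responses (labelled "RhoA-like" and "Cdc42-like") driven by the same pulse train, and every rate constant, the 30 s period and the pulse length are hypothetical values chosen only for illustration. Because the equations are linear and start from zero, multiplying the pulse intensity multiplies both trajectories by the same factor, so their ratio never changes.

```python
def simulate(intensity, k_on=(1.0, 0.4), k_off=(0.05, 0.02),
             dt=0.1, t_end=300.0, period=30.0, pulse_len=1.0):
    """Euler integration of two linear first-order responses to a pulse train.

    All parameter values are hypothetical; this is a toy illustration of
    linearity, not the model from the manuscript.
    """
    rho, cdc = 0.0, 0.0
    trajectory = []
    n_steps = int(round(t_end / dt))
    for i in range(n_steps):
        t = i * dt
        # Light is on for pulse_len seconds at the start of each period.
        u = intensity if (t % period) < pulse_len else 0.0
        rho += dt * (k_on[0] * u - k_off[0] * rho)  # "RhoA-like" species
        cdc += dt * (k_on[1] * u - k_off[1] * cdc)  # "Cdc42-like" species
        trajectory.append((rho, cdc))
    return trajectory


low = simulate(intensity=1.0)
high = simulate(intensity=3.0)

# Largest change in the rho/cdc ratio caused by tripling the intensity.
max_ratio_shift = max(abs(r3 / c3 - r1 / c1)
                      for (r1, c1), (r3, c3) in zip(low, high))
print("max change in ratio:", max_ratio_shift)
```

      In such a linear system, only a change in the timing of the input (for example its frequency) can alter the balance between the two responses, which is the point made in the reply.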

      On the contrary, changing the frequency has an effect, for a simple reason: the dynamics of RhoA and Cdc42 activation are not the same as the dynamics of inhibition of RhoA by the PH domain (see Figure 4). The inhibition of RhoA by the PH is almost instantaneous while the activation of RhoGTPases has a delay (set by the deactivation parameter k_2). Intuitively, increasing the frequency will lead to sustained inhibition of RhoA, promoting the protrusion phenotype. Decreasing the frequency – with a stronger pulse to keep the same amount of recruited PRG – restricts this inhibition of RhoA to the first seconds following the activation. The delayed activation of RhoA will then take over.

      We added two sentences in the manuscript to explain in greater detail the difference between intensity and frequency.

      Regarding the difference between the 1.3-3 fold and the 3 to 10 fold, the explanation is the following: the 3 to 10 fold referred to the cumulative amount of proteins being recruited after multiple activations (steady state amount reached after 5 minutes with one activation every 30s); while the 1.3-3 fold is what can be obtained after only one single pulse of activation.  

      - The transient expression achieves a large range of concentration levels which is a strength in this case. To solve the experimental difficulties associated with this, i.e. finding transfected cells at low cell density, the authors developed a software solution (Cell finder). Since this approach will be of interest for a wide range of applications, I think it would deserve a mention in the discussion part. 

      We thank the reviewer for his/her interest in this small software solution.

      We expanded the description of the tool in the Methods section. The Cell finder is also available with comments on GitHub (https://github.com/jdeseze/cellfinder) and is usable by anyone using Metamorph or Micromanager imaging software.

      Minor 

      - Can the authors describe what they mean with "cell state"? It is used multiple times in the manuscript and can be interpreted as various things. 

      We now explain what we mean by ‘cell state’ in the main text:

      “protein basal activities and/or concentrations - which we called the cell state”

      - “(from 0% to 45%, Figure 2D)", maybe add here: "compare also with Fig. 2A". 

      We completed the sentence as suggested, which clarifies the data for the readers.

      - The sentence "Given that the phenotype switch appeared to be controlled by the amount of overexpressed optoPRG, we hypothesized that the corresponding leakiness of activity could influence the cell state prior to any activation." might be hard to understand for readers unfamiliar with optogenetic systems. I suggest adding a short sentence explaining dark-state activity/leakiness before putting the hypothesis forward. 

      We changed this whole beginning of the paragraph to clarify.

      - Figure 2E and SF2A. I would suggest swapping these two panels as the quantification of the membrane displacement before activation seems more relevant in this context. 

      We thank reviewer #1 for this suggestion and we agree with it (we swapped the two panels)

      - Fig. 2B is missing the white frames in the mixed panels. 

      We are sorry for this mistake; we have corrected it in the new version.

      - In the text describing the experiment of Fig. 4G, it would again be helpful to define what the authors mean by cell state, or to state the expected outcome for both hypotheses before revealing the result.

      We clarified above what we mean by cell state, which is the basal protein activities and/or concentrations prior to optogenetic activation. We added the expectation as follows:

      To discriminate between these two hypotheses, we overexpressed the DH-PH domain alone in another fluorescent channel (iRFP) and recruited the mutated PH at the membrane. “If the binding to RhoA-GTP was only required to change the cell state, we would expect the same statistics than in Figure 2D, with a majority of protruding cells due to DH-PH overexpression. On the contrary, we observed a large majority of retracting phenotype even in highly expressing cells (Figure 4G), showing that the PH binding to RhoA-GTP during recruitment is a key component of the protruding phenotype.”

      - Figure 4H,I: "of cells that overexpress PRG, where we only recruit the PH domain" doesn't match with the figure caption. Are these two constructs in the same cell? If not please clarify the main text. 

      We agree that it was not clear. Both constructs are in the same cell, and we changed the figure caption accordingly.  

      - "since RhoA dominates Cdc42" is this concluded from experiments (if yes, please refer to the figure) or is this known from the literature (if yes, please cite). 

      The assumption that RhoA dominates Cdc42 comes from the fact that we see retraction at low PRG concentration. We assumed that RhoA is responsible for the retraction phenotype. Our assumption is based on the literature (Burridge 2004 as an example of a review, confirmed by many experiments, such as the direct recruitment of RhoA to the membrane, see Berlew 2021) and is supported by our observations of immediate increase of RhoA activity at low PRG. We modified the text to clarify it is an assumption.

      - Fig. 6G, left: is not intuitive, why are the number of molecules different to start with?

      The number of molecules is different because they represent the active molecules: increasing the amount of PRG increases the amount of active RhoA and active Cdc42. We updated the figure to clarify this point.

      - Fig. 6G, right: the y-axis label says "phenotype", maybe change it to "activity" or add a second y-axis on the right with "phenotype"?

      We updated the figure following reviewer #1's suggestion.

      - Discussion: "or a retraction in the same region" sounds like in the same cell. Perhaps rephrase to state retraction in a similar region? 

      Sorry for the confusion; we changed the sentence to make it clear: “a protrusion in the activation region when highly expressed, or a retraction in the activation region when expressed at low concentrations.”

      Typos: 

      - "between 3 and 10 fold" without s. 

      - Fig. 1H, y-axis label. 

      - "whose spectrum overlaps" with s. 

      - "it first decays, and then rises" with s. 

      - Fig 4B and Fig 6B. Is the time in sec or min? (Maybe double-check all figures). 

      - "This result suggests that one could switch the phenotype in a single cell by selecting it for an intermediate expression level of the optoPRG.". 

      - "GEF-H1 PH domain has almost the same inhibition ability as PRG PH domain". 

      We corrected all these mistakes and thank the reviewer for his careful reading of the manuscript.

      Reviewer #2 (Recommendations For The Authors): 

      Likewise, the model assumes that at high PRG GEF expression, the "reaction is happening far from saturation ..." and that "GTPases activated with strong stimuli -giving rise to strong phenotypic changes- lead to only 5% of the proteins in a GTP-state, both for RhoA and Cdc42". Given the high levels of expression (the absolute value of which is not known) this assumption is not necessarily safe to assume. The shift to Cdc42 could indeed result from the quantitative conversion of RhoA into its active state. 

      We agree with the reviewer that the hypothesis that RhoA is fully converted into its active state cannot be completely ruled out. However, we think that the two following points can justify our choice.

      - First, we see that even in the protruding phenotype, RhoA activity increases upon optoPRG recruitment (Figure 3). This means that RhoA is not completely converted into its active GTP-loaded state. The biosensor intensity rises by a factor of 1.5 after 5 minutes (and continues to increase, even if not shown here). Admittedly, this could be explained by the relocation of RhoA to the place of activation, but it still shows that cells with high PRG expression are not completely saturated in RhoA-GTP.

      - We agree that linearity (no saturation) is still a hypothesis and is very difficult to rule out, because it is not only a question of the absolute concentrations of GEFs and RhoA, but also a question of their reaction kinetics, which are unknown parameters in vivo. Yet, adding saturation would mean adding 3 unknown parameters (the absolute concentration of RhoA, as well as two reaction constants). The fact that they are not needed to fit the complex curves of RhoA, which we do with only one parameter, tends to show that the minimal ingredients representing the interaction are captured here.

      The observed "inhibition of RhoA by the PH domain of the GEF at high concentrations" could result from the ability of the probe to, upon membrane recruitment, bind to active RhoA (via its PH domain) thereby outcompeting the RhoA biosensor (Figure 4A-C). This reaction is explicitly stated in the supplemental materials ("PH domain binding to RhoA-GTP is required for protruding phenotype but not sufficient, and it is acting as an inhibitor of RhoA activity."), but should be more explicit in the main text. Indeed, even when PRG DHPH is expressed at high concentrations, it does activate RhoA upon recruitment (figure 3GH). Not only might overexpression of this active RhoA-binding probe inhibit the cortical recruitment of the RhoA biosensor, but it may also inhibit the ability of active RhoA to activate its downstream effectors, such as ROCK, which could explain the decrease in myosin accumulation (figure 3D-F). It is not clear that there is a way to clearly rule this out, but it may impact the interpretation. 

      This hypothesis is actually what we claim in the manuscript. We think that the inhibition of RhoA by the PH domain is explained by its direct binding. We may have missed what Reviewer #2 wanted to say, but we think that we state it explicitly in the main text:

      “Knowing that the PH domain of PRG triggers a positive feedback loop thanks to its binding to active RhoA 18, we hypothesized that this binding could sequester active RhoA at high optoPRG levels, thus being responsible for its inhibition.”

      And also in the Discussion:

      “However, this feedback loop can turn into a negative one for high levels of GEF: the direct interaction between the PH domain and RhoA-GTP prevents RhoA-GTP binding to effectors through a competition for the same binding site.”

      We may have not been clear, but we think that this is what is happening: the PH domain prevents the binding to effectors and decreases RhoA activity (as was shown in Chen et al. 2010).  

      The X-axis in Figure 4C time is in seconds not minutes. The Y-axis in Figure 4H is unlabeled. 

      We are sorry for the mistake in Figure 4C. We changed the Y-axis in Figure 4H.

      Although this publication cites some of the relevant prior literature, it fails to cite some particularly relevant works. For example, the authors state, "The LARG DH domain was already used with the iLid system" and refers to a 2018 paper (ref 19), whereas that domain was first used in 2016 (PMID 27298323). Indeed, the authors used the plasmid from this 2016 paper to build their construct. 

      We thank the reviewer for pointing out this error; we have corrected the citation and now cite the seminal paper in the revised version.

      An analogous situation pertains to previous work that showed that an optogenetic probe containing the DH and PH domains in RhoGEF2 is somewhat toxic in vivo (table 6; PMID 33200987). Furthermore, it has previously been shown that mutation of the equivalent of F1044A and I1046E eliminates this toxicity (table 6; PMID 33200987) in vivo. This is particularly important because the Rho probe expressing RhoGEF2-DHPH is in widespread usage (76 citations in PubMed). The ability of this probe to activate Cdc42 may explain some of the phenotypic differences described resulting from the recruitment of RhoGEF2-DHPH and LARG-DH in a developmental context (PMID 29915285, 33200987). 

      We thank reviewer #2 for these comments, and added a small section in the discussion, for optogenetic users: 

      This underlines the attention that needs to be paid to the choice of specific GEF domains when using optogenetic tools. Tools using DH-PH domains of PRG have been widely used, both in mammalian cells and in Drosophila (with the orthologous gene RhoGEF2), and have been shown to be toxic in some contexts in vivo 28. Our study confirms the complex behavior of this domain which cannot be reduced to a simple RhoA activator.   

      Concerning the experiment shown in 4D, it would be informative to repeat this experiment in which a non-recruitable DH-PH domain of PRG is overexpressed at high levels and the DH domain of LARG is recruited. This would enable the authors to distinguish whether the protrusion response is entirely dependent on the cell state prior to activation or the combination of the cell state prior to activation and the ability of PRG DHPH to also activate Cdc42. 

      We thank the reviewer for his suggestion. Yet, we think that we have enough direct evidence that the protruding phenotype is due to both the cell state prior to activation and the ability of PRG DHPH to also activate Cdc42. First, we see a direct increase in Cdc42 activity following optoPRG recruitment (see Figure 6). This increase is sustained in the protruding phenotype and precedes Rac1 and RhoA activity, which shows that it is the first of these three GTPases to be activated. Moreover, we showed that inhibition of PAK by the very specific drug IPA3 completely abolishes only the protruding phenotype, which shows that PAK, a direct effector of Cdc42 and Rac1, is required for the protruding phenotype to happen. We also know that the cell state prior to activation defines the phenotype, thanks to the data presented in Figure 2.

      We further showed in Figure 1 that LARG DH-PH domain was not able to promote protrusion. The proposed experiment would be interesting to confirm that LARG does not have the ability to activate another GTPase, even in a different cell state with overexpressed PRG. However, we are not sure it would bring any substantial findings to understand the mechanism we describe here, given the facts provided above.  

      Similarly, as PRG activates both Cdc42 and Rho at high levels, it would be important to determine the extent to which the acute Rho activation contributes to the observed phenotype (e.g. with Rho kinase inhibitor). 

      We agree with the reviewer that it would be interesting to know whether RhoA activation contributes to the observed phenotype, and we have tried such experiments. 

      For the Rho kinase inhibitor, we tried Y-27632 and could never prevent the protruding phenotype from happening. However, we could not completely abolish the retracting phenotype either (even when the effect on the cells was quite strong and visible), which could be due to other effectors compensating for this inhibition. As RhoA has many other effectors, this does not tell us that RhoA is not required for protrusion.

      We also tried C3, which is a direct inhibitor of RhoA. However, it had too much impact on the basal state of the cells, making it impossible to recruit (cells were becoming round and clearly dying). As both the basal state and optogenetic activation require the activation of RhoA, it is hard to draw conclusions from experiments where no cell is responding.

      The ability of PRG to activate Cdc42 in vivo is striking given the strong preference for RhoA over Cdc42 in vitro (2400X) (PMID 23255595). Is it possible that at these high expression levels, much of the RhoA in the cell is already activated, so that the sole effect that recruited PRG can induce is activation of Cdc42? This is related to the previous point pertaining to absolute expression levels.  

      As discussed before, we think that it is not only a question of absolute expression levels, but also of the affinities between the different partners. But Reviewer #2 is right: there is a competition between the activation of RhoA and Cdc42 by optoPRG, and activation of Cdc42 probably happens at higher concentrations because of a smaller effective affinity.

      Still, we know that activation of Cdc42 by the PRG DH-PH domain is possible in vivo, as was very clearly shown in Castillo-Kauil et al., 2020 (PMID 33023908). They show that this activation requires the linker between the DH and PH domains of PRG, as well as Gαs activation, which requires a change in PRG DH-PH conformation. This conformational switch does not happen in vitro, which might explain why the affinity for Cdc42 was found to be very low.

      Minor points 

      In both the abstract and the introduction the authors state, "we show that a single protein can trigger either protrusion or retraction when recruited to the plasma membrane, polarizing the cell in two opposite directions." However, the cells do not polarize in opposite directions, ie the cells that retract do not protrude in the direction opposite the retraction (or at least that is not shown). Rather a single protein can trigger either protrusion or retraction when recruited to the plasma membrane, depending upon expression levels. 

      We thank the reviewer for this remark, and we agree that we had not shown any data supporting a change in polarization. We addressed this issue by now showing, in Supplementary Figure 1, the change in area in both the activated and the non-activated regions. The data clearly show that when a protrusion happens, the cell retracts in the non-activated region. On the other hand, when the cell retracts, a protrusion happens in the other part of the cell, while the total area stays approximately constant.

      We added the following sentence to describe our new figure:

      Quantification of the changes in membrane area in both the activated and non-activated part of the cell (Supp Figure 1B-C) reveals that the whole cell is moving, polarizing in one direction or the other upon optogenetic activation.

      While the authors provide extensive quantitative data in this manuscript and quantify the relative differences in expression levels that result in the different phenotypes, it would be helpful to quantify the absolute levels of expression of these GEFs relative to e.g. an endogenously expressed GEF. 

      We agree with the reviewer's comment, and we also wanted to have an idea of the absolute level of expression of GEFs present in these cells, to be able to relate fluorescence intensities to absolute concentrations. We tried different methods, especially with the purified fluorescent protein, but obtaining exact numbers is a hard task.

      We ended up quantifying the amount of fluorescent protein within a stable cell line thanks to ELISA and comparing it with the mean fluorescence seen under the microscope. 

      We estimated that the switch concentration was around 200 nM, which is 8 times more than the mean endogenous concentration according to https://opencell.czbiohub.org/, but should be reachable locally in wild-type cells, or globally in mutated cancer cells.

      Given the numerical data (mostly) in hand, it would be interesting to determine whether RhoGEF2 levels, cell area, the pattern of actin assembly, or some other property is most predictive of the response to PRG DHPH recruitment. 

      We think that the manuscript made it clear that the concentration of PRG DHPH is almost 100% predictive of the response to PRG DHPH. We believe that other phenotypes such as the cell area or the pattern of actin assembly would only be consequences of this. Interestingly, as experimenters we were absolutely not able to predict the behavior just by looking at the shape of the cell, even after hundreds of activation experiments; we tried to find characteristics that would distinguish both populations with the data in our hands and could not find any.

      There is some room for general improvement/editing of the text. 

      We tried our best to improve the text, following the reviewers' suggestions.

    1. Author response:

      The following is the authors’ response to the previous reviews.

      Reviewing Editor's comments:

      There appear to be several mistakes/missing details in the additional statistical analyses reported in their response to Reviewer #1's comments:

      (1) Detecting differentially expressed genes (DEGs):

      Reviewer #1 suggested adding an interaction term between sex and environment (ethnicity) in identifying DEGs. The authors performed ANCOVA analysis with sex and ethnicity as covariates (but not the interaction) and found sex explained more variance. This is not what the reviewer asked for, and the results do not help identify DEGs.

      We understand the reviewer’s suggestion about identifying DEGs using a sex × ethnicity interaction. However, we could not find an appropriate tool for such an analysis, although we searched the literature carefully. It should be noted that the interaction analysis between sex and environment was designed only for genotype data rather than gene expression data. Besides, considering that we have added multiple covariates in our DEG detection, adding an interaction term between sex and environment (ethnicity) would make the formulation too complex to resolve using current tools. Alternatively, we have fitted a linear regression model to test the contribution of sex to DEG detection in the revision (see details below). We would appreciate it if the reviewer could point to any available tools or previous studies conducting interaction analysis for DEG identification.

      (2) Overlap between DEGs and genes under positive selection in Tibetans (TSNGs)

      The authors claimed that the overlaps are significantly enriched in the "sex-combined" set (p=0.048) and the "male-only" set (p=9e-4), but it seems that the authors calculated the p-values incorrectly. Based on the histogram shown in Fig 3R (left panel), at least 750 out of 10,000 permutations led to 4 genes in overlap and there are additional permutations with 5 or more genes in overlap, so the p-value for the sex-combined set cannot be 0.048. In addition, the permutation procedure is somewhat questionable: it is unclear whether randomly sampling 192 genes from the human genome is a reasonable choice, without matching for relevant gene features.

      As we explained in the response to Reviewer-1, we agree with the reviewer’s point that random sampling of genes in permutation should be extracted from genes expressed in each tissue rather than the entire genome. Based on this updated random sampling procedure, we redid the analysis, and our previous conclusions remain unchanged.

      (3) Polygenic adaptation signal based on eQTL information:

      The PolyGraph method is designed for highly polygenic traits with causal variants spread across the genome. However, the genetic architecture of the expression of a gene is much less polygenic, with at most a few cis-eQTLs per gene, so the PolyGraph model does not apply to the expression of individual genes. On the other hand, eQTLs for different genes are associated with different "traits", so they cannot simply be aggregated together for PolyGraph analysis. Based on the Methods description, it is unclear how the authors ran the PolyGraph analysis on eQTLs in practice and whether this practice is appropriate for detecting a polygenic adaptation signal on gene expression.

      We understand the reviewer’s concern on polygenic adaptation analysis. In this study, we tested whether the estimated polygenic scores from eQTLs (estimated using sums of allele frequencies at independent eQTLs weighted by their effect sizes) were significantly enriched in Tibetans compared to other populations. The detailed descriptions of polygenic test are provided in the response to Reviewer-1.

      Reviewer #1 (Public Review):

      The revised manuscript now presents 1) a permutation-based test for the significance of the overlap between DEGs and genes with positive selection signals in Tibetans, and 2) a polygenic adaptation test for the eQTLs. I make my suggestions in detail below:

      Major Comments

      (1) My previous concern regarding the DEG analysis remains unresolved. The authors agreed in their response that the differences between the male- and female-specific DEGs are insufficient to explain the difference between sex-combined and sex-specific DEGs (Figure S6). However, the results section still states the opposite pattern between males and females as a decisive reason for the difference (p. 9, lines 236-239). Again, I would recommend that the authors test alternative ways of analysis to boost statistical power for DEG detection, other than simply splitting the data into males and females and performing the analysis in each subset. For example, the authors may consider utilizing gene-by-environment interaction analysis schemes, here with biological sex as an environmental factor.

      To evaluate the effect of sex on gene expression in each layer, we adopted two strategies: 1) calculating the variance explained by sex in the expression data; 2) evaluating the statistical significance of the association between sex and the expression data.

      Firstly, we observed a significantly higher variance explained by sex than by ethnicity in six layers of the placenta (see details in our previous response to reviewers).

      Then, we fitted a linear regression model to test whether sex affects gene expression. For each gene, a linear regression model was fitted using the R glm function with sex as the covariate: glm(gene expression ~ sex). We discovered 5,865 genes significantly associated with sex, and most of them were located on the sex chromosomes. We observed that 62.63% of genes overlapped with those showing opposite differential directions between the sex-combined and the sex-specific analyses.
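
      As a rough illustration of the per-gene test described above, the sketch below fits expression ~ sex by ordinary least squares for a single gene and returns the slope and its t statistic. It is written in plain Python rather than R, it is not the authors' script, and the 0/1 coding of sex, the function name and the toy expression values are all assumptions made for the example. With a binary covariate, the slope is simply the difference between the two group means.

```python
import math


def regress_on_sex(expression, sex):
    """Ordinary least squares fit of expression ~ sex for one gene.

    `sex` is assumed to be coded 0/1 (hypothetical coding).
    Returns (intercept, slope, t_statistic_of_slope).
    """
    n = len(expression)
    mean_x = sum(sex) / n
    mean_y = sum(expression) / n
    sxx = sum((x - mean_x) ** 2 for x in sex)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(sex, expression))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # Residual variance with n - 2 degrees of freedom.
    sse = sum((y - (intercept + slope * x)) ** 2
              for x, y in zip(sex, expression))
    residual_var = sse / (n - 2)
    t_stat = slope / math.sqrt(residual_var / sxx)
    return intercept, slope, t_stat


# Toy data for one gene: three samples per sex (made-up values).
toy_expression = [1.0, 2.0, 3.0, 5.0, 6.0, 7.0]
toy_sex = [0, 0, 0, 1, 1, 1]
print(regress_on_sex(toy_expression, toy_sex))
```

      In a genome-wide setting the t statistic would be converted to a p-value (Student's t with n − 2 degrees of freedom) and corrected for the number of genes tested; that step is omitted here.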

      Considering that the opposite direction of DEGs is likely only one of the explanations for the discrepancy between the sex-combined and the sex-specific DEGs, and that there might be alternative mechanisms for this phenomenon, we have toned down the description of this point in the revised manuscript:

      “Considering 62.63% of DEGs (248/396) with an opposite direction of between-population expression divergence in males and females, respectively (Figure S6), we reckon that there might be other factors such as sample size or cell composition affecting the identification of DEGs, which could cancel out the differences in the sex-combined analysis.” (Page 9)

      (2) Multiple testing schemes are still sub-optimal in some cases. Most of all, for the p-values in the WGCNA analysis (p. 11), the authors corrected for the number of traits (n=12) after adjusting for the correlation between them. However, they did not mention whether they accounted for the number of modules they tested at all (n=136 and 161 for males and females, respectively). Whether they account for the number of modules will make a substantial difference in the significance threshold; please incorporate and describe a proper multiple testing scheme for this analysis.

      We understand the reviewer’s point. Indeed, for multiple testing schemes, we considered both the number of traits and the number of modules. For the number of modules, multiple testing correction is already embedded in WGCNA, as described in the published studies (Li et al. 2018; Zeng et al. 2023).

      (3) Evidence for natural selection on the observed DEG pattern is still weak and not properly described.

      (1) For the overlap between DEGs and TSNGs, the authors introduced a permutation-based test, but used a total set of genes in the human genome as a comparison set (p. 25, lines 699-700). I believe that the authors should sample random sets of genes from those already expressed in each tissue to make a fair comparison.

      We agree with the reviewer’s point that random sampling of genes in permutation should be extracted from genes expressed in each tissue, which is a fair comparison between the observed and the simulated counts of the overlapped genes.

      Therefore, for each permutation, we randomly extracted 192 genes from all the placenta expressed genes identified from the seven layers (17,284 genes in total), and we overlapped them with DEGs of the three sets (female + male, female only, and male only) and counted the gene numbers. After 10,000 permutations, we constructed a null distribution for each set, and found that the overlaps between DEGs and TSNGs were significantly enriched in the “sex-combined” set (p-value = 0.0123) and the “male-only” set (p-value < 1e-4), but not in the “female-only” set (p-value = 0.0572) (Figure R1). This result suggests that the observed DEGs are significantly enriched in TSNGs when compared to the set of random sampling, especially for the DEGs from the “male-only” set.
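
      The permutation scheme described above can be sketched as follows. This is an illustrative re-implementation rather than the authors' code: the function name, the fixed random seed and the convention that the p-value is the fraction of permutations with an overlap at least as large as the observed one are assumptions.

```python
import random


def permutation_overlap_pvalue(background, degs, n_draw, observed,
                               n_perm=10000, seed=0):
    """One-sided permutation p-value for a gene-set overlap.

    background: genes eligible for sampling (e.g. all expressed genes)
    degs:       the DEG set being tested
    n_draw:     size of each random gene set (192 in the text above)
    observed:   observed number of overlapping genes
    """
    rng = random.Random(seed)
    background = list(background)
    degs = set(degs)
    hits = 0
    for _ in range(n_perm):
        draw = rng.sample(background, n_draw)  # sampling without replacement
        overlap = sum(1 for gene in draw if gene in degs)
        if overlap >= observed:
            hits += 1
    return hits / n_perm


# Toy example with made-up gene identifiers.
toy_background = [f"gene{i}" for i in range(1000)]
toy_degs = toy_background[:50]
print(permutation_overlap_pvalue(toy_background, toy_degs,
                                 n_draw=20, observed=4, n_perm=2000))
```

      With this convention, a p-value reported as "< 1e-4" after 10,000 permutations corresponds to no permutation reaching the observed overlap.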

      Author response image 1.

      The distribution of 10,000 permutation tests of counts of the overlapped genes between 192 TSNGs and the DEGs randomly selected from the expressed genes in the placenta. The red-dashed lines indicate the observed values based on the randomly selected DEGs.

      (2) The entire PolyGraph analysis for polygenic adaptation is poorly described. The current version of the Methods does not clarify i) for which genes the eQTLs are discovered, ii) how the authors performed the eQTL analysis, iii) how the authors polarized the effect, and iv) how they set up a comparison between the eQTLs and the others.

      Considering that the RNA-seq data of the placenta mostly represent the transcriptomes of the newborns, according to our analysis of the maternal-fetal composition of each dissected layer, we conducted eQTL analysis using the fetal genotypes and the placental tissue gene expression data (TPM) with the R package MatrixEQTL (https://github.com/andreyshabalin/MatrixEQTL), taking altitude and maternal age as covariates. We took a window of 1 Mb upstream and 1 Mb downstream of each SNP to select the genes or expression probes to test. Associations between these SNP–gene combinations were calculated using a linear model. This tool can distinguish local (cis-) and distant (trans-) eQTLs. We performed separate corrections for multiple testing.
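
      The window-based selection of SNP–gene pairs can be sketched as below. This is only an illustration of the ±1 Mb rule, not MatrixEQTL's own implementation: the tuple layout, the use of a single position per gene (for example a transcription start site) and all identifiers are hypothetical.

```python
from collections import defaultdict

WINDOW = 1_000_000  # 1 Mb on each side of the SNP


def cis_pairs(snps, genes, window=WINDOW):
    """Return (snp_id, gene_id) pairs lying within `window` bp of each other.

    snps:  iterable of (snp_id, chromosome, position)
    genes: iterable of (gene_id, chromosome, position); one position per
           gene is an assumption made for this sketch
    """
    genes_by_chrom = defaultdict(list)
    for gene_id, chrom, pos in genes:
        genes_by_chrom[chrom].append((gene_id, pos))

    pairs = []
    for snp_id, chrom, pos in snps:
        for gene_id, gene_pos in genes_by_chrom[chrom]:
            if abs(gene_pos - pos) <= window:
                pairs.append((snp_id, gene_id))
    return pairs


# Toy input with made-up identifiers and coordinates.
toy_snps = [("snp1", "1", 5_000_000)]
toy_genes = [("geneA", "1", 5_900_000),
             ("geneB", "1", 6_100_000),
             ("geneC", "2", 5_000_000)]
print(cis_pairs(toy_snps, toy_genes))
```

      Only the pairs returned by such a filter would be tested as local (cis-) associations; all remaining SNP–gene combinations fall into the distant (trans-) category.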

      Finally, we detected 5,251 eQTLs (involving 319 eGenes), covering the SNPs significantly associated with gene expression (p-value < 5e-8). To identify signatures of polygenic selection in Tibetans using the eQTL information, we removed SNPs in linkage disequilibrium (r2 > 0.2 in the 1000 Genomes Project) and obtained 176 independent eQTLs as input to PolyGraph (Racimo et al. 2018). The QB (Racimo et al. 2018) and QX (Berg and Coop 2014) frameworks are used in PolyGraph to determine, from the summary statistics of the eQTL set, whether the estimated polygenic scores exhibit more variance among populations than expected under genetic drift.

      In this study, we focused on testing whether the estimated polygenic scores from eQTLs (estimated as sums of allele frequencies at independent eQTLs weighted by their effect sizes) were significantly enriched in Tibetans compared to other populations. The significance was evaluated by comparison with 10,000 sets of control SNPs. Each set of control SNPs was randomly drawn from the genomic SNPs and contained the same number of SNPs as the eQTL set, matched one-to-one by minor allele frequency.
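
      The two ingredients named above, a frequency-weighted polygenic score and frequency-matched control sets, can be sketched as follows. This is a simplified illustration and not the PolyGraph implementation: the function names, the choice of ten MAF bins, and sampling controls with replacement are all assumptions.

```python
import random
from collections import defaultdict


def polygenic_score(allele_freqs, effect_sizes):
    """Sum of allele frequencies weighted by their effect sizes."""
    return sum(f * b for f, b in zip(allele_freqs, effect_sizes))


def maf_bin(maf, n_bins=10):
    """Map a minor allele frequency in [0, 0.5] to one of n_bins bins."""
    return min(int(maf * 2 * n_bins), n_bins - 1)


def draw_matched_controls(eqtl_mafs, genome_snps, rng, n_bins=10):
    """Draw one control SNP per eQTL from the same MAF bin.

    genome_snps -- dict mapping snp_id to minor allele frequency.
    Raises IndexError if a bin has no candidate control SNP.
    """
    by_bin = defaultdict(list)
    for snp_id, maf in genome_snps.items():
        by_bin[maf_bin(maf, n_bins)].append(snp_id)
    # One draw per eQTL; the same control may be drawn twice (assumption).
    return [rng.choice(by_bin[maf_bin(maf, n_bins)]) for maf in eqtl_mafs]
```

      Repeating the draw many times (10,000 in the analysis described above) and scoring each control set gives a null distribution against which the eQTL-based statistic can be compared.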

      The PolyGraph result showed that Tibetans have a clear signature of polygenic selection on gene expression (Bonferroni-corrected p-value = 0.003, Figure S12). In other words, the frequencies of alleles associated with gene expression (up-regulation or down-regulation) were specifically enriched in Tibetans, a signal of positive selection.

      Minor comments (1) In Figure S1, the amount of variance explained by PC1 and PC2 needs to be corrected. PC1 explains less variance than PC2 (0.11% vs 0.68%).

      It was a typing error that mixed up the variances between PC1 and PC2. We have corrected it in the revised version.

      (2) In the section "Sex-biased expression divergence ..." (p. 8), the authors are using the term "gender" instead of sex. Considering that they are talking about the biological sex of each infant, I believe that sex is a more appropriate term to be used than gender.

      Following the reviewer’s suggestion, we rephrased “gender” as “sex” in the revised manuscript to describe the biological differences between females and males.

      Reviewer #3 (Public Review):

      More than 80 million people live at high altitude. This impacts health outcomes, including those related to pregnancy. Longer-lived populations at high altitudes, such as the Tibetan and Andean populations show partial protection against the negative health effects of high altitude. The paper by Yue sought to determine the mechanisms by which the placenta of Tibetans may have adapted to minimise the negative effect of high altitude on fetal growth outcomes. It compared placentas from pregnancies from Tibetans to those from the Han Chinese. It employed RNAseq profiling of different regions of the placenta and fetal membranes, with some follow-up of histological changes in umbilical cord structure and placental structure. The study also explored the contribution of fetal sex in these phenotypic outcomes.

      A key strength of the study is the large sample sizes for the RNAseq analysis, the analysis of different parts of the placenta and fetal membranes, and the assessment of fetal sex differences.

      A main weakness is that this study, and its conclusions, largely rely on transcriptomic changes informed by RNAseq. Changes in genes and pathways identified through bioinformatic analysis were not verified by alternate methods, such as by western blotting, which would add weight to the strength of the data and its interpretations. There is also a lack of description of patient characteristics, so the reader is unable to make their own judgments on how placental changes may link to pregnancy outcomes. Another weakness is that the histological analyses were performed on n=5 per group and were rudimentary in nature.

      For the three weaknesses raised by the reviewer, here are our responses:

      (1) Because our conclusions rely largely on the transcriptomic data, we agree with the reviewer that more experiments are needed to validate these results. However, this study aimed mainly to provide a transcriptomic landscape of the high-altitude placenta and to characterize the gene-expression differences between native Tibetans and Han migrants. Exploring the molecular mechanisms is not the main task of this study, and more validation experiments are warranted in the future.

      (2) For the lack of description of patient characteristics, actually, we provided three-level results on the placental changes of Tibetans: macroscopic phenotypes (higher placental weight and volume), histological phenotypes (larger umbilical vein walls and umbilical artery intima and media; lower syncytial knots/villi ratios) and transcriptomic phenotypes (DEG and differential modules). Combined with the previous studies, these placenta changes suggest a better reproductive outcome. For example, the placenta volume shows a significantly positive correlation with birth weight (R = 0.31, p-value = 2.5e-16), therefore, the larger placenta volume of Tibetans is beneficial to fetal development at high altitude. In addition, the larger umbilical vein wall and umbilical artery intima and media of Tibetans can explain their adaptation in preventing preeclampsia.

      (3) For the sample size of the histological analyses, we understand the reviewer’s concern that 5 vs. 5 samples is not a large sample. This is because it was difficult to collect high-altitude Han placenta samples: we obtained only 13 Han samples, from which we selected 5 infant-sex-matched samples.

      Minor point:

      I feel the authors have responded well to the other reviewer comments. However, I am disappointed that the authors did not address my comment related to the validation of their RNAseq data. In particular, they failed to add new data that verifies and supports their RNAseq findings on pathways affected. This is imperative as their conclusions are based solely on the RNAseq analysis. The only other comment I have is that they should add a description of all abbreviations, including those in the supplementary information (like Table S12).

      Regarding experimental validation of the transcriptome data, we understand the reviewer’s concern. However, as we mentioned before, this study aimed mainly to provide a transcriptomic landscape of the high-altitude placenta; exploring the molecular mechanisms is not the main task of this study, and more validation experiments are warranted in the future. We have toned down the description of how much the transcriptomic data can explain the biological differences, and we call for further functional validation in the future:

      “the transcriptome data is insufficient to explain the underlying molecular mechanisms of genetic adaptation in Tibetans. Future single-cell transcriptome analysis and functional validations of the candidate genes are warranted to reveal the responsible cell types and the molecular pathways.” (highlighted in Page 20)

      Regarding abbreviations, following the reviewer’s suggestion, we added descriptions of all abbreviations used in this study in the corresponding positions (Table S1 and S12).

      References

      Berg JJ, and Coop G (2014). A population genetic signal of polygenic adaptation. PLoS Genet 10(8): e1004412.

      Li J, et al. (2018). Application of Weighted Gene Co-expression Network Analysis for Data from Paired Design. Sci Rep 8(1): 622.

      Racimo F, Berg JJ, and Pickrell JK (2018). Detecting Polygenic Adaptation in Admixture Graphs. Genetics 208(4): 1565-1584.

      Zeng JF, et al. (2023). Functional investigation and two-sample Mendelian randomization study of neuropathic pain hub genes obtained by WGCNA analysis. Frontiers in Neuroscience 17.

    1. Author response:

      The following is the authors’ response to the current reviews.

      Reviewer #2 (Public review):

      Dipasree Hajra et al demonstrated that Salmonella was able to modulate the expression of Sirtuins (Sirt1 and Sirt3) and regulate the metabolic switch in both host and Salmonella, promoting its pathogenesis. The authors found Salmonella infection induced high levels of Sirt1 and Sirt3 in macrophages, which were skewed toward the M2 phenotype allowing Salmonella to hyper-proliferate. Mechanistically, Sirt1 and Sirt3 regulated the acetylation of HIF-1alpha and PDHA1, therefore mediating Salmonella-induced host metabolic shift in the infected macrophages. Interestingly, Sirt1 and Sirt3-driven host metabolic switch also had an effect on the metabolic profile of Salmonella. Counterintuitively, inhibition of Sirt1/3 led to increased pathogen burdens in an in vivo mouse model. Overall, this is a well-designed study.

      The revised manuscript has addressed all of the previous comments. The re-analysis of flow cytometry and WB data by authors makes the results and conclusion more complete and convincing.

      We are immensely grateful to the reviewer for improving the strength of the manuscript by providing insightful comments and for appreciating the work.

      Reviewer #3 (Public review):

      Summary:

      In this paper Hajra et al have attempted to identify the role of Sirt1 and Sirt3 in regulating metabolic reprogramming and macrophage host defense. They have performed gene knock down experiments in RAW macrophage cell line to show that depletion of Sirt1 or Sirt3 enhances the ability of macrophages to eliminate Salmonella Typhimurium. However, in mice inhibition of Sirt1 resulted in dissemination of the bacteria but the bacterial burden was still reduced in macrophages. They suggest that the effect they have observed is due to increased inflammation and ROS production by macrophages. They also try to establish a weak link with metabolism. They present data to show that the switch in metabolism from glycolysis to fatty acid oxidation is regulated by acetylation of Hif1a, and PDHA1.

      Strengths:

      The strength of the manuscript is that the role of Sirtuins in host-pathogen interactions has not been previously explored in-depth making the study interesting. It is also interesting to see that depletion of either Sirt1 or Sirt3 results in a similar outcome.

      Weaknesses:

      The major weakness of the paper is the low quality of data, making it harder to substantiate the claims. Also, there are too many pathways and mechanisms being investigated. It would have been better if the authors had focussed on either Sirt1 or Sirt3 and elucidated how it reprograms metabolism to eventually modulate host response against Salmonella Typhimurium. Experimental evidences are also lacking to prove the proposed mechanisms. For instance they show correlative data that knockdown of Sirt1 mediated shift in metabolism is due to HIF1a acetylation but this needs to be proven with further experiments.

      As the reviewer’s public review is unchanged from the previous version and contains no further recommendations for the authors, we stand by our earlier author response. We respect the reviewer’s opinion and thank the reviewer for the critical analysis of our work.

      ---------

      The following is the authors’ response to the previous reviews.

      Reviewer #2 (Public Review):

      Dipasree Hajra et al demonstrated that Salmonella was able to modulate the expression of Sirtuins (Sirt1 and Sirt3) and regulate the metabolic switch in both host and Salmonella, promoting its pathogenesis. The authors found Salmonella infection induced high levels of Sirt1 and Sirt3 in macrophages, which were skewed toward the M2 phenotype allowing Salmonella to hyper-proliferate. Mechanistically, Sirt1 and Sirt3 regulated the acetylation of HIF-1alpha and PDHA1, therefore mediating Salmonella-induced host metabolic shift in the infected macrophages. Interestingly, Sirt1 and Sirt3-driven host metabolic switch also had an effect on the metabolic profile of Salmonella. Counterintuitively, inhibition of Sirt1/3 led to increased pathogen burdens in an in vivo mouse model. Overall, this is a well-designed study.

      Comments on revised version:

      The authors have performed additional experiments to address the discrepancy between in vitro and in vivo data. While this offers some potential insights into the in vivo role of Sirt1/3 in different cell types and how this affects bacterial growth/dissemination, I still believe that Sirt1/3 inhibitors could have some effect on the gut microbiota contributing to increased pathogen counts. This possibility can be discussed briefly to give a better scenario of how Sirt1/3 inhibitors work in vivo. Additionally, the manuscript would improve significantly if some of the flow cytometry analysis and WB data could be better analyzed.

      We are highly grateful for your valuable and insightful comments. Thank you for appreciating the merit of our manuscript. As rightly pointed out by the reviewer, we acknowledge the probable link between Sirtuins and the gut microbiota and its effect on increased bacterial loads, as indicated by previous studies (PMID: 22115311, PMID: 19228061). These reports suggested that treating rats with a low dose of the Sirt1 activator resveratrol for 25 days under 5% DSS-induced colitis altered the gut microbiota profile, with increased lactobacilli and bifidobacteria alongside a reduced abundance of enterobacteria. This is consistent with our study, in which we detected enhanced Salmonella (a member of the Enterobacteriaceae family) loads in C57BL/6 mice under Sirt1/3 in vivo knockdown or inhibitor treatment, and a reduced burden under treatment with the Sirt1 activator SRT1720.

      As per your valid suggestion, we have discussed this possibility in our discussion section. (Line- 541-548).

      We have incorporated the suggestions for the improvement in the analysis of WB data and flow cytometry.

      Reviewer #3 (Public Review):

      Summary:

      In this paper Hajra et al have attempted to identify the role of Sirt1 and Sirt3 in regulating metabolic reprogramming and macrophage host defense. They have performed gene knock down experiments in RAW macrophage cell line to show that depletion of Sirt1 or Sirt3 enhances the ability of macrophages to eliminate Salmonella Typhimurium. However, in mice inhibition of Sirt1 resulted in dissemination of the bacteria but the bacterial burden was still reduced in macrophages. They suggest that the effect they have observed is due to increased inflammation and ROS production by macrophages. They also try to establish a weak link with metabolism. They present data to show that the switch in metabolism from glycolysis to fatty acid oxidation is regulated by acetylation of Hif1a, and PDHA1.

      Strengths:

      The strength of the manuscript is that the role of Sirtuins in host-pathogen interactions has not been previously explored in-depth making the study interesting. It is also interesting to see that depletion of either Sirt1 or Sirt3 results in a similar outcome.

      Weaknesses:

      The major weakness of the paper is the low quality of data, making it harder to substantiate the claims. Also, there are too many pathways and mechanisms being investigated. It would have been better if the authors had focussed on either Sirt1 or Sirt3 and elucidated how it reprograms metabolism to eventually modulate host response against Salmonella Typhimurium. Experimental evidence is also lacking to prove the proposed mechanisms. For instance they show correlative data that knock down of Sirt1 mediated shift in metabolism is due to HIF1a acetylation but this needs to be proven with further experiments.

      We appreciate the reviewer’s critical analysis of our work. In the revised manuscript, we aimed to eliminate the low-quality data sets and have tried to substantiate them with better and conclusive ones, as directed in the recommendations for the author section. We agree with the reviewer that the inclusion of both Sirtuins 1 and 3 has resulted in too many pathways and mechanisms and focusing on one SIRT and its mechanism of metabolic reprogramming and immune modulation would have been a less complicated alternative approach. However, as rightly pointed out, our work demonstrated the shared and few overlapping roles of the two sirtuins, SIRT1 and SIRT3, together mediating the immune-metabolic switch upon Salmonella infection. As per the reviewer’s suggestion, we have performed additional experiments with HIF-1α inhibitor treatment in our revised manuscript to substantiate our correlative findings on SIRT1-mediated regulation of host glycolysis (Fig.7G). We wanted to clarify our claim in this regard. Our results suggested that loss of SIRT1 function triggered increased host glycolysis alongside hyperacetylation of HIF-1α. HIF-1α is reported to be one of the important players in glycolysis regulation (Kierans SJ, Taylor CT. Regulation of glycolysis by the hypoxia-inducible factor (HIF): implications for cellular physiology. J Physiol. 2021;599(1):23-37. doi:10.1113/JP280572.) and additionally, SIRT1 has been shown to regulate HIF-1α acetylation status (Lim JH, Lee YM, Chun YS, Chen J, Kim JE, Park JW. Sirtuin 1 modulates cellular responses to hypoxia by deacetylating hypoxia-inducible factor 1 alpha. Mol Cell. 2010;38(6):864-878. doi:10.1016/j.molcel.2010.05.023.) Further, ectopic expression of SIRT1 has been demonstrated to reduce glycolysis by negatively regulating HIF-1α. (Wang Y, Bi Y, Chen X, et al. Histone Deacetylase SIRT1 Negatively Regulates the Differentiation of Interleukin-9-Producing CD4(+) T Cells. Immunity. 2016;44(6):1337-1349. doi:10.1016/j.immuni.2016.05.009). We have subsequently shown in Fig. 7G, that the increase in host glycolysis upon SIRT knockdown in the infected macrophages gets lowered upon HIF-1α inhibitor treatment, suggesting that one of the mechanisms of SIRT-mediated regulation of host glycolysis is via regulation of HIF-1α. However, this warrants further future mechanistic research.

      Recommendations for the authors:

      Reviewer #2 (Recommendations For The Authors):

      (1) Figures 8I-S: are only viable cells used for analysis? Please provide gating strategy used for these analyses.

      (2) Many changes seen in WB seem to be marginal. Since the authors used densitometric plot to quantify the band intensities, I expect these experiments were repeated at least three times. Please indicate the number of repeats. For instance, Figures 7C, 7I (UI SCR vs UI shSIRT3), 7J, show marginal changes or no changes. What do other WB images look like? Are they more convincing than the ones currently shown? Please provide them in the response letter.

      (3) Figure 7C: label is a bit misleading. Please relabel the figure title to Acetylated HIF vs total levels

      (4) Figure 7J: which band is AcPDHA1?

      (1) We are highly apologetic for not clarifying our gating strategy for the analysis.

      We initially gated the viable splenocyte population based on Forward scatter (FSC) and Side Scatter (SSC). This gated population was further subjected to gating based on cell FSC-H (height) versus FSC-A (area). Subsequently, the population was gated as per SSC-A and GFP (expressed by intracellular bacteria) based on the autofluorescence exhibited by the uninfected control (Fig. 8I-J).

      Author response image 1.

      UNINFECTED

      Author response image 2.

      VEHICLE CONTROL INFECTED

      Author response image 3.

      EX-527 INFECTED

      Author response image 4.

      3TYP INFECTED

      Author response image 5.

      SRT 1720 INFECTED

      For gating different cell types such as F4/80 (PE) positive population in Fig. 8K-L, the viable cell population was gated based on SSC-A versus PE-A to gate the macrophage population. These macrophage populations were gated further based on GFP (Salmonella) + population to obtain the percentage of macrophage population harboring GFP+ bacteria. Similar strategies were followed for other cell types as depicted in Fig. 8M-S, Fig. S8.

      (2) We agree with the reviewer’s concern with the marginal changes in the western blots (Figures 7C, 7I (UI SCR vs UI shSIRT3), 7J). As per the suggestions, we have provided the alternate blot images and have indicated the number of repeats in the manuscript. The alternate blot images are provided herewith:

      Author response image 6.

      Alternate blot images for Fig. 7B-C

      Author response image 7.

      Alternate blot images for Fig. 7I, J

      (3) We are highly thankful to the reviewer for this suggestion. We have relabelled Fig. 7C as Acetylated HIF-1α over total HIF-1α, as suggested.

      (4) The acetylated PDHA1 band in Fig. 7J has now been marked as suggested. We are extremely apologetic for the inconvenience caused.

      Author response image 8.

      Reviewer #3 (Recommendations For The Authors):

      The authors have done some work to improve the manuscript. However, the data presented lacks clarity.

      Fig 4B: I still do not see a change in Ac p65 in the less saturated blot. It looks reduced as the band is distorted. I am not sure how this could be quantified.

      Fig S2 b-actin bands are hyper saturated, and it is not possible to decipher the knockdown efficiency. It is probably better to provide a ponceau staining similar to S2C. The band intensity values are out of place.

      Fig 5F HADHA blot: Lane 1 expression appears to be significantly higher than lane 3, but the values mentioned do not match the intensity of the bands.

      It is hard to interpret the authors' claim that the shift in metabolism is HIF1a-dependent.

      Fig 7B: I would expect HIF1a acetylation to be increased in UI ShSIRT1 compared to UI SCR. The blot shows reduced HIF1a acetylation.

      Fig 7D: SIRT1 immunoprecipitates with HIF1a equally under all conditions. Is this what the authors expect? Labelling of the blots are not clear. It looks like the bottom SIRT1 blot is from Beads IgG control.

      Fig 7H: How does PDHA1 interact with SIRT3 so strongly in shSIRT3 cells (lane 2)?

      Authors have mentioned in their response that a knockdown of 40% has been achieved in the uninfected but the blot does not reflect that. SIRT3 expression seems to be more in the knockdown.

      Blots are also not labelled properly especially Input. The lanes are not marked.

      We thank the reviewer for acknowledging the improvements in the revised version and for suggesting further clarifications and improvements.

      We have tried to incorporate the specified modifications to the best of our abilities in the revised manuscript.

      We are highly apologetic for the inconclusive blot image in the figure 4B. We have provided an alternative blot image with better clarity for Fig.4B used for quantification analysis.

      Author response image 9.

      As per the reviewer’s valuable suggestions, we have provided the ponceau image in the Fig. S2B.

      We thank the reviewer for rightly pointing out the discrepancy in the band intensity quantification in Fig. 5F. We have re-evaluated the intensities in ImageJ and have provided the correct band intensities. We are highly apologetic for the inaccuracies.

      As per the reviewer’s previous suggestion, we have performed additional experiments with HIF-1α inhibitor treatment in our revised manuscript to substantiate our correlative findings on SIRT1-mediated regulation of host glycolysis (Fig.7G). We wanted to clarify our claim in this regard. Our results suggested that loss of SIRT1 function triggered increased host glycolysis alongside hyperacetylation of HIF-1α. HIF-1α is reported to be one of the important players of glycolysis regulation (Kierans SJ, Taylor CT. Regulation of glycolysis by the hypoxia-inducible factor (HIF): implications for cellular physiology. J Physiol. 2021;599(1):23-37. doi:10.1113/JP280572.) and additionally, SIRT1 has been shown to regulate HIF-1α acetylation status (Lim JH, Lee YM, Chun YS, Chen J, Kim JE, Park JW. Sirtuin 1 modulates cellular responses to hypoxia by deacetylating hypoxia-inducible factor 1alpha. Mol Cell. 2010;38(6):864-878. doi:10.1016/j.molcel.2010.05.023.) Further, ectopic expression of SIRT1 has been demonstrated to reduce glycolysis by negatively regulating HIF-1α. (Wang Y, Bi Y, Chen X, et al. Histone Deacetylase SIRT1 Negatively Regulates the Differentiation of Interleukin-9-Producing CD4(+) T Cells. Immunity. 2016;44(6):1337-1349. doi:10.1016/j.immuni.2016.05.009). We have subsequently shown in Fig. 7G, that the increase in host glycolysis upon SIRT knockdown in the infected macrophages gets lowered upon HIF-1α inhibitor treatment, suggesting that one of the mechanisms of SIRT-mediated regulation of host glycolysis is via regulation of HIF-1α. However, this warrants further future mechanistic research.

      We agree with the reviewer’s claim of increased HIF-1α acetylation in the UI sh1 versus UI SCR. The apparent reduced acetylation depicted in UI sh1 in Fig. 7B could be attributed to lower HIF-1α levels in the UI sh1 compared to UI SCR. Therefore, we have provided an alternate blot image that has been used for quantification in Fig. 7C (Author response image 6).

      To answer the reviewer’s question on Fig. 7D, we observed a roughly equal degree of HIF-1α immunoprecipitation upon HIF-1α pull-down in all sample cohorts under SIRT1 inhibitor treatment. However, we observed reduced interaction of HIF-1α with SIRT1 in the infected sample upon SIRT1 inhibitor treatment.

      We thank the reviewers for suggesting improvements in the blot labelling and for raising this concern. We have corrected the blot labelling to avoid the previous confusion.

      We appreciate the reviewer’s concern and therefore we have provided an alternate blot image for Fig. 7H which might address the previous stated concern wherein we have achieved an enhanced SIRT3 knockdown percentage.

      We are extremely apologetic for the improper labelling of the Input blot with unmarked lanes. We have addressed this issue by labelling the lanes in the input section of the blots.

    1. Author Response

      The following is the authors’ response to the previous reviews

      We appreciate the positive comments from the editors and reviewers. The following are our point-by-point responses to the questions and comments of the reviewers:

      Reviewer #1 (Public Review):

      In this study, Jiamin Lin et al. investigated the potential positive feedback loop between ZEB2 and ACSL4, which regulates lipid metabolism and breast cancer metastasis. They reported a correlation between high expression of ZEB2 and ACSL4 and poor survival of breast cancer patients, and showed that depletion of ZEB2 or ACSL4 significantly reduced lipid droplets abundance and cell migration in vitro. The authors also claimed that ZEB2 activated ACSL4 expression by directly binding to its promoter, while ACSL4 in turn stabilized ZEB2 by blocking its ubiquitination. While the topic is interesting, there are several concerns with the study:

      1. My concern regarding the absence of appropriate thresholds or False Discovery Rate (FDR) adjustments for the RNA-seq analysis has not been addressed, which risks incorrect thresholds and erroneous identification of significant signals.

      Response: We thank the reviewer for the concern about the RNA-seq analysis. RNA-seq data was analyzed by the Benjamini and Hochberg’s approach for controlling the false discovery rate. The procedure of RNA-seq bioinformatic analysis is as follows: For data analysis, raw data of fastq format were firstly processed through in-house perl scripts. In this step, clean data were obtained by removing reads containing adapter, reads containing N base and low quality reads from raw data. All the downstream analyses were based on the clean data with high quality. Index of the reference genome was built using Hisat2 v2.0.5 and paired-end clean reads were aligned to the reference genome using Hisat2 v2.0.5. FeatureCounts v1.5.0-p3 was used to count the reads numbers mapped to each gene, and then FPKM of each gene was calculated based on the length of the gene and reads count mapped to this gene. Differential expression analysis of two conditions/groups was performed using the DESeq2 R package (1.20.0). The resulting P-values were adjusted using the Benjamini and Hochberg’s approach for controlling the false discovery rate. Genes with an adjusted P-value (<0.05) found by DESeq2 were assigned as differentially expressed.
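
      For readers unfamiliar with the two calculations named in this response, the sketch below shows the standard FPKM formula and the Benjamini-Hochberg adjustment in plain Python. It is an illustration under stated assumptions, not the code used in the study (which relied on FeatureCounts and DESeq2); the function names are hypothetical.

```python
def fpkm(read_count, gene_length_bp, total_mapped_reads):
    """Fragments per kilobase of transcript per million mapped reads."""
    return read_count * 1e9 / (gene_length_bp * total_mapped_reads)


def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value to the smallest, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


def significant(pvalues, alpha=0.05):
    """Indices of tests whose adjusted p-value is below alpha."""
    return [i for i, q in enumerate(benjamini_hochberg(pvalues)) if q < alpha]
```

      Genes would be called differentially expressed when their adjusted p-value falls below 0.05, as stated in the response.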

      2. In Figure 3B and C, it appears that the knockdown efficiency of ACSL4 is inadequate in these cells, which contradicts the Western blot results presented in Figure 2F.

      Response: We thank the reviewer for the concern. In Figures 3B and 3C we used shRNA for the knockdown experiment, whereas in Figure 2F we used siRNA, so the knockdown efficiencies differed.

      3. Regarding Figure 6, the discovery of consensus binding sequences (CACCT) for ZEB2 alone is insufficient evidence to support the direct binding of ZEB2 to the ACSL4 promoter.

      Response: We thank the reviewer for the concern. We performed chromatin immunoprecipitation (ChIP), which examines the direct interaction between DNA and protein, to test whether ZEB2 directly binds to the ACSL4 promoter. The results showed that primer set 1, which covers -184 to -295 of the ACSL4 promoter region, exhibited apparent ZEB2 binding (Fig. 6F). Moreover, the mutant sequence (AAAA) of the ACSL4 promoter showed significantly decreased luciferase activity (Fig. 7H). Together, this evidence suggests that ZEB2 binds directly to the consensus sequence in the ACSL4 promoter.
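
      As the reviewer notes, finding the CACCT consensus in a promoter is only a first screening step. A minimal sketch of such a scan on both strands is given below; the function name and the example sequence are hypothetical, and a hit by itself says nothing about binding, which is what the ChIP and luciferase experiments address.

```python
def find_motif(sequence, motif="CACCT"):
    """Return (offset, strand) for every motif match on either strand.

    Offsets are 0-based positions in the given sequence; "-" marks a match
    to the reverse complement of the motif.
    """
    complement = str.maketrans("ACGT", "TGCA")
    reverse_complement = motif.translate(complement)[::-1]
    seq = sequence.upper()
    hits = []
    for i in range(len(seq) - len(motif) + 1):
        window = seq[i:i + len(motif)]
        if window == motif:
            hits.append((i, "+"))
        elif window == reverse_complement:
            hits.append((i, "-"))
    return hits
```

      Offsets can be converted to promoter coordinates by subtracting the position of the transcription start site within the scanned sequence.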

      4. For Figure 7E, there are multiple bands present, and it appears that ZEB2-HA has been cropped, which should ideally be presented with unaltered raw data. Please provide the uncropped raw data.

      Response: We thank the reviewer for the concern. The raw data of the figure 7E ZEB2-HA is shown in Author response image 1:

      Author response image 1.

      5. In Figure 7C, the author claimed to have used 293T cells for the ubiquitin assay, which are not breast cancer cells. Moreover, the efficiency of over-expression differs between ZEB2 and ACSL4 in 293T cell lines. Performing the experiment in an unrelated cell line to justify an important interaction is not acceptable.

      Response: We thank the reviewer for the concern. We also performed the ubiquitination assay in MDA-MB-231 cells in Fig 7D (Author response image 2). The results confirm that knockdown of ACSL4 markedly enhanced the ubiquitination of ZEB2. We also performed the IP experiment in MDA-MB-231 cells in Author response image 3 (Fig 7F). The results confirmed the interaction between ZEB2 and ACSL4:

      Author response image 2.

      Author response image 3.

      Reviewer #2 (Public Review):

      In this study, the authors validated a positive feedback loop between ZEB2 and ACSL4 in breast cancer, which regulates lipid metabolism to promote metastasis.

      Overall, the study is original, well structured, and easy to read.

      We appreciate the positive comments from the reviewer.

      Reviewer #3 (Public Review):

      The manuscript by Lin et al. reveals a novel positive regulatory loop between ZEB2 and ACSL4, which promotes lipid droplets storage to meet the energy needs of breast cancer metastasis.

      We appreciate the positive comments from the reviewer.

      Reviewer #2 (Recommendations For The Authors):

      I still have some points that should be addressed by the Authors:

      The interaction between ACSL4 and ZEB2 is still not convincing, because the cellular localizations of ACSL4 and ZEB2 differ. The authors should consider using the Duolink experiment to determine more accurately where these two proteins interact in cells.

      Response: We appreciate the reviewer’s suggestion. We performed a GST pull-down assay to examine whether ZEB2 and ACSL4 form a complex, and the assay confirmed the interaction between ZEB2 and ACSL4 (Supplementary Fig. S10). We also performed an immunofluorescence assay and found that ZEB2 co-localized with ACSL4 in certain regions of the cytoplasm (Author response image 5; Supplementary Fig. S11):

      Author response image 4.

      Author response image 5.

      In Figure S4, the authors showed both "shACSL4" and "siACSL4", which is a description error.

      Response: We thank the reviewer for pointing out the mistake. We have corrected "siACSL4" to "shACSL4".

      Author response image 6.

      Reviewer #3 (Recommendations For The Authors):

      The manuscript is improved.

      We appreciate the positive comments from the reviewer.

    1. Author response:

      The following is the authors’ response to the previous reviews.

      Responses to Reviewer #1:

      Reviewer #1: The study shows a new mechanism of NFkB-p65 regulation mediated by Vangl2-dependent autophagic targeting. Autophagic regulation of p65 has been reported earlier; this study brings an additional set of molecular players involved in this important regulatory event, which may have implications for chronic and acute inflammatory conditions.

      Comments on the revised version:

      The authors have addressed the earlier concerns and I am satisfied with the revised version. I have no additional comments to make.

      We appreciate the reviewer’s comments on our revised manuscript.

      Responses to Reviewer #2:

      Reviewer #2: Vangl2 is a core planar cell polarity protein involved in Wnt/PCP signaling, cell proliferation, differentiation, homeostasis, and cell migration. Vangl2 malfunctioning has been linked to various human ailments, including autoimmune and neoplastic disorders. Interestingly, it was shown that Vangl2 interacts with the autophagy regulator p62, and autophagic degradation limits the activity of inflammatory mediators, such as p65/NF-κB. However, the possible role of Vangl2 in inflammation has not been investigated. In this manuscript, Lu et al. describe that Vangl2 expression is upregulated in human sepsis-associated PBMCs and that Vangl2 mitigates experimental sepsis in mice by negatively regulating p65/NF-κB signaling in myeloid cells. Their mechanistic studies further revealed that Vangl2 recruits the E3 ubiquitin ligase PDLIM2 to promote K63-linked poly-ubiquitination of p65. Vangl2 also facilitated the recognition of ubiquitinated p65 by the cargo receptor NDP52. These molecular processes caused selective autophagic degradation of p65. Indeed, abrogation of PDLIM2 or NDP52 functions rescued p65 from autophagic degradation, leading to extended p65/NF-κB activity in myeloid cells. Overall, the manuscript presents convincing evidence for novel Vangl2-mediated control of inflammatory p65/NF-kB activity. The proposed pathway may expand interventional opportunities restraining aberrant p65/NF-kB activity in human ailments.

      IKK is known to mediate p65 phosphorylation, which instructs NF-kB transcriptional activity. In this manuscript, Vangl2 deficiency led to an increased accumulation of phosphorylated p65 and IKK also at 30 minutes post-LPS stimulation; however, autophagic degradation of p-p65 may not have been initiated at this early time point. Therefore, this set of data put forward the exciting possibility that Vangl2 could also be regulating the immediate early phase of inflammatory response involving the IKK-p65 axis - a proposition that may be tested in future studies.

      We appreciate the reviewer’s comments on our manuscript, and we have added a discussion of the IKK-p65 axis to the revised version (Page 15, lines 467-474).

      Responses to Reviewer #3:

      Reviewer #3: Lu et al. describe Vangl2 as a negative regulator of inflammation in myeloid cells. The primary mechanism appears to be through binding p65 and promoting its degradation, albeit in an unusual autolysosome/autophagy dependent manner. Overall, these findings are novel, valuable and the crosstalk of PCP pathway protein Vangl2 with NF-kappaB is of interest. While generally solid, some concerns still remain about the rigor and conclusions drawn.

      Comments on the revised version:

      (1) Lu et al. address my comments through responses and new experimental data. However, some of the explanations provided are inadequate.

      However, in response to my enquiry regarding directly exploring PCP effects, the authors simply assert "Our study revealed that Vangl2 recruits the E3 ubiquitin ligase PDLIM2 to facilitate K63-linked ubiquitination of p65, which is subsequently recognized by autophagy receptor NDP52 and then promotes the autophagic degradation of p65. Our findings by using autophagy inhibitors and autophagic-deficient cells indicate that Vangl2 regulates NFkB signaling through a selective autophagic pathway, rather than affecting the PCP pathway, WNT, HH/GLI, Fat-Dachsous or even mechanical tension."

      I do not agree that the use of autophagy inhibitors and autophagy-deficient cells can rule out the contributions of PCP or any other pathways. Only experimentally inhibiting the pathway(s) with adequate demonstration of target inhibition/abolition of well-known effector function and documenting unaltered p65 regulation under these conditions can be considered proof. Autophagy inhibitors and autophagy-deficient cells only prove that this particular pathway is necessary. Nonetheless, I do not want to dwell on proving a negative and agree that Vangl2 is a novel regulator of p65 through its role in promoting p65 degradation. The inclusion of a statement discussing the limitations of their approach would have sufficed. The response from the authors could have been better.

      We thank the reviewer for helping us improve the quality of the manuscript. We provided new data and revised the Discussion as suggested.

      To ascertain whether Vangl2 degrades p65 through a selective autophagic pathway or the PCP pathway, 293T cells were transfected with p65, with or without the Vangl2 plasmids, and treated with different pharmacological inhibitors. We found that the degradation of p65 induced by Vangl2 was blocked by the autolysosome inhibitor (CQ), but not by the JNK inhibitor (SP600125) or the Wnt/β-catenin inhibitor (FH535) (New Figure 1). These data suggest that Vangl2 primarily degrades p65 through a selective autophagic pathway, rather than through the JNK or Wnt signaling pathway. Nevertheless, inhibition of additional pathways, such as the HH/GLI and Fat-Dachsous pathways, should also be employed to further elucidate the function of Vangl2 in p65 degradation. As suggested, we have added a statement about the limitation of the approach in the discussion (Page 12, lines 378-385).

      Author response image 1.

      Vangl2 degrades p65 through a selective autophagic pathway, but not by the PCP pathway. HEK293T cells were transfected with Flag-p65 and HA-Vangl2 plasmids, and treated with DMSO, CQ (50 mM) for 6 h, SP600125 (20 mM) for 1 h or FH535 (30 mM) for 6 h. The cell lysates were analyzed by immunoblot.

      (2) I am also not satisfied with the explanation that "immune cells represent a minor fraction of the lungs and liver". There are lots of resident immune cells in the lungs and liver (alveolar macrophages in the lung and Kupffer cells in the liver). For example, it may be so that Vangl2 is important in monocytes and not in the resident population. This might be a potential explanation. But this is not explored. The restricted tissue-specificity of the interaction between two ubiquitously present proteins is still a challenge to understand. The response from the authors is not satisfactory. There is plenty of Vangl2 in the liver in their western blot.

      We thank the reviewer for this question. We added this explanation in the Discussion. (Page 13, lines 398-404)

      (3) I had also simply pointed out PMID: 34214490 with reference to the findings described in the manuscript. There were no suggestions of contradiction. In fact, I would refer to the publication in discussion to support the findings and stress the novelty. The response from the authors could have been better.

      We thank the reviewer for the insightful comments. We have modified this discussion as suggested. (Page 13, lines 410-415; Page 14, lines 419-421)

      (4) The response to my enquiry regarding homo- or heterozygosity is unsupported by any reference or data.

      As suggested, we provide data showing that only homozygous Vangl2-deficient mice showed inhibition of the activation of NF-kB (New Figure 2).

      Author response image 2.

      Vangl2 deficiency promotes NF-kB activation. (A) The survival rates of WT, Vangl2ΔM/ΔM and Vangl2ΔM/WT mice treated with a high dose of LPS (30 mg/kg, i.p.) (n≥4). (B) IL-6 and TNF-α secretion by WT and Vangl2-deficient BMDMs treated with LPS for 6 h was measured by ELISA. IL-1β secretion by WT, Vangl2ΔM/ΔM and Vangl2ΔM/WT BMDMs treated with LPS for 6 h and ATP for 30 min was measured by ELISA.

      (5) The listing of 8 patients and healthy controls are also appreciated. The body temperature of #6 doesn't fall in the <36 or >38 degree C SIRS criteria. The inclusion of CRP, PCT, heart rate and respiratory rate, and other lab values would have further improved the inclusion criteria. Moreover, it is difficult to understand why there are 16 value points for healthy and sepsis cohorts in Fig 1 when there are 8 patients.

      We thank the reviewer for this valuable suggestion. We apologize for mistakenly entering data from two repeated experiments in Figure 1A; we have corrected these data in the updated version (Figure 1A; Page 12, line 146). As suggested, we have added CRP, WBC and heart rate to the sepsis patients’ information (Supplementary Materials and Methods).

      Recommendations for the authors:

      Reviewer #2 (Recommendations For The Authors):

      The proposition that Vangl2 may target additional mediators of inflammation could be indicated in the text.

      We thank the reviewer for this valuable suggestion. We have added this discussion to the revised version. (Page 15, lines 467-474)

      Reviewer #3 (Recommendations For The Authors):

      It is advised that some of the deficiencies pointed out by Reviewer #3 are textually addressed. Additionally, there could be some inconsistency in the number of healthy controls and patients (see Fig S1A, Fig 1A and the Supplementary table; also see comments from Reviewer #3) - this should be carefully scrutinised and revised, if necessary.

      We thank the reviewer for this valuable suggestion. We apologize for mistakenly entering data from two repeated experiments in Figure 1A; we have corrected these data in the updated version (Figure 1A; Page 12, line 146).

    1. Author response:

      The following is the authors’ response to the previous reviews.

      Reviewer #1 (Public review):

      Summary:

      Audio et al. measured cerebral blood volume (CBV) across cortical areas and layers using high-resolution MRI with contrast agents in non-human primates. While the non-invasive CBV MRI methodology is often used to enhance fMRI sensitivity in NHPs, its application for baseline CBV measurement is rare due to the complexities of susceptibility contrast mechanisms. The authors determined the number of large vessels and the areal and laminar variations of CBV in NHP, and compared those with various other metrics.

      Strengths:

      Noninvasive mapping of relative cerebral blood volume is novel for non-human primates. A key finding was the observation of variations in CBV across regions; primary sensory cortices had high CBV, whereas other higher areas had low CBV. The measured CBV values correlated with previously reported neuronal and receptor densities.

      We appreciate your recognition of the novelty of our non-invasive relative cerebral blood volume (CBV) mapping in non-human primates, as well as the observed areal variations and their correlations with neuronal and receptor densities. However, we are concerned that key contributions of our work—such as cortical layer-specific vasculature mapping and benchmarking surface vessel density estimations against anatomical ground truth—are being framed as limitations rather than significant advances in the field pushing the boundaries of current neuroimaging capabilities and providing a valuable foundation for future research. Additionally, we would like to clarify that dynamic susceptibility contrast (DSC) MRI using gadolinium is the gold standard for CBV measurement in clinical settings and the argument that “baseline CBV measurements are rare due to the complexities of susceptibility contrast” is simply not true. The limited use of ferumoxytol for CBV imaging is primarily due to previous FDA regulatory restrictions, rather than inherent methodological shortcomings.

      Changes in text:

      Compared to clinically used gadolinium-based agents, ferumoxytol's substantially longer half-life and stronger R<sub>2</sub>* effect allow for higher-resolution and more sensitive vascular volume measurements (Buch et al., 2022), although these methodologies are hampered by confounding factors such as vessel orientation relative to the magnetic field (B<sub>0</sub>) direction (Ogawa et al., 1993).

      Weaknesses:

      A weakness of this manuscript is that the quantification of CBV with postprocessing approaches to remove susceptibility effects from pial and penetrating vessels is not fully validated, especially on a laminar scale. Further specific comments follow.

      (1) Baseline CBV indices were determined using contrast agent-enhanced MRI (deltaR<sub>2</sub>*). Although this approach is suitable for areal comparisons, its application at a laminar scale poses challenges due to significant contributions from large vessels including pial vessels. The primary concern is whether large-vessel contributions can be removed from the measured deltaR<sub>2</sub>* through processing techniques.

      Eliminating the contribution of large vessels completely is unlikely, and we agree with the reviewer that ΔR<sub>2</sub>* results likely reflect a weighted combination of signals from both large vessels and capillaries. However, the distribution of ΔR<sub>2</sub>* more closely aligns with capillary density in areas V1–V5 than with large vessel distributions (Weber et al., 2008), suggesting that our ΔR<sub>2</sub>* results are more weighted toward capillaries. Moreover, we demonstrated that the pial vessel induced signal-intensity drop-outs are clearly limited to the superficial layers and exhibit smaller spatial extent than generally thought (Supp. Figs. 2 and 4).

      (2) High-resolution MRI with a critical sampling frequency estimated from previous studies (Weber 2008, Zheng 1991) was performed to separate penetrating vessels. However, this approach is still insufficient to accurately identify the number of vessels due to the blooming effects of susceptibility and insufficient spatial resolution. The reported number of penetrating vessels is only applicable to the experimental and processing conditions used in this study, which cannot be generalized.

      Our intention was not to suggest that our measurements provide a general estimate of vessel density across the macaque cerebral cortex. At 0.23 mm isotropic resolution, we successfully delineated approximately 30% of the penetrating vessels in V1. Our primary objective was to demonstrate a proof-of-concept quantifiable measurement rather than to establish a generalized vessel density metric for all brain regions. We have consistently emphasized this throughout the manuscript, but if there is a specific point of misunderstanding, we would be happy to consider revisions for clarity.

      (3) Baseline R<sub>2</sub>* is sensitive to baseline R<sub>2</sub>, vascular volume, iron content, and susceptibility gradients. Additionally, it is sensitive to imaging parameters; higher spatial resolution tends to result in lower R<sub>2</sub>* values (closer to the R<sub>2</sub> value). Thus, it is difficult to correlate baseline R<sub>2</sub>* with physiological parameters.

      The observed correlation between R<sub>2</sub>* and neuron density is likely indirect, as R<sub>2</sub>* is strongly influenced by iron, myelin, and deoxyhemoglobin densities. However, the robust correlation between R<sub>2</sub>* and neuron density, peaking in the superficial layers (R = 0.86, p < 10<sup>-10</sup>), is striking and difficult to ignore (revised Supp. Fig. 6D-E). Upon revision, we identified an error in Supp. Fig. 6D-E, where the previous version used single-subject R<sub>2</sub>* and ΔR<sub>2</sub>* maps instead of the group-averaged maps. The revised correlations are slightly stronger than in the earlier version.

      Given that the correlation between neuron density and R<sub>2</sub>* is strongest in the superficial layers, we suggest this relationship reflects an underlying association with tissue cytochrome oxidase (CO) activity and cumulative effect of deoxygenated venous blood drainage toward the pial network. The superficial cortical layers are also less influenced by myelin and iron densities, which are more concentrated in the deeper cortical layers. Additional factors may contribute to this relationship, including the iron dependence of mitochondrial CO activity, as iron is an essential component of CO’s heme groups. Moreover, myelin maintenance depends on iron, which is predominantly stored in oligodendrocytes. The presence of myelinated thin axons and a higher axonal surface density may, in turn, be a prerequisite for high neuron density.

      In this context, it is also valuable to note the absolute range of superficial R<sub>2</sub>* values (≈ 6 s<sup>-1</sup>; Supp. Fig. 6D). This variation in cortical surface R<sub>2</sub>* is about 12-30 times larger than the signal changes observed during task-based fMRI (6 vs. 0.2-0.5 s<sup>-1</sup>). This relation seems reasonable because regional increases in absolute blood flow associated with imaging signals, as measured by PET, typically do not exceed 5%–10% of the brain's resting blood flow (Raichle and Mintun, 2006). The venous oxygenation level is typically 60%, with task-induced activation increasing it by only a few percent. We suggest that this ~40% oxygen extraction is reflected in the superficial R<sub>2</sub>*. Finally, the large intercept (≈ 14.5 1/s; Supp. Fig. 6D), which is not equivalent to the water R<sub>2</sub>* (≈ 1 1/s), suggests that R<sub>2</sub>* is influenced by substantial non-neuron density factors, such as receptor, myelin, iron, susceptibility gradients and spatial resolution.
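      As a quick arithmetic check of the fold difference quoted above, the following minimal sketch divides the superficial R<sub>2</sub>* range by the lower and upper task-evoked changes. The variable and function names are illustrative assumptions, not taken from the manuscript; only the numeric values come from the text.

```python
# Values quoted in the response; names are illustrative, not from the manuscript.
superficial_r2s_range = 6.0      # s^-1, range of superficial R2* across cortex
task_delta_r2s = (0.2, 0.5)      # s^-1, task-evoked R2* change quoted for fMRI


def fold_range(baseline_range, task_changes):
    """Return the (smallest, largest) ratio of baseline range to task change."""
    ratios = [baseline_range / change for change in task_changes]
    return min(ratios), max(ratios)


low, high = fold_range(superficial_r2s_range, task_delta_r2s)
print(f"Superficial R2* range is {low:.0f}x to {high:.0f}x the task-evoked change")
```

      The smaller ratio corresponds to the larger task-evoked change, which is why the bounds are returned as a sorted pair rather than in input order.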

      The R<sub>2</sub>* values are well known to be influenced by intra-voxel phase coherence and thus spatial resolution. However, our view is that the proposed methodology of acquiring cortical-layer thickness adjusted high-resolution (spin-echo) R<sub>2</sub> maps poses more methodological limitations and is less practical. Notwithstanding, to further corroborate the relationship between R<sub>2</sub>* and neuron density, we investigated whether a similar correlation exists between signal intensity in non-quantitative T2w SPACE-FLAIR images (0.32 mm isotropic) and neuron density. Using B<sub>1</sub> bias-field and B<sub>0</sub> orientation bias corrected T2w SPACE-FLAIR images (N=7), we parcellated the equivolumetric surface maps using Vanderbilt sections. Our findings showed that signal intensity (where regions with high signal intensity correspond to low R<sub>2</sub> values, and areas with low signal intensity correspond to high R<sub>2</sub> values) was positively correlated with neuron density, particularly in the superficial layers (R = 0.77, p = 10<sup>-11</sup>; Author response image 1). This analysis confirmed that the correlation between neuron density and R<sub>2</sub> peaks in the superficial layers. However, this correlation was slightly weaker compared to quantitative R<sub>2</sub>* (Supp. Fig. 6D), suggesting the variable flip-angle spin-echo train refocused signal-phase coherence loss from large draining vessels or that non-quantitative T2w-FLAIR images may be confounded by other factors such as B<sub>1</sub> transmission field biases (Glasser et al., 2022). Notwithstanding, this non-quantitative fast spin-echo with variable flip-angles approach, which is in principle less dependent on image resolution and closer to R<sub>2,intrinsic</sub> than R<sub>2</sub>*, yields similar findings in comparison to quantitative gradient-echo.

      Author response image 1.

      (A) T2w-FLAIR SPACE normalized signal-intensity plotted vs neuron density. Note that low signal-intensity corresponds to high R<sub>2</sub> and high neuron density, consistent with findings using ME-GRE. (B) Correlation between T2w-FLAIR SPACE and neuron density across equivolumetric layers. Notably, a similar relationship with neuron density was observed using a variable spin-echo pulse sequence as with quantitative gradient-echo-based imaging.

      Changes in text:

      Results:

      “Because the Julich cortical area atlas covers only a section of the cerebral cortex, and the neuron density estimates are interpolated maps, we extended our analysis using the original Collins sample borders encompassing the entire cerebral cortex (Supp. Fig. 6A-C). This analysis reaffirmed the positive correlation with ΔR<sub>2</sub>* (peak at EL2, R = 0.80, p < 10<sup>-11</sup>) and baseline R<sub>2</sub>* (peak at EL2a, R = 0.86, p < 10<sup>-13</sup>), yielding linear coefficients of ΔR<sub>2</sub>* = 102 × 10<sup>3</sup> neurons/s and R<sub>2</sub>* = 41 × 10<sup>3</sup> neurons/s (Supp. Fig. 6D-G). This suggests that the sensitivity of quantitative layer R<sub>2</sub>* MRI in detecting neuronal loss is relatively weak, and the introduction of the Ferumoxytol contrast agent has the potential to enhance this sensitivity by a factor of 2.5.”
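      The factor of 2.5 quoted above follows from the ratio of the two linear coefficients. A minimal sketch of that arithmetic, using the coefficients exactly as quoted and hypothetical variable names:

```python
# Linear coefficients as quoted in the revised Results text (units as given there).
coef_delta_r2s = 102e3   # coefficient reported for ferumoxytol-weighted deltaR2*
coef_baseline_r2s = 41e3  # coefficient reported for baseline R2*

# Ratio of the two coefficients, i.e. the quoted sensitivity enhancement factor.
sensitivity_gain = coef_delta_r2s / coef_baseline_r2s
print(f"Sensitivity gain with contrast agent: {sensitivity_gain:.2f}")
```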

      A new paragraph was added into discussion section 4.3 corroborating the relation between R<sub>2</sub>* and neuron density:

      “Another key finding of this study was the strong correlation between baseline R<sub>2</sub>* and neuron density (Supp. Fig. 6D, E). While R<sub>2</sub>* is well known to be influenced by iron, myelin, and deoxyhemoglobin densities, this correlation peaks in the superficial layers (Supp. Fig. 6E), suggesting a link to CO activity and the accumulation of deoxygenated venous blood draining from all cortical layers toward the pial network. Notably, the absolute range of superficial R<sub>2</sub>* values (max - min ≈ 6 s<sup>-1</sup>; Supp. Fig. 6D) is approximately 12-30 times larger than the ΔR<sub>2</sub>* observed during task-based BOLD fMRI at 3T (0.2-0.5 1/s) (Yablonskiy and Haacke 1994). Since venous oxygenation is around 60% and task-induced changes in blood flow account for only 5%–10% of the brain's resting blood flow (Raichle & Mintun, 2006), these results suggest that superficial R<sub>2</sub>* (Fig. 1D) may serve as a more accurate proxy for total deoxyhemoglobin content (and thus total oxygen consumption), which scales with the neuron density of the underlying cortical gray matter. Importantly, superficial layers may also provide a more specific measure of deoxyhemoglobin, as they are less influenced by myelin and iron, which are more concentrated in deeper cortical layers. Additionally, smaller but direct contributors, such as mitochondrial CO density—an iron-dependent factor—may also play a role in this relationship.”

      References:

      Raichle, M.E., Mintun, M.A., 2006. BRAIN WORK AND BRAIN IMAGING. Annu. Rev. Neurosci. 29, 449–476. https://doi.org/10.1146/annurev.neuro.29.051605.112819

      (4) CBV-weighted deltaR<sub>2</sub>* is correlated with various other metrics (cytoarchitectural parcellation, myelin/receptor density, cortical thickness, CO, cell-type specificity, etc.). While testing the correlation between deltaR<sub>2</sub>* and these other metrics may be acceptable as an exploratory analysis, it is challenging for readers to discern a causal relationship between them. A critical question is whether CBV-weighted deltaR<sub>2</sub>* can provide insights into other metrics in diseased or abnormal brain states.

      We acknowledge that having multivariate analysis using dense histological maps would be valuable to establish causality among these several metrics:

      “To comprehensively understand the factors contributing to the vascular organization of the brain, experimental disentanglement through multivariate analysis of laminar cell types and receptor densities is needed (Hayashi et al., 2021, Froudist-Walsh et al., 2023). Moreover, employing more advanced statistical modeling, including considerations for synapse-neuron interactions, may be important for refined evaluations.”

      We think the primary contributors to the brain's energy budget are neurons and receptors, as shown in several references and stated in the manuscript. To investigate relationship between neuron density and CBV, we estimated the energy budget allocated to neurons and extrapolated the remaining CBV to other contributing factors:

      Changes in text:

      “However, this is a simplified estimation, and a more comprehensive assessment would need to account for an aggregate of biophysical factors such as neuron types, neuron membrane surface area, firing rates, dendritic and synaptic densities (Fig. 6F-G), neurotransmitter recycling, and other cell types (Kageyama 1982; Elston and Rose 1997; Perge et al., 2009; Harris et al., 2012). Indeed, the majority of the mitochondria reside in the dendrites and synaptic transmission is widely acknowledged to drive the majority of the energy consumption and blood flow (Wong-Riley, 1989; Attwell et al., 2001).

      Extrapolating cortical ΔR<sub>2</sub>* to zero neuron density results in a large intercept (~35 1/s), corresponding to 60% of the maximum cortical CBV (57 1/s; Supp. Fig. 6F). This supports the view that the majority of energy consumption occurs in the neuropil—comprising dendrites, synapses, and axons—which accounts for ~80–90% of cortical gray matter volume, whereas neuronal somata constitute only ~10–20% (Wong-Riley, 1989). Although neuronal cell bodies exhibit higher CO activity per unit volume due to their dense mitochondrial content, these results suggest their overall contribution to the total CBV per mm<sup>3</sup> tissue remains lower than that of the neuropil, given the latter's substantially larger volume fraction in cortical tissue.

      Contrary to our initial expectations, we observed a relatively smaller CBV in regions and layers with high receptor density (Fig. 6B, D, F). This relationship extends to other factors, such as number of spines (putative excitatory inputs) and dendrite tree size across the entire cerebral cortex (Supp. Fig. 7) (Froudist-Walsh et al., 2023, Elston 2007). These results align with the work of Weber and colleagues, who reported a similar negative correlation between vascular length density and synaptic density, as well as a positive correlation with neuron density in macaque V1 across cortical layers (Weber et al., 2008).”
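      The share of CBV attributed to non-somatic factors in the revised text above can be checked by dividing the zero-neuron-density intercept by the maximum cortical value. This is only an arithmetic sketch under the stated values; the variable names are assumptions made for illustration.

```python
# Values quoted in the revised Discussion text (Supp. Fig. 6F); names are illustrative.
intercept_delta_r2s = 35.0   # 1/s, deltaR2* extrapolated to zero neuron density
max_cortical_delta_r2s = 57.0  # 1/s, maximum cortical deltaR2* (CBV proxy)

# Fraction of the maximum CBV proxy that remains at zero neuron density.
intercept_fraction = intercept_delta_r2s / max_cortical_delta_r2s
print(f"Intercept is {intercept_fraction:.0%} of the maximum cortical value")
```

      The result is slightly above the rounded 60% given in the text, consistent with the approximate intercept quoted there.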

      Variations in neurons and receptors are reflected in cytoarchitecture, myelin (axon density likely scales with neuron density and myelin inhibits synaptic connections), and cell-type composition. For example, fast-spiking parvalbumin interneurons, which target the soma or axon hillock, are well-suited for regulating activity in regions with high neuron density, whereas bursting calretinin interneurons, which target distal dendrites, are more adapted to areas with high synaptic density. These factors in turn, gradually change along the cortical hierarchy level (higher levels have thinner cortical layer IV, more complex dendrite trees and more numerous inter-areal connectivity patterns). In our view, these factors are tightly interlinked and explain the strong correlations and metabolic demands observed across different metrics.

      We also agree that cortical layer imaging of vasculature in diseased or abnormal brain states is an intriguing direction for future research; however, it falls beyond the scope of the present study.

      Reviewer #2 (Public review):

      Summary:

      This manuscript presents a new approach for non-invasive, MRI-based measurements of cerebral blood volume (CBV). Here, the authors use ferumoxytol, a high-contrast agent, and apply specific sequences to infer CBV. The authors then move to statistically compare measured regional CBV with the known distribution of different types of neurons, markers of metabolic load and others. While the presented methodology captures an estimated 30% of the vasculature, the authors corroborated previous findings regarding lack of vascular compartmentalization around functional neuronal units in the primary visual cortex.

      Strengths:

      Non invasive methodology geared to map vascular properties in vivo.

      Implementation of a highly sensitive approach for measuring blood volume.

      Ability to map vascular structural and functional vascular metrics to other types of published data.

      Weaknesses:

      The key issue here is the underlying assumption about the appropriate spatial sampling frequency needed to capture the architecture of the brain vasculature. Namely, ~7 penetrating vessels / mm2 as derived from Weber et al 2008 (Cer Cor). The cited work begins by characterizing the spacing of penetrating arteries and ascending veins using vascular casts of 7 monkeys (Macaca mulatta, same as in the current paper). The ~7 penetrating vessels / mm2 is computed by dividing the total number of identified vessels by the area imaged. The problem here is that all measurements were made in a "non-volumetric" manner and only in V1. Extrapolating from here to the entire brain seems like an over-assumption, particularly given the region-dependent heterogeneity that the current paper reports.

      We appreciate the reviewer’s concerns regarding spatial sampling frequency and its implications for characterizing brain vasculature, which we investigated in this study. To clarify, our analysis of surface vessel density was explicitly restricted to V1 precisely due to the limitations of our experimental precision. While we reported the total number of vessels identified in the cortex, we intentionally chose not to present density values across regions in this manuscript. Although these calculations are feasible, we focused on the data directly analyzed and avoided extrapolating density values beyond the scope of our findings. Thus, we are uncertain about the suggestion that we extrapolated vessel density values across the entire brain, as we have taken care to limit our conclusions of our vessel density precision to V1.

      Regarding methodology, we conducted two independent analyses of vessel density specifically in V1. The first involved volumetric analysis using the Frangi filter, while the second used surface-based analysis of local signal-intensity gradients (as illustrated in Fig. 2E and Supp. Figs. 3 and 4), although the final surface density analysis was performed using the ultra-high resolution equivolumetric layers. Notably, these two approaches produced consistent and comparable vessel density estimates, supporting the reliability of our findings within the scope of V1 (we found 30% of the vessels relative to the ground-truth).

      Comments on revisions:

      I appreciate the effort made to improve the manuscript. That said, the direct validation of the underlying assumption about spatial resolution sampling remains unaddressed in the final version of this manuscript. With the only intention to further strengthen the methodology presented here, I would encourage again the authors to seek a direct validation of this assumption for other brain areas.

      In their reply, the authors stated "... line scanning or single-plane sequences, at least on first impression, seem inadequate for whole-brain coverage and cortical surface mapping. ". This seems to emanate from a misunderstanding, as the method could be used to validate the mapping, not to map per se.

      We apologize for any misunderstanding in our previous response and appreciate your clarification. We now understand that you were suggesting the use of line-scanning or single-plane sequences as a method to validate, rather than map, our spatial sampling assumptions.

      We agree that single-plane sequences at very high in-plane resolution (e.g., 50 × 50 × 1000 µm) have great potential to detect penetrating vessels and even vessel branching patterns. These techniques could indeed provide valuable insights into region-specific vessel density variations which could then be used to validate whole brain 3D acquisitions. However, as noted above, we have refrained from reporting vessel densities outside V1 precisely due to sampling limitations (we only found 30% of the penetrating vessels in V1, or only 2 mm<sup>2</sup>/30mm<sup>2</sup> ≈ 7% of branching vessel ground-truth, see discussion).
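
      The sampling figures quoted above can be checked with a short calculation. This is a sketch only: it assumes the ratio in the text is read as roughly 2 detected vessels per mm<sup>2</sup> against 30 vessels per mm<sup>2</sup>, and the variable names are illustrative rather than taken from the manuscript.

```python
# Reference densities quoted in the text (vessels per mm^2).
penetrating_density = 7.0   # ~7 penetrating vessels/mm^2 (Weber et al., 2008)
branching_density = 30.0    # up to 30 vessels/mm^2 when branching is counted
detected_fraction = 0.30    # share of penetrating vessels detected in V1

# Assumption: detected density = detected share x penetrating-vessel density.
detected_density = detected_fraction * penetrating_density

# Detected density relative to the branching-vessel ground truth.
fraction_of_branching = detected_density / branching_density

print(f"detected density: {detected_density:.1f} vessels/mm^2")
print(f"share of branching ground truth: {fraction_of_branching:.0%}")
```

      Under these assumptions the detected density is about 2 vessels per mm<sup>2</sup>, which is about 7% of the branching-vessel figure, in line with the percentages given in the response.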

      We acknowledge the merit of incorporating such methods to validate regional vessel densities and agree that this would be an important avenue for future research. Thank you for raising this point; we now briefly mention the advantage of single-plane EPI in the discussion.

      Changes in text:

      “4.1 Methodological considerations - vessel density informed MRI

      …anatomical studies accounting for branching patterns have reported much higher vessel densities up to 30 vessels/mm<sup>2</sup> (Keller et al., 2011; Adams et al., 2015). Further investigations are warranted, taking into account critical sampling frequencies associated with vessel branching patterns (Duverney 1981), and achieving higher SNR through ultra-high B<sub>0</sub> MRI (Bolan et al., 2006; Harel et al., 2010; Kim et al., 2013) and utilize high-resolution single-plane sequences and prospective motion correction schemes to accurately characterize regional vessel densities. Such advancements hold promise for improving vessel quantification, classifications for veins and arteries and constructing detailed cortical surface maps of the vascular networks which may have diagnostic and neurosurgical utilities (Fig. 2A, B) (Iadecola, 2013; Qi and Roper, 2021; Sweeney et al., 2018).”

      During the revision we found a typo and corrected it in Supp. Fig. 8: Dosal -> Dorsal.

    1. Author response:

      The following is the authors’ response to the original reviews.

      In light of some reviewer comments requesting more clarity on the relationship between our model and prior theoretical studies of systems consolidation, we propose a modification to the title of our manuscript: “Selective consolidation of learning and memory via recall-gated plasticity.” We believe this title better reflects the key distinguishing feature of our model, that it selectively consolidates only a subset of memories, and also highlights the model’s applicability to task learning as well as memory storage.

      Major comments:

      Reviewer #3’s primary concern with the paper is the following: “The main weakness of the paper is the equation of recall strength with the synaptic changes brought about by the presentation of a stimulus. In most models of learning, synaptic changes are driven by an error signal and hence cease once the task has been learned. The suggested consolidation mechanism would stop at that point, although recall is still fine. The authors should discuss other notions of recall strength that would allow memory consolidation to continue after the initial learning phase.”

      We thank the reviewer for drawing attention to this issue, which primarily results from a poor choice of notation on our part. Our decision to denote memories as ∆𝑤 gives the impression that memories should be interpreted as actual synaptic weight updates, and thus in the context of supervised learning would go to zero when the task is learned. However, in the formalism of our model, memories are in fact better interpreted as target values of synaptic weights, and the synaptic model/plasticity rule is responsible for converting these target values into synaptic weight updates. We were unclear on this point in our initial submission, because our paper primarily considers binary synaptic weights, where target synaptic weights have a one-to-one correspondence with candidate synaptic weight updates. We have updated the paper to use w* to refer to memories, which we hope resolves this confusion, and have updated our introduction to the term “memory” to reflect their interpretation as target synaptic weight values. We have also updated the paper’s language to more clearly disambiguate between the “learning rule,” which determines how the memory vectors (target synaptic weight vectors) are derived from task variables, and the “plasticity rule,” which governs how these are translated into actual synaptic weight updates. We acknowledge that our manuscript still does not explicitly consider a plasticity rule that is sensitive to continuous error signals, as our analysis is restricted to binary weights. However, we believe that the updated notation and exposition make it clearer that our model could be applied in such a case.

      Reviewer #1 brought up that our framework cannot capture “single-shot learning, for example, under fear conditioning or if a presented stimulus is astonishing.” Reviewer #2 raised a related question of how our model “relates to the opposite more intuitive idea, that novel surprising experiences should be stored in memory, as the familiar ones are presumably already stored.”

      We agree that the built-in inability to consolidate memories after a single experience is a limitation of our model, and that extreme novelty is one factor (among others, such as salience or reward) that might incentivize one-shot consolidation. We have added a comment to the discussion to acknowledge these points (added text in bold): “ Moreover, in real neural circuits, additional factors besides recall, such as reward or salience, are likely to influence consolidation as well. For instance, a sufficiently salient event should be stored in long-term memory even if encountered only once. Furthermore, while in our model familiarity drives consolidation, certain forms of novelty may also incentivize consolidation, raising the prospect of a non-monotonic relationship between consolidation probability and familiarity.” We agree that future work should address the combined influence of recall (as in our model) and other factors on the propensity to consolidate a memory.

      Reviewer #1 requested, “a comparison/discussion of the wide range of models on synaptic tagging for consolidation by various types of signals. Notably, studies from Wulfram Gerstner's group (e.g., Brea, J., Clayton, N. S., & Gerstner, W. (2023). Computational models of episodic-like memory in food-caching birds. Nature Communications, 14(1); and studies on surprise).”

      We thank the reviewer for the reference, which we have added to the manuscript. The model of Brea et al. (2023) is similar to that of Roxin & Fusi (2013), in that consolidation consists of “copying” synaptic weights from one population to another. As a result, just like the model of Roxin & Fusi (2013), this model does not provide the benefit that our model offers in the context of consolidating repeatedly recurring memories. However, the model of Brea et al. does have other interesting properties – for instance, it affords the ability to decode the age of a memory, which our model does not. We have added a comment on this point in the subsection of the Discussion titled “Other models of systems consolidation.”

      Reviewer #2 noted, “While the article extensively discusses the strengths and advantages of the recall-gated consolidation model, it provides a limited discussion of potential limitations or shortcomings of the model, such as the missing feature of generalization, which is part of previous consolidation models. The model is not compared to other consolidation models in terms of performance and how much it increases the signal-to-noise ratio.”

      We agree that our work does not consider the notion of generalization and associated changes to representational geometry that accompany consolidation, which is the focus of many other studies on consolidation. We have further highlighted this limitation in the discussion. Regarding the comparison to other models, this is a tricky point as the desideratum we emphasize in this study (the ability to recall memories that are intermittently reinforced) is not the focus of other studies. Indeed, our focus is primarily on the ability of systems consolidation to be selective in which memories are consolidated, which is somewhat orthogonal to the focus of many other theoretical studies of consolidation. We have updated some wording in the introduction to emphasize this focus.

      Additional comments made by reviewer #1

      Reviewer #1 pointed out issues in the clarity of Fig. 2A. We have added substantial clarifying text to the figure caption.

      Reviewer #1 pointed out lack of clarity in our introduction to the terms “reliability” and “reinforcement.” We have now made it more clear what we mean by these terms the first time they are used.

      We have updated our definition of “recall” to use the term “recall factor,” which is how we refer to it subsequently in the paper.

      We have made explicit in the main text our simplifying assumption that memories are mean-centered.

      We have made consistent our use of “forgetting curve” and “memory trace”.

      Additional comments made by reviewer #2

      We have added a comment in the discussion acknowledging alternative interpretations of the result of Terada et al. (2021).

      We have significantly expanded the discussion of findings about the mushroom body to make it accessible to readers who do not specialize in this area. We hope this clarifies the nature of the experimental finding, which uncovered a circuit that performs a strikingly clean implementation of our model.

      The reviewer expresses concern that the songbird study (Tachibana et al., 2022) does not provide direct evidence for consolidation being gated by familiarity of patterns of activity. Indeed, the experimental finding is one step removed from the direct predictions of our model. That said, the finding – that the rate of consolidation increases with performance – is highly nontrivial, and is predicted by our model when applied to reinforcement learning tasks. We have added a comment to the discussion acknowledging that this experimental support for our model is behavioral and not mechanistic.

      We do not regard it as completely trivial that the parallel LTM model performs roughly the same as the STM model, since a slower learning rate can achieve a higher SNR (as in Fig. 2C). Nevertheless we have added wording to the main text around Fig. 4B to note that the result is not too surprising.

      We have added a sentence that clarifies the goal / question of our paper earlier on in the introduction.

      We have updated Figure 3 by labeling the key components of the schematics and adding more detail to the legend, as suggested by the reviewer. We also reordered the figure panels as suggested.

      Additional comments made by reviewer #3:

      We have clarified in the main text that Fig. 2C and all results from Fig. 4 onward are derived from an ideal observer model (which we also more clearly define).

      We have now emphasized in the main text that the derivations of the recall factors for specific learning rules are derived in the Supplementary Information.

      We have highlighted more clearly in the main text that the recall factors associated with specific learning rules may correspond to other notions that do not intuitively correspond to “recall,” and have added a pointer to Fig. 3A where these interpretations are spelled out.

      We have added references corresponding to the types of learning rules we consider.

      The cutoffs / piecewise-looking behavior of plots in Fig. 4 are primarily the result of finite N, which limits the maximum SNR of the system, rather than coarse sampling of parameter values.

      Thank you for pointing out the error in the legend in Fig. 5D (also affected Supp Fig. S7/S8), which is now fixed.

      The reference to the nonexistent panel Fig. 5G has been removed.

      As the reviewer points out, the use of a binary action output in our reinforcement learning task renders it quite similar to the supervised learning task, making the example less compelling. In the revised manuscript we have updated the RL simulation to use three actions. Note also that in our original submission the network outputs represented action probabilities directly (which is straightforward to do for binary actions, but not for more than two available actions). In order to parameterize a policy when more than two actions are available, we sample actions using a softmax policy, as is more standard in the field and as the reviewer suggested. The associated recall factor is still a product of reward and a “confidence factor,” and the confidence factor is still the value of the network output in the unit corresponding to the chosen action, but in the updated implementation this factor is equal to , similar (though with a sign difference) to the reviewer’s suggestion. We believe these updates make our RL implementation and simulation more compelling, as it allows them to be applied to tasks with arbitrary numbers of actions.
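
      Sampling from a softmax policy over three actions can be sketched as follows. This is an illustration only, not the authors' implementation: the logit values, the function names, and the fixed seed are assumptions, and the exact confidence-factor expression used in the manuscript is not reproduced here.

```python
import math
import random


def softmax(logits):
    """Convert raw network outputs into action probabilities."""
    peak = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def sample_action(logits, rng):
    """Draw one action index according to the softmax probabilities."""
    probs = softmax(logits)
    action = rng.choices(range(len(probs)), weights=probs, k=1)[0]
    return action, probs


rng = random.Random(0)    # fixed seed, chosen only for reproducibility
logits = [1.0, 0.5, -0.2]  # hypothetical network outputs for three actions
action, probs = sample_action(logits, rng)
print("chosen action:", action)
print("action probabilities:", [round(p, 3) for p in probs])
```

      The same two functions work for any number of actions, which is the property the response highlights; with two actions the softmax reduces to a logistic function of the difference between the two outputs.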

      Additional minor comments

      The reviewers made a number of other specific line-by-line wording suggestions, typo corrections,

    1. Author response:

      The following is the authors’ response to the original reviews.

      eLife assessment

      The study provides potentially fundamental insight into the function and evolution of daily rhythms. The authors investigate the function of the putative core circadian clock gene Clock in the cnidarian Nematostella vectensis. While in parts still incomplete, the evidence suggests that, in contrast to mice and fruit flies, Clock in this species is important for daily rhythms under constant conditions, but not under a rhythmic light/dark cycle, suggesting that the major role of the circadian oscillator in this species could be a stabilizing function under non-rhythmic environmental conditions.

      Public Reviews:

      Reviewer #1 (Public Review):

      In this nice study, the authors set out to investigate the role of the canonical circadian gene Clock in the rhythmic biology of the basal metazoan Nematostella vectensis, a sea anemone, which might illuminate the evolution of the Clock gene functionality. To achieve their aims the team generated a Clock knockout mutant line (Clock-/-) by CRISPR/Cas9 gene deletion and subsequent crossing. They then compared wild-type (WT) with Clock-/- animals for locomotor activity and transcriptomic changes over time in constant darkness (DD) and under light/dark cycles to establish these phenotypes under circadian control and those driven by light cycles. In addition, they used Hybridization Chain Reaction-In situ Hybridization (HCR-ISH) to demonstrate the spatial expression of Clock and a putative circadian clock-controlled gene Myh7 in whole-mounted juvenile anemones.

      The authors demonstrate that under LD both WT and Clock-/- animals were behaviourally rhythmic but under DD the mutants lost this rhythmicity, indicating that Clock is necessary for endogenous rhythms in activity. With altered LD regimes (LD6:6) they show also that Clock is light-dependent. RNAseq comparisons of rhythmic gene expression in WT and Clock-/- animals suggest that Clock KO has a profound effect on the rhythmic genome, with very little overlap in rhythmic transcripts between the two phenotypes; of the genes rhythmic in both LD and DD in WT animals (220, termed clock-controlled genes, CCGs), 85% were not rhythmic in Clock-/- animals in either light condition. In silico gene ontology (GO) analysis of CCGs reflected processes associated with circadian control. Correspondingly, those genes rhythmic in KO animals under DD (here termed neoCCGs) were not rhythmic in WT, lacked upstream E-box motifs associated with circadian regulation, and did not display any GO enrichment terms. 'Core' circadian genes (as identified in previous literature) in WT and Clock-/- animals were only rhythmic under entrainment (LD) conditions whilst Clock-/- displayed altered expression profiles under LD compared to WT. Comparing CCGs with previous studies of cycling genes in Nematostella, the authors selected a gene from 16 rhythmic transcripts. One of these, Myh7, was detectable by both RNAseq and HCR-ISH and considered a marker of the circadian clock by the authors.

      The authors claim that the study reveals insights into the evolutionary origin of circadian timing; Clock is conserved across distant groups of organisms, having a function as a positive regulator of the transcriptional translational feedback loop at the heart of daily timing, but is not a central element of the core feedback loop circadian system in this basal species. Their behavioural and transcriptomic data largely support the claims that Clock is necessary for endogenous daily activity but that the putative molecular circadian system is not self-sustained under constant darkness (this was known already for WT animals)- rather it is responsive to light cycles with altered dynamics in Clock-/- specimens in some core genes under LD. In the main, I think the authors achieved their aims and the manuscript is a solid piece of important work. The Clock-/- animal is a useful resource for examining time-keeping in a basal metazoan.

      The work described builds on other transcriptomic-based works on cnidaria, including Nematostella, and does probe into the molecular underpinnings with a loss-of-function in a gene known to be core in other circadian systems. The field of chronobiology will benefit from the evolutionary aspect of this work and the fact that it highlights the necessity to study a range of non-model species to get a fuller picture of timing systems to better appreciate the development and diversity of clocks.

      Strengths:

      The generation of a line of Clock mutant Nematostella is a very useful tool for the chronobiological community and coupled with a growing suite of tools in this species will be an asset. The experiments seem mostly well conceived and executed (NB see 'weaknesses'). The problem tackled is an interesting one and should be an important contribution to the field.

      Weaknesses:

      I think the claims about shedding light on the evolutionary origin of circadian time maintenance are a little bold. I agree that the data do point to an alternative role for Clock in this animal in light responsiveness, but this doesn't illuminate the evolution of time-keeping more broadly in my view. In addition, these are transcriptomic data and so should be caveated- they only demonstrate the expression of genes and not physiology beyond that. The time-course analysis is weakened by its low resolution, particularly for the RAIN algorithm when 4-hour intervals constrain the analysis. I accept that only 24h rhythms were selected in the analysis from this but, it might be that detail was lost - I think a preferred option would be 2 or 3-hour resolution or 2 full 24h cycles of analysis.

      The authors discount the possibility of the observed 12h rhythmicity in Clock-/- animals by exposing them to LD6:6 cycles before free-running them in DD. I suggest that LD cycles are not a particularly robust way to entrain tidal animals as far as we know. Recent papers show inundation/mechanical agitation are more reliable cues (Kwiatkowski ER, et al. Curr Biol. 2023, 2;33(10):1867-1882.e5. doi: 10.1016/j.cub.2023.03.015; Zhang L., et al Curr Biol. 2013, 23;19, 1863-1873 doi.org/10.1016/j.cub.2013.08.038.) and might be more effective in revealing endogenous 12h rhythms in the absence of 24h cues.

      Response: We removed the suggestion that we used 6:6h LD to perform tidal entrainment. We generated this ultradian light condition to address the 24h rhythmicity observed in the NvClk1-/- in 12:12h LD.

      Reviewer #2 (Public Review):

      This manuscript addresses an important question: what is the role of the gene Clock in the control of circadian rhythms in a very primitive group of animals: Cnidaria. Clock has been found to be essential for circadian rhythms in several animals, but its function outside of Bilaterian animals is unknown. The authors successfully generated a severe loss-of-function mutant in Nematostella. This is an important achievement that should help in understanding the early evolution of circadian clocks. Unfortunately, this study currently suffers from several important weaknesses. In particular, the authors do not present their work in a clear fashion, neither for a general audience nor for more expert readers, and there is a lack of attention to detail. There are also important methodological issues that weaken the study, and I have questions about the robustness of the data and their analysis. I am hoping that the authors will be able to address my concerns, as this work should prove important for the chronobiology field and beyond. I have highlighted below the most important issues, but the manuscript needs editing throughout to be accessible to a broad audience, and referencing could be improved.

      Major issues:

      (1) Why do the authors make the claim in the abstract that CLOCK function is conserved with other animals when their data suggest that it is not essential for circadian rhythms? dCLK is strictly required in Drosophila for circadian rhythms. In mammals, there are two paralogs, CLOCK and NPAS2, but without them, there are no circadian rhythms either. Note also that the recent claim of BMAL1-independent rhythms in mammals by Ray et al., quoted in the discussion to support the idea that rhythms can be observed in the absence of the positive elements of the circadian core clock, had to be corrected substantially, and its main conclusions have been disputed by both Abruzzi et al. and Ness-Cohn et al. This should be mentioned.

      Response: According to our behavioral and transcriptomic data, CLOCK function is conserved under constant light conditions. In the LD context, rhythmicity is probably maintained by the light-response pathway in Nematostella. We modified our rhythmic transcriptomic analysis, considered the contested results of Ray et al., and discussed them in the revised manuscript.

      (2) The discussion of CIPC on line 222 is hard to follow as well. How does mRNA rhythm inform the function of CIPC, and why would it function as a "dampening factor"? Given that it is "the only core clock member included in the Clock-dependent CCGs," (220) more discussion seems warranted. Discussing work done on this protein in mammals and flies might provide more insight.

      Response: The initial sentence was unclear. Furthermore, since we restricted our rhythmic analysis to genes only found rhythmic with a p<0.01 with RAIN combined with JTK, NvCipc was no longer defined as rhythmic in free running.

      (3) The behavioral arrhythmicity seen with their Clock mutation is really interesting. However, what is shown is only an averaged behavior trace and a single periodogram for the entire population. This leaves open the possibility that individual animals are poorly synchronized with each other, rather than arrhythmic. I also note that in DD there seem to be some residual rhythms, though they do not reach significance. Thus, it is also possible that at least some individual animals retain weak rhythms. The authors should analyze behavioral rhythms in individual animals to determine whether behavioral rhythmicity is really lost. This is important for the solidity of their main conclusions.

      Response: Fig. 1 has been modified. We have separated the data for WT and NvClk1-/- animals to provide clarity on the average behavior pattern for each genotype. While the LSP analysis on the population average informs us about the synchronization of the population, it is true that it does not provide insight into individual rhythmicity. To address this, we analyzed individuals in all conditions using the Discorhythm website (Carlucci et al., 2019).

      In the revised figure, we have included a comparison plot of the acrophase of 24-hour rhythmic animals between genotypes using Cosinor analysis, which is most suitable for acrophase detection. This plot indicates the number of animals detected as significantly rhythmic, providing direct visual input to the reader regarding individual rhythmicity. Additionally, we have added Table 1, which contains the Cosinor period analysis (24 and 12 hours) of individuals for all genotypes and conditions, further enhancing the clarity of our findings.
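
      For readers unfamiliar with Cosinor analysis, the single-component fit at a fixed period can be sketched as below. This is a minimal illustration under stated assumptions, not the DiscoRhythm implementation: the function names and the synthetic activity trace are hypothetical, and the significance test that decides whether an animal counts as rhythmic is omitted. The model is y(t) = M + A cos(2πt/T − φ), fitted by ordinary least squares on cosine and sine regressors.

```python
import math


def solve3(a, b):
    """Solve a 3x3 linear system by Gaussian elimination with partial pivoting."""
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(3):
        pivot = max(range(col, 3), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, 3):
            factor = m[r][col] / m[col][col]
            for c in range(col, 4):
                m[r][c] -= factor * m[col][c]
    x = [0.0, 0.0, 0.0]
    for r in (2, 1, 0):
        rest = sum(m[r][c] * x[c] for c in range(r + 1, 3))
        x[r] = (m[r][3] - rest) / m[r][r]
    return x


def cosinor_fit(times_h, values, period_h=24.0):
    """Return (mesor, amplitude, acrophase in hours) for a fixed period."""
    omega = 2 * math.pi / period_h
    rows = [(1.0, math.cos(omega * t), math.sin(omega * t)) for t in times_h]
    # Normal equations of the least-squares problem y ~ M + b*cos + g*sin.
    ata = [[sum(r[i] * r[j] for r in rows) for j in range(3)] for i in range(3)]
    aty = [sum(r[i] * y for r, y in zip(rows, values)) for i in range(3)]
    mesor, beta, gamma = solve3(ata, aty)
    amplitude = math.hypot(beta, gamma)
    acrophase_h = (math.atan2(gamma, beta) / omega) % period_h
    return mesor, amplitude, acrophase_h


# Synthetic hourly trace over three days, peaking 6 h after time zero.
times = list(range(72))
trace = [10 + 3 * math.cos(2 * math.pi * (t - 6) / 24) for t in times]
mesor, amplitude, acrophase = cosinor_fit(times, trace)
print(round(mesor, 3), round(amplitude, 3), round(acrophase, 3))
```

      Running the same fit with `period_h=12.0` gives the 12-hour counterpart, which corresponds to the two periods reported in Table 1.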

      (4) There is no mention in the results section of the behavior of heterozygotes. Based on supplement figure 2A, there is a clear reduction in amplitude in the heterozygous animals. Perhaps this might be because there is only half a dose of Clock, but perhaps this could be because of a dominant-negative activity of the truncated protein. There is no direct functional evidence to support the claim that the mutant allele is nonfunctional, so it is important to discuss carefully studies in other species that would support this claim, and the heterozygous behavior since it raises the possibility that the mutant allele acts as a dominant negative.

      Response: Extended Data Fig. 1 has been modified. We now show the normalized locomotion of the NvClk1+/- population over time in DD, a comparison of individual normalized behavior amplitude, the LSP of the population average, and the individual acrophase of only the 24h-rhythmic individuals. Indeed, we cannot discriminate a dominant-negative from a non-functional allele.

      (5) I do not understand what the bar graphs in Figure 2E and 3B represent - what does the y-axis label refer to?

      Response: Not relevant to the revised manuscript.

      (6a) I note that RAIN was used, with a p<0.05 cut-off. I believe RAIN is quite generous in calling genes rhythmic, and the p-value cut-off is also quite high. What happens if the stringency is increased, for example with a p<0.01?

      Response: We acknowledge your concern regarding the stringency of our statistical analysis. To address this, we opted to combine both RAIN and JTK methods and applied a more stringent p-value cut-off of p<0.01.
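
      One way to read "combining" the two methods is to keep a gene only when both call it rhythmic below the cut-off. The sketch below illustrates that reading; it is an assumption about the procedure, and the gene names and p-values are made-up placeholders, not data from the study.

```python
P_CUTOFF = 0.01

# Made-up p-values for hypothetical genes, for illustration only.
rain_p = {"geneA": 0.002, "geneB": 0.03, "geneC": 0.008, "geneD": 0.0005}
jtk_p = {"geneA": 0.004, "geneB": 0.001, "geneC": 0.02, "geneD": 0.009}


def rhythmic_in_both(p_first, p_second, cutoff=P_CUTOFF):
    """Keep genes whose p-value is below the cut-off in both methods."""
    shared = p_first.keys() & p_second.keys()
    return sorted(
        gene for gene in shared
        if p_first[gene] < cutoff and p_second[gene] < cutoff
    )


rhythmic = rhythmic_in_both(rain_p, jtk_p)
print(rhythmic)
```

      Requiring agreement between two detection methods is stricter than either method alone, so the resulting list can only shrink relative to a single-method call at the same cut-off.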

      (6b) It would be worth choosing a few genes called rhythmic in different conditions (mutant or wild-type. LD or DD), and using qPCR to validate the RNAseq results. For example, in Figure 3D, Myh7 RNAseq data are shown, and they do not look convincing. I am surprised this would be called a circadian rhythm. In wild-type, the curve seems arrhythmic to me, with three peaks, and a rather large difference between the first and second ZT0 time point. In the Clock mutants, rhythms seem to have a 12hr period, so they should not be called rhythmic according to the material and methods, which says that only ca 24hr period mRNA rhythms were considered rhythmic. Also, the result section does not say anything about Myh7 rhythms. What do they tell us? Why were they presented at all?

      Response: Regarding the suggestion for independent verification of our RNAseq results, we agree that such validation would enhance the robustness of our findings. To address this, we chose to overlap our identified rhythmic genes under WT LD conditions with those from another transcriptomic study that shared similarities in experimental design. Notably, the majority of overlapping rhythmic genes between the studies are candidate pacemaker genes. We believe that this replication of biologically significant rhythmic genes strengthens the validity and reliability of our results (see Extended Data Fig. 2).

      Furthermore, we have decided to remove the NvMhc-st (mistakenly named Myh7, only rhythmic in WT DD in the new analysis) as it does not contribute substantively to the revised version of the manuscript.

      (7) The authors should explain better why only the genes that are both rhythmic in LD and DD are considered to be clock-controlled genes (CCGs). In theory, any gene rhythmic in DD could be a CCG. However, Leach and Reitzel actually found that most genes in DD1 do not cycle the next day (DD2). This suggests that most "rhythmic" genes might show a transient change in expression due to prolonged obscurity and/or the stress induced by the absence of a light-dark cycle, rather than being clock controlled. Is this why the authors saw genes rhythmic under both LD and DD as actual CCGs? I would suggest verifying that in DD the phase of the oscillation for each CCG is similar to that in LD. If a gene is just responding to obscurity, it might show an elevated expression at the end of the dark period of LD, and then a high level in the first hours of DD. Such an expression pattern would be very unlikely to be controlled by the circadian clock.

      Response: As we modified our transcriptomic analysis, we no longer analyze LD+DD rhythmic genes, but rather any genes rhythmic (RAIN and JTK p<0.01) in each condition. As such, we end up with four lists of genes, one for each experimental condition.

      (8) Since there are still rhythms in LD in Clock mutants, I wonder whether there is a paralog that could be taking Clock's place, similar to NPAS2 in mammals.

      Response: see response to (1) > The only NPAS2 ortholog identified in Nematostella, NPAS3, showed marginal significance (p=0.013) with RAIN in LD WT, suggesting a regulation similar to that of the candidate pacemaker genes. As such, we included it in our list of candidate pacemaker genes.

      (9) I do not follow the point the authors try to make in lines 268-272. The absence of anticipatory behavior in Drosophila Clk mutants results from disruption of the circadian molecular clock, due to the loss of Clk's circadian function. Which light-dependent function of Clock are the authors referring to, then? Also, following this, it should be kept in mind that clock mutant mice have a weakened oscillator. The effect on entrainment is secondary to the weakening of the oscillator, rather than a direct effect on the light input pathway (weaker oscillators have increased response to environmental inputs). The authors thus need to more clearly explain why they think there is a conservation of circadian and photic clock function.

      Response: Following the changes in our statistical analysis, we reframed the discussion and now directly address the circadian and the photic clock function (which we call the light-response pathway in the manuscript).

      Recommendations for the authors:

      We suggest the following improvements:

      (1) Please undertake a serious effort to make this work more accessible to non-marine chronobiologists. This includes better explanations, and schemes of the animal when images of staining are shown (e.g. Fig.1b) which include the labeling of relevant morphological structures mentioned in the text (like "tentacle endodermis and mesenteries" (line 132)). Similar issues for mentioned life cycle stages like "late planula stage" (line 133), "bisected physa" (line 149).

      Response: In Fig. 1b, we outlined the animal's shape and added two arrows to locate the tentacle endodermis and mesenteries. We replaced the term "late planula stage" with "larvae" and rephrased "bisected physa" as "tissue sampling".

      Please attend to details. This includes:

      • Wrong referrals to figures (currently line 151 refers to EDF2- but should be EDF 1 instead, there is a Fig.3f mentioned in the text, but there is no such Fig.).

      Response: Fixed

      • Mentioning of ZTs when the HCR stainings were performed.

      Response: Fixed

      • Fig.1 a shows a rather incomplete and thus potentially confusing phylogenetic tree. Vertebrates have at least two Clk orthologs (NPAS2 and CLK), please include both, use an outgroup, and root the tree.

      Response: Identifying NPAS2 and CLK orthologs in all species made the conclusion more confusing. However, we followed the suggestion to add an outgroup, using a CLK orthologous sequence identified in the sponge Amphimedon queenslandica, and rooted the tree. Thank you for the suggestion.

      • What do the y-axis labels in Figure 2E and 3B refer to exactly? Y-axis label annotations in Fig.3a,d are entirely missing- what do the numbers refer to?

      Response: not relevant in the revised manuscript

      • Fig.2D- is the Go term enrichment referring to LD or DD?

      Response: To DD. We made this clear in Figure 5.

      • Wording: "Clock regulates genetic pathways." What is meant by "genetic pathways"? There are no "non-genetic pathways". Could one simply say: "Clock regulates a variety of transcripts".

      Response: We modified our threshold to use only p.adj<0.01, which reduced the GO term numbers. We removed “genetic pathways” and now address the specific pathways: cell-cycle and neuronal.

      The use of the term "epistatic" is confusing (line 219), i.e. that light is epistatic to Clock. In genetics, epistasis is defined as the effect of gene interactions on phenotypes. To a geneticist, this implies that there is a second gene impacting on the phenotype of the Clock mutants. Please re-word.

      Response: “light is epistatic on Clock” has been re-phrased.

      The provided Supplementary tables are not well annotated. Several of them need guess-work about what is shown. For instance, for Supplementary Table 1, the Ns are unclear, which in total can go up to almost 200 per condition-genotype, but only about 30 animals for each were tested. Thus, where do the high totals in the LSP table come from? What do the numbers of each periodicity mean? Initially one might assume it was the number of animals that showed a periodogram peak at a given periodicity, but it seems that cannot be. Maybe it counted any period bin over statistical significance? Please clarify with better descriptions and labels.

      Response: Supplementary tables are now clearly annotated on their first tabs. Regarding Fig.1, we already addressed this point in the public review.

      Albeit not essential, it would be more reader-friendly to also add a summary table with average period and SD, power and SD, and percentage rhythmicity to the main figure.

      Response: Table 1 has been added: it contains the individual counts of rhythmic animals (24h and 12h) obtained with Cosinor. However, Discorhythm requires a specific period as input. We can therefore only provide the number of animals that are significantly rhythmic for a given period value, not an estimate of each animal's own period.

      (2) Some of the terminology is quite confusing, in particular the double meaning of the word "clock" (i.e the pacemaker and the transcription factor). This is not a specific problem to this manuscript, but it would be helpful for the readability to try to improve this.

      Could the gene/transcript/protein be spelled: clk and Clk?

      Alternatively, for clarity- how about talking about "core pacemaker genes," "CLOCK-dependent rhythmic genes" and "CLOCK-independent rhythmic genes"?

      Response:

      Clock/CLOCK > NvClk / NvCLK and the mutant is NvClk1-/-

      Core clock genes > candidate pacemaker genes.

      CLOCK-dependent CCG > this notion no longer exists in the revised manuscript.

      CLOCK-independent CCG > this notion no longer exists in the revised manuscript.

      (3) The dismissal of the 12h rhythmicity in Clock-/- animals is not really convincing and should be reconsidered. LD6:6 cycles (before free-running animals in DD) is likely a not particularly robust way to entrain tidal animals. Recent papers show inundation/mechanical agitation are more reliable cues (Kwiatkowski ER, et al. Curr Biol. 2023, 2;33(10):1867-1882.e5. doi: 10.1016/j.cub.2023.03.015; Zhang L., et al Curr Biol. 2013, 23;19, 1863-1873 doi.org/10.1016/j.cub.2013.08.038.) and might be more effective in revealing endogenous 12h rhythms in the absence of 24h cues.

      Response: We removed the proposition of using 6:6hLD as Tidal entrainment. Instead, the LD 6:6 experiment reveals the direct light-dependency of the NvClk1-/- mutant.

      (4) There are significant questions raised on the validity of BMAL1-independent rhythms in mammals as suggested by the Ray et al study. See DOI: 10.1126/science.abe9230 and DOI: 10.1126/science.abf0922

      These technical comments should also be taken into account and the discussion adjusted accordingly to better reflect the ongoing discussions in the chronobiology field.

      Response: We modified our rhythmicity analysis. Because using BHQ or adjusted p-values left very few genes, we defined genes as 24h-rhythmic if p<0.01 with two different algorithms (RAIN and JTK). We propose this compromise to reduce the risk of false positives. Furthermore, we discuss our methodology in light of the significant questions raised by the papers you cited. We thank the reviewer for this important point.
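
      The two-algorithm criterion described in this response can be illustrated with a minimal sketch. It assumes that per-gene p-values from RAIN and JTK are already available as dictionaries; the gene names and p-values below are hypothetical placeholders, not data from the study, and the function name is ours.

```python
# Hypothetical per-gene p-values from two rhythmicity algorithms.
# Gene names and values are placeholders, not data from the study.
rain_p = {"geneA": 0.002, "geneB": 0.030, "geneC": 0.008, "geneD": 0.0004}
jtk_p = {"geneA": 0.004, "geneB": 0.001, "geneC": 0.020, "geneD": 0.0090}

P_THRESHOLD = 0.01  # cutoff stated in the response (p < 0.01)


def rhythmic_in_both(p_first, p_second, threshold=P_THRESHOLD):
    """Return the sorted genes whose p-value is below the threshold in both inputs."""
    shared = p_first.keys() & p_second.keys()
    return sorted(
        gene
        for gene in shared
        if p_first[gene] < threshold and p_second[gene] < threshold
    )


rhythmic_genes = rhythmic_in_both(rain_p, jtk_p)
print(rhythmic_genes)
```

      Requiring both algorithms to agree is stricter than either call alone, but, unlike a Benjamini-Hochberg adjustment, it does not by itself control the false discovery rate.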

      (5) The HCR stainings for clk are not very convincing. Normally, HCR should have more dots. In principle, the logic of HCR is such that it detects individual mRNA molecules in the cell. Thus, having only one strong dot/cell like in Fig.1b doesn't make much sense.

      Response: We were the first to be surprised by this single-dot signal. We are experienced users of HCRv.3 across different species. We decided to remove the close-up (pending further investigation) but to keep the whole-animal signal, which we consider convincing. However, because the signal is punctate, it is difficult to make it clearly visible in a whole-animal image. We did our best to make the mRNA signal visible without altering the pattern.

      Furthermore, the controls for the HCR in situ hybridization are unclear. In the methods, there are two Clock probes described (B3 & B5) and two control probes (B1 & B3), however, in the negative control image, a combination of one Clock (B1) and one control (B3) probes is used and is unclear what "redundant detection" means in the legend of figure S2.

      Response: Considering the nature of the signal (a single dot or a few dots), we decided to use two probes with two different fluorophores. Noise is by nature random, so our hypothesis was that only overlapping fluorescent dots represent true NvClk mRNA signal.

      For control probes, we used two zebrafish probes labelling hypothalamic peptides.
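
      The overlap logic stated above (a dot counts as signal only if both fluorophores report it) can be sketched as follows. This is an illustration of the reasoning, not the authors' image-analysis procedure: the dot coordinates, the distance tolerance, and the function name are all hypothetical.

```python
import math


def colocalized_dots(channel_a, channel_b, max_distance):
    """Keep dots from channel_a that have a channel_b dot within max_distance."""
    kept = []
    for ax, ay in channel_a:
        if any(
            math.hypot(ax - bx, ay - by) <= max_distance
            for bx, by in channel_b
        ):
            kept.append((ax, ay))
    return kept


# Hypothetical dot centres (in pixels) detected in each fluorophore channel.
dots_probe_1 = [(10.0, 12.0), (40.0, 41.0), (75.0, 20.0)]
dots_probe_2 = [(10.5, 12.2), (90.0, 90.0), (74.0, 21.0)]

# The 2-pixel tolerance is an arbitrary choice made for this example.
overlapping = colocalized_dots(dots_probe_1, dots_probe_2, max_distance=2.0)
print(overlapping)
```

      Dots present in only one channel, such as the one at (40.0, 41.0) here, are treated as noise and discarded.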

      Based on the experience with non-Drosophila, non-mouse animal model systems, the reviewers assume that nonsense-mediated mRNA decay (NMD) is not strongly initiated upon CRISPR-induced premature STOP codons. If this assumption is correct, it would be worth mentioning. Alternatively, it would be worth testing if Nematostella induces NMD, as this would be a great control for the HCR and the mutation itself. At which ZT was the HCR done?

      Response: We performed the HCR at ZT10 when NvClk is described to be at peak. It is now indicated in the Fig. 1b. The RNAseq detected a higher quantity of NvClk1 mRNA in the NvClk1-/- (see Fig. 4a). mRNA quantity regulation involves transcription, stabilization, and degradation. At this stage, we cannot identify which specific step is affected.

      For Fig.1c- please provide the binding site and sequence in the figure, simply include EDF 1 in the main figure.

      Response: The new Fig.1c and EDF. 1b now clearly indicate the protein domains, the CRISPR binding site and the consequences for the DNA and amino acid sequences.

      (6) Please provide the individual trace data for the behavioral analyses either as supplementary files or as a link to an openly accessible database like DRYAD (see also comment 7 in the public review of reviewer 2). Maybe this is what is shown in Supplementary Table 1, but it is really not clear what is actually shown.

      Response: Fig.1 is updated. Table 1 is added. Supplementary Table 1 contains the individual normalized locomotor data of each polyp for each genotype and light condition. Supplementary Table 2 contains the individual cosinor analysis of rhythmic behavior based on Supplementary Table 1.

      (7) It is not really clear if the mutation is a true loss-of-function or could also be dominant negative. While this is raised in the discussion, it should be more carefully considered. The reason why a dominant negative would be unlikely is unclear. More specifically also see comment 8) in the public review of reviewer 2.

      Response: Indeed, the results cannot tell us if it is a true loss of function, a dominant negative or non-functional allele. We addressed it in the first part of the discussion.

      (8) The pretty small overlap of rhythmic transcripts in LD and DD could reflect the true biology of a more core clock driven-process under constant conditions and a more light-driven process under LD. But still- wouldn't one expect that similar processes should be rhythmic? If not, why not?

      It would certainly add strength to the data if for one or two transcripts these results were independently verified by qPCR from an independent sampling. This could even be done for just two time points with the most extreme differences.

      Response: We appreciate the reviewer's comments and concerns regarding the overlap of rhythmic transcripts in different conditions. In response to the reviewer's query, we revised our interpretation of the transcriptomic data, acknowledging the limited overlap between light and genotype conditions in our study. This prompted us to reconsider the underlying biological processes driving rhythmic gene expression under constant conditions versus light-dark cycles.

      Regarding the suggestion for independent verification of our RNAseq results, we agree that such validation would enhance the robustness of our findings. To address this, we chose to overlap our identified rhythmic genes under WT LD conditions with those from another transcriptomic study that shared similarities in experimental design. Notably, the majority of overlapping rhythmic genes between the studies are candidate pacemaker genes. We believe that this replication of biologically significant rhythmic genes strengthens the validity and reliability of our results (see Extended Data Fig. 2).

      (9) Expression of myh7 : Checking for co-expression should be pretty straightforward by HCR. This is what this type of staining technique is really good for. Please do clk and myh7 co-staining if you want to claim co-expression. Otherwise don't make such a claim.

      Response: We agree that checking for co-expression should be straightforward by HCR. However, due to time constraints during the revision period, we are unable to conduct the double in-situ experiment. Additionally, upon careful consideration, we recognize that including myhc-st (mistakenly named myh7) staining and co-expression analysis would not significantly contribute to the main conclusions of our study. Therefore, we have decided to remove this analysis from the revised manuscript.

      (10) Missing methodological details:

      • The false discovery rate for each analysis should be included (see Hughes et al.,: "Guidelines for Genome-Scale Analysis of Biological Rhythms," 2017).

      Response: The FDR is indicated for each gene in Supplementary Table 3.

      • Fig.1f- continuous light- please provide a spectrum (if there is no good spectrophotometer available, please provide at least manufacturer information).

      Response: Unfortunately, we did not have a good spectrophotometer available during the revision. We added the lamp reference to the Methods. We found the light spectrum provided by the supplier, but did not add it to the revised manuscript.

      Author response image 1.

      Spectrum of the Aquastar t8

      Also, it would be easier for the reader, if the measurements of light intensity are provided in photons, because this is what the light receptors ultimately measure.

      Response: Modified.

      • Fig.2E- please add the consensus sequence used for circadian E-box vs. E-box to the figure.

      Response: In Fig.4c of the revised manuscript, we show which E-box motifs we extracted for our promoter analysis. We also changed our analysis: we no longer use HOMER. Instead, we directly extracted promoter sequences, searched them for the canonical E-box (CANNTG) and the circadian E-box (CACGTG), and generated a circadian E-box enrichment output for each gene promoter.
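
      A direct motif search of this kind can be sketched as below. The two motifs are those named in the response; the example sequence, the function name, and the choice of "fraction of E-boxes that are circadian" as the per-promoter output are assumptions made for illustration, since the response does not specify the exact enrichment metric.

```python
import re

# Canonical E-box CANNTG; the lookahead lets overlapping matches be reported.
# Both motifs are palindromic, so scanning one strand is sufficient.
CANONICAL_EBOX = re.compile(r"(?=(CA[ACGT]{2}TG))")
CIRCADIAN_EBOX = "CACGTG"


def ebox_counts(promoter):
    """Count canonical and circadian E-boxes in one promoter sequence."""
    seq = promoter.upper()
    canonical = [match.group(1) for match in CANONICAL_EBOX.finditer(seq)]
    circadian = sum(1 for motif in canonical if motif == CIRCADIAN_EBOX)
    # Assumed summary metric: share of E-boxes that are the circadian variant.
    fraction = circadian / len(canonical) if canonical else 0.0
    return {
        "canonical": len(canonical),
        "circadian": circadian,
        "circadian_fraction": fraction,
    }


# Hypothetical promoter fragment, not a sequence from the study.
example = ebox_counts("ttCACGTGaaCAGCTGggCATATGcc")
print(example)
```

      Because CACGTG is itself an instance of CANNTG, every circadian E-box is also counted as a canonical one, which is why the sketch reports a fraction rather than two independent counts.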

      (11) There has been some discussion about the evolutionary statement as stated by the authors. It appears that depending on the background of the reader, this can be misunderstood. We thus suggest to more clearly point out where the author thinks there is evolutionary conservation (a function for clk in the circadian oscillator under constant light or dark conditions) versus where there is no apparent evolutionary conservation (the situation under light-dark conditions).

      Response: In the revised manuscript, we propose a conserved function of NvCLK in constant darkness, and a light-response pathway that compensates for the mutation under LD conditions.

      Please also consider the major comments 8 and 9 of the common review from reviewer 2.

      Reviewer #1 (Recommendations For The Authors):

      The hybridization chain-reaction ISH is OK but, I'm not sure I understand the control condition-this should be clarified. I would also welcome the use of Clock-/- animals in HCR as another, more direct level of control. In addition, the authors state that the Myh7 probes hybridise in anatomical regions resembling those for Clock (Fig 3e). It would be better to duplex these two probe sets with different fluors for a better representation of the relative spatial distributions of each transcript.

      Response: We agree that checking for co-expression should be straightforward by HCR. However, due to time constraints during the revision period, we are unable to conduct the double in-situ experiment. Additionally, upon careful consideration, we recognize that including myhc-st (mistakenly named myh7) staining and co-expression analysis would not significantly contribute to the main conclusions of our study. Therefore, we have decided to remove this analysis from the revised manuscript.

      We clarified in the methods the control probes design.

      Minor points:

      Figure legends do not all convey sufficient detail. For instance, Figure 1c needs a better explanation. Figure 3e- are these images both WT? Fig 3f doesn't exist and other figure text references do not align with figures and need an overhaul.

      Response: All errors have been fixed.

      Reviewer #2 (Recommendations For The Authors):

      Major issues:

      (1) The authors need to introduce their model system better for a broad audience. What are the tissues/cells that express Clock at a higher level? What is their function, does this provide a potential explanation for their specific Clock expression, and how CLOCK might regulate behavior? Terms such as "tentacle endodermis and mesenteries" (line 132), "late planula stage" (line 133), "bisected physa" (line 149) would need some explanation.

      Response: We modified terms such as "planula" to "larvae" and "bisected physa" to "tissue samples".

      2) Some of the terminology used is quite confusing, because of the double-meaning of the word "clock" (i.e the pacemaker and the transcription factor). The authors use terms such as "clock-controlled genes", "core clock genes", "CLOCK-dependent clock-controlled genes", "neo-clock-controlled genes". Is there any way to help the reader? Here are several suggestions: "core pacemaker genes," "CLOCK-dependent rhythmic genes" and "CLOCK-independent rhythmic genes".

      Response: All the terminology has been clarified; see previous comments.

      3) Also in the abstract, there is mention of "hierarchal light- and Clock-signaling" (52-3) - is this related to the statement on line 219 that light is epistatic to Clock? I do not quite understand what epistatic would mean here. Who is upstream of whom? LD modifies rhythmicity in Clock mutant animals, but Clock mutations also impact rhythmicity in LD. Also, as epistasis is defined as the effect of gene interactions on phenotypes - what is the secondary gene impacting the phenotype of the Clock mutants? I am not sure the term epistatic is appropriate in the present context.

      Response: Indeed, "epistatic" is a genetic term which might be unclear in this context. We removed it.

      4) The control for the in situ hybridization is unclear. In the methods, there are two Clock probes described (B3 & B5) and two control probes (B1 & B3), however, in the negative control image, a combination of one Clock (B1) and one control (B3) probe is used, I am not sure what "redundant detection" means in the legend of figure S2. Also, the sequences of each Clock probe should be provided. It might be worth testing the Clock mutant the authors generated. Clock mRNA could be reduced due to nonsense-mediated RNA decay, since the mutation causes a premature stop codon. This would be a great additional control for the in situ hybridization. Even better would be if, by chance, the probes target the mutated sequence. The signal should then be completely lost.

      Response: HCR uses tiling probes, meaning that the target transcript is covered by dozens of successive primer-like DNA sequences, on which the HCRv.3 technology relies. We cannot design a mutant-specific probe with this technology.

      (5) I have concerns with rhythmic-expression calls, particularly as there is so little overlap between LD and DD, and that a completely different set of rhythmic genes is observed in Clock mutant and wild-type animals. I am not an expert in whole-genome expression studies, so I hope one of my colleague reviewers can weigh in.

      When describing rhythmicity analysis in the Methods, it states that Benjamini-Hochberg corrections were applied to account for multiple comparisons. However, the false discovery rate for each analysis should be included (see Hughes et al.,: "Guidelines for Genome-Scale Analysis of Biological Rhythms," 2017).

      Response: As explained above, we cannot use Benjamini-Hochberg corrections, as only a few genes (mostly oscillator genes) pass the threshold. We therefore combined two different algorithms (RAIN and JTK) with p<0.01 to detect rhythmic genes confidently while reducing the risk of false positives.

      Minor issues:

      (1) Environmental inputs are not "circadian", as written in the title.

      Response: Title modified

      (2) In the abstract, the description of the Clock mutant behavioral phenotypes is hard to follow, with no mention of whether or not Clock mutant animals are behaviorally rhythmic or arrhythmic in constant conditions.

      Response: corrected

      (3) Abstract: A 6/6 h LD cycle is not a compressed tidal cycle as written in the abstract. Light is not an input to tidal rhythms.

      Response: corrected

      (4) Line 101: timeout is not a core clock gene in animals.

      Response: we removed it from the candidate pacemaker genes.

      (5) What is the evidence for the role of PAR-Zip proteins in the Nematostella clock? The reference provided does not mention those.

      Response: There are no functional data in Nematostella yet to support their role within the pacemaker. However, based on their rhythmicity in LD and their protein conservation, we included them in the candidate pacemaker genes list. The references have been corrected.

      (6) Line 125. should refer to Fig 1C when describing the Clock protein.

      Response: corrected

      (7) Line 143-4. based on the figure, the region targeted by gRNA was not "close to the 5' end" as stated, it is closer to the middle of the gene sequence as shown in Figure 1C. A more accurate description would be a region in between the PAS domains.

      Response: Indeed, we modified the figure and the text.

      (8) Line 150. The mutant allele is described as Clock1 initially, then for the rest of the paper as Clock-. Since it is not clear that the allele is a null (see major comment #8), Clock1 should be used throughout the manuscript.

      Response: the allele is named NvClk1 in the revised manuscript

      (9) Figure 2A, the second CT/ZT0 is misplaced.

      Response: Fig. 2 modified in the revised manuscript

      (10) Figure legend for 2E and 3B. "The 1000bp upstream ATG" is unclear. I guess it means that 1000bp upstream of the putative initiation codon was used.

      Response: Right, and in the revised version we analyzed 5kb upstream of the putative ATG.

      (11) Line 164. The authors write "We discovered..." , but wasn't it already known that these animals are behaviorally rhythmic?

      Response: Fixed

      (12) It would be worth mentioning in the results section the reduced amplitude of rhythms in LL compared to DD (in WT and seemingly also in Clock mutants).

      Response: Indeed, we observed a significant reduction in the mean amplitude in the NvClk1-/- in DD and LL compared to WT and NvClk1-/- in LD, DD and LL. However, as rhythmicity is lost in virtually all mutants in LL and DD, we do not think these results add to the current interpretation of the gene function.

      (13) Please correct the figure numbers in the main text, there are several mistakes.

      Response: Done

      (14) Line 196, most genes in the quoted study did not cycle on day 2, so whether they are truly clock controlled is questionable.

      Response: We agree; identifying free-running cycling genes in cnidarians remains a challenge. One limitation of this study was detecting genes that are rhythmic in LD and retain rhythmicity in DD. However, considering different transcriptomic studies (cited in the discussion), it seems that in the phylum Cnidaria the genes that are rhythmic in LD are not necessarily those identified as rhythmic in DD.

      (15) Line 204-206 needs to be rephrased. It is confusing.

      Response: rephrased

      (16) Line 216. Rephrase to something like: "A similar finding was made for."

      Response: rephrased

      (17) "Clock regulates genetic pathways" sounds quite odd. Do you mean it regulates preferentially specific genetic (or maybe better, molecular) pathways?

      Response: rephrased

      (18) Figure 4 and legend: Dashed lines indicating threshold are missing. Do the black and red dots represent WT and Clock-/-, as indicated in the legend, or up/down, as indicated in the figures?

      Response: Fig.5 modified accordingly. Colors in the Volcano plot indicate Up- (black) versus Down- (red) regulated. It is now coherent within the figure.

      (19) Legend for Extended figure 1. "Immature peptide sequence" is incorrect.

      Response: rephrased

      (20) Extended data Figure 4. What the asterisks label is unclear.

      Response: EDF4 was modified and became EDF2, with different content. The * indicates NvClk mRNA.

      (21) Line 228. Gene "isoforms". I guess the authors mean "paralogs".

      Response: corrected.

      (22) Line 232-3/Figure 3e. Please include a comparable image of the Clk ISH to facilitate the comparison of the spatial expression pattern. In addition, where and what is the "analysis" referred to - "the spatial expression pattern of Myh7 closely resembled that of Clock, as evidenced by our analysis"?

      Response: The analysis has been removed from the revised manuscript because we currently cannot perform the double ISH.

      (23) Line 282-3. As mentioned above, it is difficult to be sure that circadian behavior is lost, if only looking at a population of animals.

      Response: Fig.1 corrected

      (24) Line 301-5. Rephrase.

      Response: Rephrased

      (25) Line 325. I am not convinced that the author can say that their mutant is amorphic. See Major comment 8.

      Response: corrected.

      (26) Line 351 "simplifying interactions with the environment". Please explain what is meant here.

      Response: this confusing sentence has been removed from the revised manuscript

    1. Author response:

      The following is the authors’ response to the previous reviews.

      Public Reviews:

      Reviewer #1 (Public review):

      Summary:

      In this work, the authors investigate the functional difference between the most commonly expressed form of PTH, and a novel point mutation in PTH identified in a patient with chronic hypocalcemia and hyperphosphatemia. The value of this mutant form of PTH as a potential anabolic agent for bone is investigated alongside PTH(1-84), which is a current anabolic therapy. The authors have achieved the aims of the study. Their conclusion that this suggests a "new path of therapeutic PTH analog development" seems unfounded; the benefit of this PTH variant is not clear, but the work is still interesting.

      The work does not identify why the patient with this mutation has hypocalcemia and hyperphosphatemia; this was not the goal of the study, but the data is useful for helping to understand it.

      Thank you for your valuable feedback. In this study, we confirmed that <sup>R25C</sup>PTH can form a dimer, and our in vivo experiments in the mouse model demonstrated that dimeric <sup>R25C</sup>PTH can stimulate bone formation similarly to normal PTH. Furthermore, patients with the <sup>R25C</sup>PTH mutation, who have been exposed to high levels of this variant over an extended period, were reported to have high bone mineral density. Based on these observations, we hypothesized that dimeric <sup>R25C</sup>PTH might have potential as a new therapeutic PTH analog, particularly as a bone anabolic agent. However, we acknowledge that it is premature to make definitive claims regarding its therapeutic utility. Thus, we are currently conducting follow-up research to further investigate the subsignaling pathway changes induced by dimeric <sup>R25C</sup>PTH and their impact on bone metabolism.

      Moreover, to fully understand the patient’s symptoms, it is crucial to determine the form in which <sup>R25C</sup>PTH exists in vivo. While our in vitro experiments demonstrated that <sup>R25C</sup>PTH is secreted primarily in its dimeric form, we do not yet know whether this dimeric structure is maintained in vivo. We are actively conducting experiments to analyze the circulating form of <sup>R25C</sup>PTH in patients through blood sample collection (Andersen et al., 2022; Lee et al., 2015). Should the mutation predominantly exist in its monomeric form in vivo, this would align with clinical findings reported by Lee et al. (2015), which could help explain the patient’s hypocalcemia and hyperphosphatemia. However, if <sup>R25C</sup>PTH primarily exists in its dimeric form, additional research will be necessary to uncover the underlying mechanisms. Based on our experimental results, the dimeric <sup>R25C</sup>PTH exhibits a reduced binding affinity to PTH1R compared to the monomeric form. Furthermore, our in vitro experiments revealed that dimeric <sup>R25C</sup>PTH induces lower levels of cAMP production upon PTH1R activation. Accordingly, we can assume that this reduction in receptor signaling is likely to account for the impaired regulation of calcium and phosphate in patients with the mutation. However, despite this diminished signaling in calcium and phosphate homeostasis, dimeric <sup>R25C</sup>PTH was still capable of promoting bone formation at levels comparable to wild-type PTH. This apparent paradox warrants further investigation, and we are actively pursuing studies to elucidate how the dimeric form exerts its effects on bone metabolism.

      References

      Andersen, S. L., Frederiksen, A. L., Rasmussen, A. B., Madsen, M., & Christensen, A. R. (2022). Homozygous missense variant of PTH (c.166C>T, p.(Arg56Cys)) as the cause of familial isolated hypoparathyroidism in a three-year-old child. J Pediatr Endocrinol Metab, 35(5), 691-694. https://doi.org/10.1515/jpem-2021-0752

      Lee, S., Mannstadt, M., Guo, J., Kim, S. M., Yi, H. S., Khatri, A., Dean, T., Okazaki, M., Gardella, T. J., & Juppner, H. (2015). A Homozygous [Cys25]PTH(1-84) Mutation That Impairs PTH/PTHrP Receptor Activation Defines a Novel Form of Hypoparathyroidism. J Bone Miner Res, 30(10), 1803-1813. https://doi.org/10.1002/jbmr.2532

      Strengths:

      The work is novel, as it describes the function of a novel, naturally occurring, variant of PTH in terms of its ability to dimerise, to lead to cAMP activation, to increase serum calcium, and its pharmacological action compared to normal PTH.

      Weaknesses:

      (1) The use of very young, 10 week old, mice as a model of postmenopausal osteoporosis remains a limitation of this study, but this is now quite clearly described as a limitation, including justifying the use of the primary spongiosa as a measurement site.

      We appreciate the reviewer’s comment.

      (2) Methods have been clarified. It is still necessary to properly define the micro-CT threshold in mg HA/cm<sup>3</sup>. I think it might be at about 200 mg HA/cm<sup>3</sup> in this study.

      Thank you for your insightful comment. To address this, we used a hydroxyapatite (HA) phantom with HA content ranging from 0 to 1200 mg/cm<sup>3</sup>, with calibration points at 0, 50, 200, 800, 1000, and 1200 mg CaHA/cm<sup>3</sup>, to measure grayscale values via µ-CT. Based on these measurements, the trabecular bone BMD in our study was determined to range from 100 to 200 mg/cm<sup>3</sup>.

      Author response image 1.
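
      The phantom calibration described above amounts to fitting a line that converts grayscale values to HA density. The sketch below assumes a linear relationship; the density calibration points are those listed in the response, whereas the grayscale readings are hypothetical placeholders rather than measured values.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Calibration densities from the response (mg HA/cm^3).
phantom_density = [0, 50, 200, 800, 1000, 1200]
# Hypothetical grayscale readings for each phantom insert (placeholders).
phantom_gray = [1000, 1250, 2000, 5000, 6000, 7000]

# Fit density as a function of grayscale so that any voxel value can be converted.
slope, intercept = fit_line(phantom_gray, phantom_density)


def gray_to_density(gray_value):
    """Convert a grayscale value to mg HA/cm^3 using the fitted calibration."""
    return slope * gray_value + intercept


print(gray_to_density(2000))
```

      With such a calibration, a segmentation threshold can be reported in mg HA/cm<sup>3</sup> by passing the grayscale threshold used in the analysis through `gray_to_density`.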

      (3) The apparent contradiction between the cortical thickness data (where there is no difference between the two PTH formulations) and the mechanical testing data (where there is a difference) remains unresolved. It is still not clear whether there is a material defect in the bone, which can be partially assessed by reporting the 3-point bending test, corrected for the diameters of the bone (i.e. as stress / strain curves).

      Thank you for your comment. First, we ensured that the bones sampled during the experiment showed no defects, and we carefully separated the femur bones from the mice to preserve their integrity. In the 3-point bending test, PTH treatment significantly increased the maximum load of the femur bone compared to the OVX-control group. Additionally, the maximum load in the PTH treatment group was significantly greater than that observed in the PTH dimer group. Furthermore, structural factors influencing bone strength, such as the periosteal perimeter and the endocortical bone perimeter, were also increased in the PTH treatment group compared to the PTH dimer group (data only for reviewer).

      Author response image 2.
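
      The reviewer's suggestion in point (3), to correct the 3-point bending data for bone diameter, corresponds to converting load and displacement into stress and strain using standard beam theory. The sketch below is not the analysis performed in the study: it idealizes the femoral midshaft as a hollow circular tube, and all numerical inputs are hypothetical.

```python
import math


def hollow_circle_inertia(outer_diameter_mm, inner_diameter_mm):
    """Area moment of inertia (mm^4) of a hollow circular cross-section."""
    return math.pi * (outer_diameter_mm**4 - inner_diameter_mm**4) / 64.0


def bending_stress(force_n, span_mm, outer_diameter_mm, inertia_mm4):
    """Outer-surface stress (MPa) at midspan in 3-point bending: F * L * c / (4 * I)."""
    c = outer_diameter_mm / 2.0  # distance from the neutral axis to the surface
    return force_n * span_mm * c / (4.0 * inertia_mm4)


def bending_strain(displacement_mm, span_mm, outer_diameter_mm):
    """Outer-surface strain at midspan in 3-point bending: 12 * c * d / L^2."""
    c = outer_diameter_mm / 2.0
    return 12.0 * c * displacement_mm / span_mm**2


# Hypothetical femur geometry and test values (placeholders, not study data).
inertia = hollow_circle_inertia(outer_diameter_mm=1.6, inner_diameter_mm=1.0)
stress = bending_stress(force_n=15.0, span_mm=8.0, outer_diameter_mm=1.6, inertia_mm4=inertia)
strain = bending_strain(displacement_mm=0.2, span_mm=8.0, outer_diameter_mm=1.6)
print(stress, strain)
```

      If two groups differ in maximum load but not in the stress computed this way, the difference is attributable to geometry rather than to the material properties of the bone.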

      (4) It is also puzzling that both dimeric and monomeric PTH lead to a reduction in total bone area (cross sectional area?). This would suggest a reduction in bone growth. This should be discussed in the work.

      In our experiment, the data showed an increase in cortical bone area in the PTH treatment group, but not in the PTH dimer treatment group. However, both dimeric and monomeric PTH treatments resulted in a reduction in total tissue area. We added revised sentences on page 13, line 317 and page 14, line 333, as follows:

      “In addition, the data showed an increase in cortical bone area (Ct.Ar) in the PTH treatment group but not in the PTH dimer treatment group. However, both dimeric and monomeric PTH treatments reduced total tissue area (Tt.Ar), suggesting potential effects on bone growth in the width of mice or humans.”

      “This study has several limitations. First, it is urgently necessary to determine whether dimeric <sup>R25C</sup>PTH is present in human patient serum. Second, TRAP staining showed an inhibitory effect of PTH treatment on the primary spongiosa area. However, the secondary spongiosa, which more accurately reflects bone remodeling (55), was not examined due to the barely detectable bone in this area in OVX-induced osteoporosis mouse models. Third, it is unclear whether similar bone phenotypes exist between human <sup>R25C</sup>PTH patients and dimeric <sup>R25C</sup>PTH-treated mice, particularly regarding low bone strength. Although the dimeric <sup>R25C</sup>PTH-treated group showed higher cortical BMD compared to WT-Sham or PTH groups, there was no difference in bone strength compared to the osteoporotic mouse model. Fourth, our study demonstrated that PTH or <sup>R25C</sup>PTH treatment decreased circumferential length, which could affect bone growth in width. However, whether this phenotype is also observed in patients treated with PTH or <sup>R25C</sup>PTH remains uncertain.”

    1. Author Response

      The following is the authors’ response to the original reviews.

      We would like to thank the reviewers for their insightful comments and recommendations. We have extensively revised the manuscript in response to the valuable feedback. We believe the result is a more rigorous and thoughtful analysis of the data. Furthermore, our interpretation and discussion of the findings are more focused and highlight the importance of the circuit and its role in the response to stress. Thank you for helping to improve the presented science.

      Key changes made in response to the reviewers’ comments include:

      • Revision of statistical analyses for nearly all figures, with the addition of a new table of summary statistics to include F and/or t values alongside p-values.

      • Addition of statistical analyses for all fiber photometry data.

      • Examination of data for possible sex dependent effects.

      • Clarification of breeding strategies and genotype differences, with added details to methods to improve clarity.

      • Addressing concerns about the specificity of virus injections and the spread, with additional details added to methods.

      • Modification of terminology related to goal-directed behavior based on reviewer feedback, including removal of the term from the manuscript.

      • Clarification and additional data on the use of photostimulation and its effects, including efforts to inactivate neurons for further insight, despite technical challenges.

      • Correction of grammatical errors throughout the manuscript.

      Reviewer 1:

      Despite the manuscript being generally well-written and easy to follow, there are several grammatical errors throughout that need to be addressed.

      Thank you for highlighting this issue. Grammatical errors have been fixed in the revised version of the manuscript.

      Only p values are given in the text to support statistical differences. This is not sufficient. F and/or t values should be given as well.

      In response to this critique and similar comments from Reviewer 2, we re-evaluated our approach to statistical analyses and extensively revised the analyses for nearly all figures. We also added a new table of summary statistics (Supplemental Table 1) listing the type of analysis, statistic, comparison, multiple comparisons, and p value(s). For Figures 4C-E, 5C, 6C-E, 7H-I, and 8H we used two-way repeated measures (RM) ANOVA to examine the main effect of time (either number of sessions or stimulation period) within the same animal, the main effect of genotype (Cre+ vs Cre-), and their interaction. For Supplemental Figure 7A we also conducted a two-way RM ANOVA in Cre+ mice with time as one factor and activity state (number of port activations in the active vs inactive nose port) as the other. For Figures 5D-E we conducted a two-way mixed model ANOVA that accounted and corrected for missing data. For figures that compared only two groups of data (Figures 5F-L, 6F, 8C-D, 8I, and Supp 6F-G) we used a two-tailed t-test. If our question and/or hypothesis required multiple comparisons between or within treatments, we used Bonferroni’s multiple comparisons test for post hoc analysis (the groups compared are noted in Supplemental Table 1). For figures that did or did not show a change in calcium activity (Figures 3G, 3I-K, 7B, 7D-E, 8E-F), we compared waveform confidence intervals (Jean-Richard-Dit-Bressel, Clifford, McNally, 2020). The time windows used for comparison are noted in Supplemental Table 1, along with whether the comparisons were significant at the 95%, 99%, and 99.9% thresholds.

      None of the comparisons that were significant in the prior analyses fell below the threshold for significance in the revised analyses. Of those previously found to be not significantly different, only one changed. In Figure 6E there is now a significant baseline difference between Cre+ and Cre- mice, with Cre- mice taking longer to first engage the port compared to Cre+ mice (p=0.045). Although the more rigorous approach to the statistical analyses did not change our interpretations, we feel it enhanced the paper and thank the reviewer for pushing this improvement.
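
      For readers less familiar with the post hoc step mentioned above, the Bonferroni correction can be sketched in a few lines. This is an illustrative example only: the function name and the p values are hypothetical and are not taken from Supplemental Table 1. Each raw p value is multiplied by the number of comparisons in the family and capped at 1.

```python
def bonferroni_adjust(p_values):
    """Return Bonferroni-adjusted p values for one family of comparisons.

    Each raw p value is multiplied by the number of comparisons (m)
    and capped at 1.0.
    """
    m = len(p_values)
    return [min(1.0, p * m) for p in p_values]


# Hypothetical raw p values for three post hoc comparisons
# (for example pre-stim, stim, and post-stim between genotypes).
raw = [0.004, 0.030, 0.500]
adjusted = bonferroni_adjust(raw)
significant = [p < 0.05 for p in adjusted]
print(adjusted, significant)
```

      The correction is conservative: a comparison with a raw p value just under 0.05 will generally not survive once several comparisons are made within the same panel.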

      Moreover, the fibre photometry data does not appear to have any statistical analyses reported - only confidence intervals represented in the figures without any mention of whether the null hypothesis that the elevations in activity observed are different from the baseline.

      This is particularly important where there is ambiguity, such as in Figure 3K, where the spontaneous activity of the animal appears to correlate with a spike in activity but the text mentions that there is no such difference. Without statistics, this is difficult to judge.

      Thank you for highlighting this critical point and providing an opportunity to strengthen our manuscript. We added statistical analyses of all fiber photometry data using a recently described approach based on waveform confidence intervals (Jean-Richard-Dit-Bressel, Clifford, McNally, 2020). In the statistical summary (Supplemental Table 1) we note the time window that we used for comparison in each analysis and whether the comparisons were significant at the 95%, 99%, and 99.9% thresholds. Thank you for highlighting this and helping make the manuscript stronger.

      With respect to Figure 3K, we are not certain we understood which spike in activity the reviewer referred to. Figures 3J and K include both velocity data (gold) and the Ca2+-dependent signal (blue). We used episodes of velocity that were comparable to the avoidance response during the ambush test and found no significant differences in the Ca2+ signal when gating around changes in velocity in the absence of a stressor (Supplemental Table 1). This is in contrast to the significant change in Ca2+ signal following a mock predator ambush (Figure 3J). We interpret these data together to indicate that locomotion does not correlate with an increase in calcium activity in SuMVGLUT2+::POA neurons, but that coping with a stressor does. This conclusion is further examined in Supplemental Figure 5, including cross-correlation analysis to test for a temporally offset relationship between velocity and Ca2+ signal in SUMVGLUT2+::POA neurons.
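
      The general idea behind a waveform confidence interval analysis can be sketched as follows. This is a simplified illustration, not the published procedure or the code used for Supplemental Table 1: it computes a percentile bootstrap interval of the mean signal at each time point and flags time points whose interval excludes a baseline of zero. The function names, default parameters, and example traces are hypothetical, and refinements such as a consecutive-sample threshold are omitted.

```python
import random


def bootstrap_waveform_ci(trials, alpha=0.05, n_boot=1000, seed=0):
    """Percentile bootstrap CI of the mean waveform at each time point.

    trials: list of equal-length lists, one baseline-corrected trace
    per trial or animal.
    Returns (lower, upper), each a list with one value per time point.
    """
    rng = random.Random(seed)
    n_trials = len(trials)
    n_time = len(trials[0])
    boot_means = []
    for _ in range(n_boot):
        # Resample whole traces with replacement, then average them.
        sample = [trials[rng.randrange(n_trials)] for _ in range(n_trials)]
        boot_means.append(
            [sum(tr[t] for tr in sample) / n_trials for t in range(n_time)]
        )
    lo_idx = int((alpha / 2) * n_boot)
    hi_idx = int((1 - alpha / 2) * n_boot) - 1
    lower, upper = [], []
    for t in range(n_time):
        column = sorted(bm[t] for bm in boot_means)
        lower.append(column[lo_idx])
        upper.append(column[hi_idx])
    return lower, upper


def differs_from_baseline(lower, upper, baseline=0.0):
    """True at each time point where the CI excludes the baseline."""
    return [lo > baseline or hi < baseline for lo, hi in zip(lower, upper)]


# Hypothetical traces: time point 0 hovers around zero,
# time point 1 is consistently elevated.
traces = [[-1.0, 2.0], [1.0, 3.0], [-0.5, 2.5], [0.5, 3.5]]
lower, upper = bootstrap_waveform_ci(traces)
print(differs_from_baseline(lower, upper))
```

      Repeating the comparison with alpha set to 0.01 or 0.001 corresponds to the 99% and 99.9% thresholds mentioned above.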
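
      A lagged cross-correlation of the kind mentioned above can be sketched as follows. This is a minimal illustration under stated assumptions, not the analysis code behind Supplemental Figure 5: both series are assumed to be sampled at the same rate, the function names are hypothetical, and the example data are synthetic. The Pearson correlation is computed at each lag, where a positive lag means the second series follows the first.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def lagged_correlation(velocity, signal, max_lag):
    """Return {lag: r} for lags in [-max_lag, max_lag].

    A positive lag means `signal` follows `velocity` by `lag` samples.
    """
    n = len(velocity)
    result = {}
    for lag in range(-max_lag, max_lag + 1):
        if lag >= 0:
            x, y = velocity[: n - lag], signal[lag:]
        else:
            x, y = velocity[-lag:], signal[: n + lag]
        result[lag] = pearson(x, y)
    return result


# Synthetic example: the signal is the velocity trace delayed by 2 samples.
velocity = [0, 0, 1, 3, 1, 0, 0, 0, 2, 5, 2, 0, 0, 0]
signal = [0, 0] + velocity[:-2]
correlations = lagged_correlation(velocity, signal, max_lag=3)
best_lag = max(correlations, key=correlations.get)
print(best_lag, round(correlations[best_lag], 3))
```

      A flat correlation profile across lags would be consistent with the interpretation that the signal does not track locomotion, whereas a clear peak at a nonzero lag would indicate a temporally offset relationship.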

      The use of photostimulation only is unfortunate, it would have been really nice to see some inactivation of these neurons as well. This is because of the well-documented issues with being able to determine whether photostimulation is occurring in a physiological manner, and therefore makes certain data difficult to interpret. For instance, with regards to the 'active coping' behaviours - is this really the correct characterisation of what's going on? I wonder if the mice simply had developed immobile responding as a coping strategy but when they experience stimulation of these neurons that they find aversive, immobility is not sufficient to deal with the summative effects of the aversion from the swimming task as well as from the neuronal activation? An inactivation study would be more convincing.

      We agree with the reviewer that experiments demonstrating the necessity of SUMVGLUT2+::POA neurons would have added to the story here. We carried out multiple experiments aimed at addressing questions about the necessity of SuMVGLUT2+::POA neurons in stress coping behaviors, specifically the forced swim assay. Efforts included employing chemogenetic, optogenetic, and tetanus toxin-based methods. We observed no effects on locomotor activity or stress coping. These experiments are both technically difficult and challenging to interpret. Interpretation of negative results, as we obtained, is particularly difficult because of potential technical confounds. Selective targeting of SuMVGLUT2+::POA neurons for inhibition requires three viral injections and two recombination steps, increasing variability and reducing the number of neurons impacted. Alternatively, photoinhibition targeting SuMVGLUT2+::POA cells can be done using Retro-AAV injected into POA and a fiber implant over SuM. We tried both approaches. The data obtained were difficult to interpret because of questions about adequate coverage of the SuMVGLUT2+::POA population by virally expressed constructs and/or light spread. The challenge of adequate coverage to effectively prevent output from the targeted population is further confounded by challenges inherent in neural inhibition, specifically determining whether the inhibition created at the cellular level is adequate to block output in the context of excitatory inputs, or whether neurons must first be engaged in a particular manner for inhibition to be effective. Baseline neural activity, release probability, and post-synaptic effects could all be relevant, which photo-inhibition will potentially not resolve. So, while the trend is to always show “necessary and sufficient” effects, we’ve tried nearly everything, and we simply cannot conclude much from our mixed results. There are also well-established problems with existing photo-inhibition methods which, although the methods are widely used and touted, are often ignored. We have a lot of expertise in photo-inhibition optogenetics, and indeed have used it with some success and developed new methods, yet in this particular case we are unable to draw conclusions related to inhibition. Others have experienced similar challenges in locus coeruleus neurons, which have very low basal activity: inhibition with chemogenetics is very hard, as is inhibition with optogenetic pump-based approaches, because the neurons fire robust rebound APs. We have spent almost 2.5 years trying to get this to work in this circuit because reviewers have been insistent on this result for the paper to be conclusive. Unfortunately, it simply isn’t possible in our view until we know more about the cell types involved. This is all in spite of experience using the approach in many other publications.

      We also employed less selective approaches, such as injecting AAV-DIO-tetanus toxin light chain (Tettox) constructs directly into SuM of VGLUT2-Cre mice, but found off-target effects that impacted animal wellbeing and impeded behavioral testing due to viral spread to surrounding areas.

      While we are disappointed to be unable to directly address questions about the necessity of SuMVGLUT2+::POA neurons in active coping with experimental data, we were unable to obtain results allowing for clear interpretation across numerous other domains the reviewers requested. We also feel strongly that until we have a clear picture of the molecular cell type architecture in the SuM, and Cre-drivers to target subsets of neurons, this question will be difficult to resolve for any group. We are working now on RNAseq and related spatial transcriptomics efforts in the SuM and examining additional behavioral paradigms to resolve these issues, so stay tuned for future publications.

      Accordingly, we avoid making statements relating to necessity in the manuscript, in spite of having several lines of physiological data showing strong, robust correlations with behavior related to the SuMVGLUT2+::POA circuit.

      Nose poke is only nominally instrumental as it cannot be shown to have a unique relationship with the outcome that is independent of the stimuli-outcome relationships (in the same way that a lever press can, for example). Moreover, there is nothing here to show that the behaviours are goal-directed.

      Thank you for highlighting this point. We have removed goal-directed terminology from the manuscript. Since the mice perform highly selective (active vs inactive) port activation robustly across multiple days of training, the behavior likely transitions to habitual behavior. We only tested the valuation of stimulus termination on the final day of training, with a time-limited progressive ratio test. With respect to lever press versus active port activation, we are unclear how using a lever in this context would offer a different interpretation. Lever pressing may be more sensitive to changes in valuation when compared to nose poke port activation (Atalayer and Rowland 2008); however, in this study the focus of the operant behavior is separating innate behaviors from learned action–outcome instrumental behaviors for threat response (LeDoux and Daw 2018). The robust, highly selective activation of the active port illustrated in Figure 6 fits as an action–outcome instrumental behavior wherein mice learn to engage the active but not the inactive port to terminate photostimulation. The first activation of the port occurs through exploration of the arena but, as demonstrated by the number of active port activations and the decline in time to the first active port engagement, mice expressing ChR2eYFP learn to engage the port to terminate the stimulation. To aid in illustrating this point we have added Supplemental Figure 7 showing active and inactive port activations for both Cre+ and Cre- mice. This adds clarity to the high rate of selective port activation driven by stimulation of SUMVGLUT2+::POA neurons compared to controls. Eliminating the goal-directed terminology and providing additional data narrows and supports one of the key points of the operant experiment.

      With regards to Figure 1: This is a nice figure, but I wonder if some quantification of the pathways and their density might be helpful, perhaps by measuring the intensity of fluorescence in image J (as these are processes, not cell bodies that can be counted)? Mind you, they all look pretty dense so perhaps this is not necessary! However, because the authors are looking at projections in so-called 'stress-engaged regions', the amygdala seems conspicuous by its absence. Did the authors look in the amygdala and find no projections? If so it seems that this would be worth noting.

      This is an interesting question, but it has proven to be very technically challenging. We consulted with several leaders in the field who routinely use complementary viral tracing methods. We were unable to devise a satisfactorily meaningful quantitative (as opposed to qualitative) approach to compare SUMVGLUT2+::POA to SuMVGLUT2+ projections. Several limitations hinder a meaningful quantitative approach. One limitation was the need for different viral strategies to label the two populations. Labeling SuMVGLUT2+::POA neurons requires using VGLUT2-Flp mice with two injections into the POA and one into SuM. Two recombinase steps were required, reducing efficiency of overlap. This combination of viral injections, particularly the injections of RetroAAVs in the POA, can induce significant quantitative variability due to tropism, efficacy, and variability of retro-viral methods, and viral infection generally. These issues are often totally ignored in similar studies across the “neural circuit” landscape, but that doesn’t make them less relevant here.

      Although quantification is commonly shown in the field, we believe it can be a quite misleading read-out of functionally relevant circuitry, given that neurotransmitter release is ultimately amplified by receptors post-synaptically, and many examples of robust behavioral effects have been observed despite sparse fiber labeling with complementary tracing methods (McCall, Siuda et al. 2017). In contrast to the intersectional labeling of SuMVGLUT2+::POA neurons, the broader SuMVGLUT2+ population was labeled using a single injection into the SuM. This means there is likely more efficient expression of the fluorophore. Additionally, in areas that contain both terminals and passing fibers, understanding and interpreting the fluorescent signal is challenging. Together, these factors limit a meaningful quantitative comparison and make interpretation difficult. In this context, we focused on a conservative qualitative presentation to demonstrate two central points: 1) SuMVGLUT2+::POA neurons are a subset of SuMVGLUT2+ neurons that project to specific areas, excluding dentate gyrus, and 2) they arborize extensively to multiple areas which have been linked to threat responses. We agree that there is much to be learned about how different populations in SuM connect to targets in different regions of the brain, and that this question should continue to be examined with different techniques. A meaningful quantitative study comparing projections is technically complex and, we feel, beyond our ability for this study.

      Also, for the reasons above we do not believe that quantification provides exceptional clarity with respect to the putative function of the circuit, glutamate released, or other cotransmitters given known amplification at the post-synaptic side of the circuit.

      With regard to the amygdala, other studies on SuM projections have found efferent projections to amygdala (Ottersen, 1980; Vertes, 1992). In our study we were unable to definitively determine projections from SuMVGLUT2+::POA neurons to amygdala, which if present are not particularly dense. For this reason we were conservative and do not comment on this particular structure.

      I would suggest removing the term goal-directed from the manuscript and just focusing on the active vs. passive distinction.

      We removed the use of goal-directed. Thank you for helping us clarify our terminology.

      The effect observed in Figure 7I is interesting, and I'm wondering if a rebound effect is the most likely explanation for this. Did the authors inhibit the VGAT neurons in this region at any other times and observe a similar rebound? If such a rebound was not observed it would suggest that it is something specific about this task that is producing the behaviour. I would like it if the authors could comment on this.

      We agree that results showing the change in coping strategy (passive to active) in forced swim after but not during stimulation of SuMVGAT+ neurons is quite interesting (Figure 7I). This experiment activated SuMVGAT+ neurons during a section of the forced swim assay and mice showed a robust shift to mobility after the stimulation of SuMVGAT+ neurons stopped. We did not carry out inhibition of SuMVGAT+ neurons in this manuscript. As the reviewer suggested, strong inhibition of local SuM neurons, including SUMVGLUT2+::POA neurons, could lead to rebound activity that may shift coping behaviors in confusing ways. We agree this is an interesting idea but do not have data to support the hypothesis further at this time.

      Reviewer 2

      (1) These are very difficult, small brain regions to hit, and it is commendable to take on the circuit under investigation here. However, there is no evidence throughout the manuscript that the authors are reliably hitting the targets and the spread is comparable across experiments, groups, etc., decreasing the significance of the current findings. There are no hit/virus spread maps presented for any data, and the representative images are cropped to avoid showing the brain regions lateral and dorsal to the target regions. In images where you can see the adjacent regions, there appears expression of cell bodies (such as Supp 6B), suggesting a lack of SuM specificity to the injections.

      We agree with the reviewer that the areas studied are small and technically challenging to hit. This was one of the driving motivations for using multiple tools in tandem to restrict the area targeted for stimulation. Approaches included using retrograde AAVs to express ChR2eYFP in SUMVGLUT2+::POA neurons, thereby restricting expression to VGLUT2+ neurons that project to the POA. Targeting was further limited by placement of the optic fiber over cell bodies in SuM. Thus, only neurons that are VGLUT2+, project to the POA, and were close enough to the fiber were activated by photostimulation. Regrettably, we were not able to compile images from mice where the fiber was misplaced, leading to loss of behavioral effects. We would have liked to provide those here to address this comment. Unfortunately, generating heat maps for injections is not possible for anatomic studies that use unlabeled recombinase as part of an intersectional approach. In addition, the injection site of a retroAAV can be difficult to determine accurately because neurons remote to the injection site and their processes are labeled.

      Experiments described in Supplemental Figure 6B on VGAT neurons in SuM were designed and interpreted to support the point that SUMVGLUT2+::POA neurons are a distinct population that does not overlap with GABAergic neurons. For this point it is important that we targeted SuM, but highly confined targeting is not needed to support the central interpretation of the data. We do see labeling in SuM in VGAT-Cre mice, but photostimulation of SuMVGAT+ neurons does not generate the behavioral changes seen with activation of SUMVGLUT2+::POA neurons. As the reviewer points out, SuM is a small target and viral injection is likely to spread beyond the anatomic boundaries to other VGAT+ neurons in the region, which are not the focus here. The activation would be restricted by the spread of light from the fiber over SuM (estimated to be about a 200 µm sphere in all directions). We did not further examine projections or localization of VGAT+ neurons in this study but focused on the differential behavioral effects of SUMVGLUT2+::POA neurons.

      (2) In addition, the whole brain tracing is very valuable, but there is very little quantification of the tracing. As the tracing is the first several figures and supp figure and the basis for the interpretation of the behavior results, it is important to understand things including how robust the POA projection is compared to the collateral regions, etc. Just a rep image for each of the first two figures is insufficient, especially given the above issue raised. The combination of validation of the restricted expression of viruses, rep images, and quantified tracing would add rigor that made the behavioral effects have more significance.

      For example, in Fig 2, how can one be sure that the nature of the difference between the nonspecific anterograde glutamate neuron tracing and the Sum-POA glutamate neuron tracing is real when there is no quantification or validation of the hits and expression, nor any quantification showing the effects replicate across mice? It could be due to many factors, such as the spread up the tract of the injection in the nonspecific experiment resulting in the labeling of additional regions, etc.

      Relatedly, in Supp 4, why isn’t C normalized to DAPI, which they show, or area? Similar for G what is the mcherry coverage/expression, and why isn’t Fos normalized to that?

      Thank you for highlighting the importance and value of anatomy. Two points based on the anatomic studies are central to our interpretation of the experimental data. First, SUMVGLUT2+::POA neurons are a distinct population within the SuM. We show this by demonstrating they are not GABAergic and that they do not project to dentate gyrus. Projections from SuM to dentate gyrus have been described in multiple studies (Boulland et al., 2009; Haglund et al., 1987; Hashimotodani et al., 2018; Vertes, 1992) and we demonstrate them here for SuMVGLUT2+ cells. Using an intersectional approach in VGLUT2-Flp mice we show SUMVGLUT2+::POA neurons do not project to dentate gyrus. We show cell bodies of SUMVGLUT2+::POA neurons located in SuM across multiple figures including clear brain images. Thus, SUMVGLUT2+::POA neurons are SuM neurons that are not GABAergic and that send projections to a distinct subset of targets, most notably excluding dentate gyrus. Second, SUMVGLUT2+::POA neurons arborize, sending projections to multiple regions. We show this using a combinatorial genetic and viral approach to restrict expression of eYFP to only neurons that are in SuM (based on viral injection), project to the POA (based on retrograde AAV injection in POA), and are VGLUT2+ (VGLUT2-Flp mice). Thus, any eYFP labeled projection comes from SUMVGLUT2+::POA neurons. We further confirmed projections using retroAAV injection into areas identified using anterograde approaches (Supplemental Figure 2). As discussed above in replies to Reviewer 1, we feel limitations are present that preclude meaningful quantitative analysis. We thus opted for a conservative interpretation as outlined.

      Prior studies have shown efferent projections from SuM to many areas, and projections to dentate gyrus have received substantial attention (Boulland et al., 2009; Haglund, Swanson, and Kohler, 1984; Hashimotodani et al., 2018; Soussi et al., 2010; Vertes, 1992; Pan and McNaughton, 2004). We saw many of the same projections from SuMVGLUT2+ neurons. We found no projections from SUMVGLUT2+::POA neurons to dentate gyrus (Figure 2). Our description of the SuM projection to dentate gyrus is not new, but finding a population of neurons in SuM that does not project to dentate gyrus but does project to other regions in hippocampus is new. This finding cannot be explained by spread of the virus in the tract or non-selective labeling.

      (3) The authors state that they use male and female mice, but they do not describe the n’s for each experiment or address sex as a biological variable in the design here. As there are baseline sex differences in locomotion, stress responses, etc., these could easily factor into behavioral effects observed here.

      Sex-specific effects are possible; however, the studies presented here were not designed or powered to directly examine them. One aspect of the experimental design that helps mitigate strong sex-dependent effects is that the paradigms we used often examined baseline (pre-stimulation) behavior, how behavior changed during stimulation, and how behavior returned (or not) to baseline after stimulation. Thus, we test changes in individual behaviors. Although we had limited statistical power, we conducted analyses to examine the effects of sex as a variable in the experiments and found no differences between males and females.

      (4) In a similar vein as the above, the authors appear to use mice of different genotypes (however the exact genotypes and breeding strategy are not described) for their circuit manipulation studies without first validating that baseline behavioral expression, habituation, stress responses are not different. Therefore, it is unclear how to interpret the behavioral effects of circuit manipulation. For example in 7H, what would the VGLUT2-Cre mouse with control virus look like over time? Time is a confound for these behaviors, as mice often habituate to the task, and this varies from genotype to genotype. In Fig 8H, it looks like there may be some baseline differences between genotypes- what is normal food consumption like in these mice compared to each other? Do Cre+ mice just locomote and/or eat less? This issue exists across the figures and is related to issues of statistics, potential genotype differences, and other experimental design issues as described, as well as the question about the possibility of a general locomotor difference (vs only stress-induced). In addition, the authors use a control virus for the control groups in VGAT-Cre manipulation studies but do not explain the reasoning for the difference in approach.

      Thank you for highlighting the need for greater clarity about the breeding strategies used and for these related questions. We first address the breeding strategy and then the additional concerns raised. We have added details to the methods section to address this point. For VGLUT2-Cre mice we used littermate controls from a Cre/WT x WT/WT cross. The VGLUT2-Cre line (RRID:IMSR_JAX:028863) (Vong et al., 2011) used here has been used in many other reports. We are not aware of any reports indicating a phenotype associated with the addition of the IRES-Cre to the Slc17a6 locus, and there is no expected impact on expression of VGLUT2. Also, we see in many of the experiments here that the baseline behaviors (Figures 4, 5, and 7) are not different between the Cre+ and Cre- mice. For VGAT-Cre mice we used a different breeding strategy that allowed us to achieve greater control of the composition of litters and more efficient cohorts. A Cre/Cre x WT/WT cross yielded all Cre/WT litters. The AAV injected, ChR2eYFP or eYFP, allowed us to balance the cohort.

      Regarding Figure 7H, which shows time immobile on the second day of a swim test, data from the Cre- mice demonstrate the natural course of progression during the second day of the test. The control mice in the VGAT-Cre cohort (Figure 7I) show a similar trend. The change in behavior during the stimulation period in the Cre+ mice is caused by the activation of SUMVGLUT2+::POA neurons. The behavioral shift largely, but not completely, returns to baseline when the photostimulation stops. We have no reason to believe a VGLUT2-Cre+ mouse injected with control AAV to express eYFP would be different from a WT littermate injected with AAV expressing ChR2eYFP in a Cre-dependent manner.

      Turning to concerns related to Figure 8H, which shows data from fasted mice quantifying time spent interacting with a food pellet immediately after presentation of a chow pellet, we found no significant difference between the control and Cre+ mice. We are unaware of any evidence indicating that the two groups should have a different baseline, since the Cre insertion is not expected to alter gene expression and we are unaware of reports of a phenotype relating to feeding and the presence of the transgene in this mouse line. Even if there were a small baseline shift, this would not explain the large abrupt shift induced by the photostimulation. As noted above, we saw shifts in behavior abruptly induced by the initiation of photostimulation when compared to baseline in multiple experiments. This shift would not be explained by a hypothetical difference in the baseline behaviors of littermates.

      (5) The statistics used throughout are inappropriate. The authors use serial Mann-Whitney U tests without a description of data distributions within and across groups. Further, they do not use any overall F tests even though most of the data are presented with more than two bars on the same graph. Stats should be employed according to how the data are presented together on a graph. For example, stats for pre-stim, stim, and post-stim behavior X between Cre+ and Cre- groups should employ something like a two-way repeated measures ANOVA, with post-hoc comparisons following up on those effects and interactions. There are many instances in which one group changes over time or there could be overall main effects of genotype. Not only is serially using Mann-Whitney tests within the same panel misleading and statistically inaccurate, but it cherry-picks the comparisons to be made to avoid more complex results. It is difficult to comprehend the effects of the manipulations presented without more careful consideration of the appropriate options for statistical analysis.

      We thank the reviewer for pointing this out and suggesting alternative analyses; we agree with the assessment on this topic. We have therefore extensively revised the statistical approach to our data using the suggested approach. Reviewer 1 made a similar comment, and we point to our reply to Reviewer 1’s second point for what we changed and added in the new statistical analyses. Further, we have added to the paper a full table detailing the statistical values for each figure.

      Conceptual:

      (6) What does the signal look like at the terminals in the POA? Any suggestion from the data that the projection to the POA is important?

      This is an interesting question that we will pursue in future investigations into the roles of the POA. We used the projection to the POA from SuM to identify a subpopulation in SuM and we were surprised to find the extensive arborization of these neurons to many areas associated with threat responses. We focused on the cell bodies as “hubs” with many “spokes”. Extensive studies are needed to understand the roles of individual projections and their targets. There is also the hypothetical technical challenge of manipulating one projection without activating retrograde propagation of action potentials to the soma. At the current time we have no specific insights into the roles of the isolated projection to POA. Interpretation of experiments activating only one “spoke” of the hub would be challenging. Simple terminal stimulation experiments are challenged by the need to separate POA projections from activation of passing fibers targeting more anterior structures of the accumbens and septum.

      (7) Is this distinguishing active coping behavior without a locomotor phenotype? For example, Fig. 5I and other figure panels show a distance effect of stimulation (but see issues raised about the genotype of comparison groups). In addition, locomotor behavior is not included for many behaviors, so it is hard to completely buy the interpretation presented.

      We agree with the reviewer and thank them for highlighting this fundamental challenge in studies examining active coping behaviors in rodents, which require movement. Additionally, actively responding to threatening stressors would include increased locomotor activity. Separation of movement alone from active coping can be challenging. Because of these concerns we undertook experiments using diverse behavioral paradigms to examine the elicited behaviors and the recruitment of SuMVGLUT2+::POA neurons to stressors. We conducted experiments to directly examine behaviors evoked by photoactivation of SuMVGLUT2+::POA neurons. In these experiments we observed a diversity of behaviors including increased locomotion and jumping but also treading/digging (Figure 4). These are behaviors elicited in mice by threatening and noxious stimuli. An increase in only running or only jumping could signify a specific locomotor effect, but this is not what was observed. Based on these behaviors, we expected to find evidence of increased movement in open field (Figure 5G-I) and light dark choice (Figure 5J-L) assays. For many of the assays, reporting distance traveled is not practical. An important set of experiments that argues against a generic increase in locomotion is the operant behavior experiments, which require the animal to engage in a learned behavior while receiving photostimulation of SuMVGLUT2+::POA neurons (Figure 6). This is particularly true for testing using a progressive ratio, when the time of ongoing photostimulation is longer, yet animals actively and selectively engage the active port (Figure 6G-H). Further, we saw a shift in behavioral strategy induced by photoactivation in the forced swim test (Figure 7H). Thus, activation of SuMVGLUT2+::POA neurons elicited a range of behaviors that included swimming, jumping, treading, and learned responses, not just increased movement.
      Together these data strongly argue that SuMVGLUT2+::POA neurons do not only promote increased locomotor behavior. We interpret these data together with the data from fiber photometry studies to show SuMVGLUT2+::POA neurons are recruited during acute stressors, contribute to the aversive affective component of stress, and promote active behaviors without constraining the behavioral pattern.

      Regarding genotype, we address this in the comments above as well, but we believe that clarifying the use of littermates, the extensive use of the VGLUT2-Cre line by multiple groups, and an experimental design allowing for comparison of baseline, stimulation-evoked, and post-stimulation behaviors within and across genotypes mitigate possible concerns relating to the genotype.

      (8) What is the role of GABA neurons in the SuM and how does this relate to their function and interaction with glutamate neurons? In Supp 8, GABA neuron activation also modulates locomotion and in Fig 7 there is an effect on immobility, so this seems pretty important for the overall interpretation and should probably be mentioned in the abstract.

      Thank you for noting these interesting findings. We added text to the abstract to highlight these findings. Possible roles of GABAergic neurons in SuM extend beyond the scope of the current study, particularly since SuM neurons have been shown to release both GABA and glutamate (Li Y, Bao H, Luo Y, et al. 2020, Root DH, Zhang S, Barker DJ et al. 2018). GABAergic neurons regulate dentate gyrus (Ajibola MI, Wu JW, Abdulmajeed WI, Lien CC 2021), REM sleep (Billwiller F, Renouard L, Clement O, Fort P, Luppi PH 2017), and novelty processing (Chen S, He L, Huang AJY, Boehringer R et al. 2020). The population of exclusively GABAergic vs dual neurotransmitter neurons in SuM requires further dissection to be understood. How they may relate to SuMVGLUT2+::POA neurons requires further investigation.

      Questions about figure presentation:

      (9) In Fig 3, why are heat maps shown as a single animal for the first couple and a group average for the others?

      Thank you for highlighting this point for further clarification. We modified the labels in the figure to help make clear which panels are from one animal across multiple trials and which are from multiple animals. In the ambush assay each animal had only one trial, to avoid habituation to the mock predator. Accordingly, we do not have multiple trials for each animal in this test. In contrast, the dunk assay (10 trials/animal) and the shock assay (5 trials/animal) had multiple trials for each animal. When there are multiple trials per animal, we present data from a representative animal as well as the aggregate data.

      Why is the temporal resolution for J and K different even though the time scale shown is the same?

      Thank you for noticing this error carried forward from a prior draft of the figure so we could correct it. We replaced the image in 3J with a more correctly scaled heatmap.

      What is the evidence that these signal changes are not due to movement per se?

      Thank you for the question. There are two points of evidence. First, all the 465 nm excitation (Ca2+ dependent) data were collected in interleaved fashion with 415 nm (isosbestic) excitation data. The isosbestic signal is derived from GCaMP emission but is independent of Ca2+ binding (Martianova E, Aronson S, Proulx CD. 2019). This approach, time-division multiplexing, can correct the calcium-dependent signal for changes that are most often due to mechanical artifacts. The second piece of evidence is experimental. Using multiple cohorts of mice, we examined whether the change in Ca2+ signal was correlated with movement. We used the threshold of velocity of movement seen following the ambush. We found no correlation between high velocity movements and Ca2+ signal (Figure 3K), including in cross-correlational analysis (Supplemental Figure 5). Based on these points together we conclude that the change in the Ca2+ signal in SuMVGLUT2+::POA neurons is not due to movement-induced mechanical changes, and we find no correlation to movement unless a stressor is present, i.e. mock predator ambush or forced swim. Further, the stressors evoke very different locomotor responses: fleeing, jumping, or swimming.
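
      One common way to apply an isosbestic correction of this kind is sketched below. It is an illustration under stated assumptions, not the pipeline used in the study: the 415 nm trace is fit to the 465 nm trace by ordinary least squares, and the 465 nm signal is expressed relative to that fit. The function name and the toy traces are hypothetical.

```python
def isosbestic_dff(sig_465, sig_415):
    """Return dF/F of the 465 nm trace relative to a linear fit of the 415 nm trace.

    Assumption: mechanical artifacts scale both channels together, so an
    ordinary least-squares fit of 415 nm onto 465 nm captures them.
    """
    n = len(sig_465)
    if n != len(sig_415) or n < 2:
        raise ValueError("traces must have equal length >= 2")
    mean_x = sum(sig_415) / n
    mean_y = sum(sig_465) / n
    sxx = sum((x - mean_x) ** 2 for x in sig_415)
    if sxx == 0:
        raise ValueError("415 nm trace is constant; cannot fit")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(sig_415, sig_465))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # Fitted 415 nm trace serves as the artifact-only baseline.
    fitted = [slope * x + intercept for x in sig_415]
    return [(y - f) / f for y, f in zip(sig_465, fitted)]


# Toy traces (hypothetical): the 465 nm channel contains only the shared artifact.
motion = [1.0, 1.1, 0.9, 1.0, 1.2]
artifact_only = isosbestic_dff([2 * x for x in motion], motion)
```

      When the 465 nm trace contains nothing but the shared artifact, as in the toy traces, the corrected signal is flat, which is the behavior the correction is meant to provide.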

      (10) In Fig 4, the authors carefully code various behaviors in mice. While they pick a few and show them as bars, they do not show the distribution of behaviors in Cre- vs Cre+ mice before manipulation (to show they have similar behaviors) or how these behaviors shift categories in each group with stimulation. Which behaviors in each group are shifting to others across the stim and post-stim periods compared to pre-stim?

      This is an important point. We selected behaviors to highlight in Figure 4C-E because these behaviors are exhibited in response to stress (De Boer & Koolhaas, 2003; van Erp et al., 1994). For the highlighted behaviors (jumping, treading/digging, and grooming), we show baseline (pre-photostimulation), stimulation, and post-stimulation data for Cre+ and Cre- mice with the values for each animal plotted. We show all nine behaviors as a heat map in Figure 4B. The panels show changes that may occur as a function of time and show changes induced by photostimulation.

      The heatmaps demonstrate that photostimulation of SUMVGLUT2+::POA neurons causes a suppression of walking, grooming, and immobile behaviors with an increase in jumping, digging/treading, and rapid locomotion. After stimulation stops, there is an increase in grooming and time immobile. The control mice show a range of behaviors with no shifts noted with the onset or termination of photostimulation.

      Of note, issues of statistics, genotype, and SABV are important here. For example, the hint that treading/digging may have a slightly different pre-stim basal expression, it seems important to first evaluate strain and sex differences before interpreting these data.

      We examined the effects of sex as a biological variable in the experiments reported in the manuscript and found no differences between males and females in any of the experiments where we had enough animals of each sex (minimum of 5 mice) for meaningful comparisons. We did this by comparing means and SEM of males and females within each group (e.g. Cre+ males vs Cre+ females, Cre- males vs Cre- females) and then conducting a t-test to see if there was a difference. For figures that show time as a variable (e.g. Figure 6C-E), we compared males and females with time and sex as main factors (including multiple comparisons if needed). We found no significant main effects of sex or interactions. Because of this, and to maximize statistical power, we decided to keep males and females together in all the analyses presented in the manuscript. It is also worth noting that the core of the experimental design employed is a change in behavior caused by photostimulation. The mice are also the same strain, with the only difference being the modification to add an IRES and sequence for Cre behind the coding sequence of the Slc17A6 (VGLUT2) gene.

      (11) Why do the authors use 10 Hz stimulation primarily? is this a physiologically relevant stim frequency? They show that they get effects with 1 Hz, which can be quite different in terms of plasticity compared to 10 Hz.

      Thank you for raising this important question. Because tests like open field and forced swim are subject to habituation and cannot be run multiple times per animal, a single test frequency was needed for consistency across multiple experiments. The frequency of 10 Hz was selected because it falls within the range of reported firing rates for SuM neurons (Farrel et al., 2021; Pedersen et al., 2017) and because of the robust but submaximal effects seen in the real-time place preference assays. Identification of the native firing rates during stress responses would be ideal, but gathering these data for the identified population remains a daunting task.

      (12) In Fig 5A-F, it is unclear whether locomotion differences are playing a role. Entrances (which are low for both groups) are shown but distance traveled or velocity are not.

      In B, there is no color in the lower left panel. where are these mice spending their time? How is the entirety of the upper left panel brighter than the lower left? If the heat map is based on time distribution during the session, there should be more color in between blue and red in the lower left when you start to lose the red hot spots in the upper left, for example. That is, the mice have to be somewhere in apparatus. If the heat map is based on distance, it would seem the Cre- mice move less during the stim.

      We appreciate the opportunity to address this question, and the attention to detail the reviewer applied to our paper. In the real-time place preference (RTPP) test, stimulation was only provided while the animal was on the stimulation side. Mice quickly leave the stimulation side of the arena, as seen in the supplemental video, particularly at the higher frequencies. Thus, the total time during which stimulation is applied is quite short. The mice often retreat to a corner after entering the stimulation side during trials using higher frequency stimulation. Changing locomotor activity alone could drive changes in the number of entrances, but we did not find this. In regard to the heat map, the color scale is dynamically set for each of the paired examples, each of which is pulled from a single trial. To maximize the visibility between the paired examples, the color scale does not transfer between the trials. As a result, in the example for 10 Hz the mouse spent a larger amount of time in the area corresponding to the lower right corner of the image, and the maximum value of the color scale is assigned to that region. As seen in the supplemental video, mice often retreated to the corner of the non-stimulation side after entering the stimulation side. The control animal did not spend a concentrated amount of time in any one region, thus there is a lack of warmer colors. In contrast, in the baseline condition both Cre+ and Cre- mice spent time in areas distributed on both sides of the arena, as expected. As a result, the maximum value in the heat map is lower and more areas are coded in warmer colors, allowing for easier visual comparison between the pair. Using the scale for the 10 Hz pair across all pairs leads to mostly dark images. We considered ways to optimize visualization across and within pairs and focused on the within-pair comparison for visualization.
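
      The effect of setting the color scale per pair rather than sharing one scale can be illustrated with a small sketch. The occupancy values, the four-bin layout, and the function names below are hypothetical and are not taken from the data; the sketch only shows why a shared scale anchored to the 10 Hz maximum would render the baseline pair mostly dark.

```python
def pair_max(pair):
    """Largest time-in-bin value across both maps of a Cre+/Cre- pair."""
    return max(max(m) for m in pair)


def scale_maps(maps, vmax):
    """Scale each occupancy map (a list of time-in-bin values) to 0..1 using vmax."""
    return [[t / vmax for t in m] for m in maps]


# Hypothetical time-in-bin values (seconds) for four spatial bins.
baseline_pair = [[30, 28, 31, 29], [27, 33, 30, 28]]  # Cre+, Cre- at baseline
stim_pair = [[2, 3, 5, 110], [29, 31, 30, 30]]        # Cre+, Cre- at 10 Hz

# Per-pair scaling: each pair uses its own maximum.
baseline_own = scale_maps(baseline_pair, pair_max(baseline_pair))
stim_own = scale_maps(stim_pair, pair_max(stim_pair))

# Shared scaling: the 10 Hz maximum is applied to the baseline pair as well.
baseline_shared = scale_maps(baseline_pair, pair_max(stim_pair))
```

      In this toy case the baseline maps span the full scale when scaled on their own, but sit in the bottom third of the scale when the 10 Hz maximum is imposed on them. The Cre- map in the 10 Hz pair also stays in the cool range because the pair's maximum is set by the corner where the Cre+ mouse concentrated its time.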

      (13) By starting with 1 hz, are the experimenters inducing LTD in the circuit? what would happen if you stop stimming after the first epoch? Would the behavioral effect continue? What does the heat map for the 1 hz stim look like?

      Relatedly, it is a lot of consistent stimulation over time and you likely would get glutamate depletion without a break in the stim for that long.

      Thank you for the opportunity to add clarity around this point regarding the trials in RTPP testing. Importantly, the trials were not carried out in order of increasing frequency of stimulation, as plotted. Rather, the order of trials was, to the extent possible with the number of mice, counterbalanced across the five conditions. Thus, possible contribution of effects of one trial on the next were minimized by altering the order of the trials.

      We have added a heat map for the 1 Hz condition to figure 5B.

      For experiments on RTPP the average stimulation time at 10Hz was less than 10 seconds per event. As a result, the data are unlikely to be affected by possible depletion of synaptic glutamate. For experiments using sustained stimulation (open field or light dark choice assays) we have no clear data to address if this might be a factor where 10Hz stimulation was applied for the entire trial.

      (14) In Fig 6, the authors show that the Cre- mice just don't do the task, so it is unclear what the utility of the rest of the figure is (such as the PR part). Relatedly, the pause is dependent on the activation, so isn't C just the same as D? In G and H, why is a subset of Cre+ mice shown?

      Why not all mice, including Cre- mice?

      Thank you for the opportunity to improve the clarity of this section. A central aspect of the experiments in Figure 6 is the aversiveness of SUMVGLUT2+::POA neuron photostimulation, as shown in Figure 5B-F. The aversion to photostimulation drives task performance in the negative reinforcer paradigm. The mice perform a task (active port activation) to terminate the negative reinforcer (photostimulation of SuMVGLUT2+::POA neurons). Accordingly, control mice are not expected to perform the task because SuMVGLUT2+::POA neurons are not activated and, thus the mice are not motivated to perform the task.

      A central point we aim to convey in this figure is that while SuMVGLUT2+::POA neurons are being stimulated, mice perform the operant task. They selectively activated the active port (Supplemental Figure 7). As expected, control mice activate the active port at a low level in the process of exploring the arena. This diminishes on subsequent trials as mice habituate to the arena (Figure 6D). The data in Figures 6C and D are related but can diverge. In an FR1 test each pause in stimulation requires one port activation, but the number of port activations can exceed the number of pauses, which are 10 seconds long, if the animal continues to activate the port. Comparing the data in Figures 6C and D reveals that mice generally activated the port two to three times for each pause earned, with a trend towards greater efficiency on day 4, with more rewards and fewer activations.
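
      The relationship between port activations and pauses earned can be sketched as follows. This is a minimal illustration that assumes an activation earns a pause only while stimulation is on, and that activations during an ongoing pause are recorded but earn nothing; the function name and timestamps are hypothetical.

```python
def count_pauses(activation_times, pause_s=10.0):
    """Count pauses earned on an FR1 schedule.

    Assumption: an activation earns a pause only if stimulation is
    currently on; activations during an ongoing pause are recorded
    but do not earn or extend a pause.
    """
    pauses = 0
    pause_end = float("-inf")
    for t in sorted(activation_times):
        if t >= pause_end:
            pauses += 1
            pause_end = t + pause_s
    return pauses


# Hypothetical activation timestamps in seconds.
activations = [5.0, 7.5, 9.0, 20.0, 21.0, 40.0]
earned = count_pauses(activations)
activations_per_pause = len(activations) / earned
```

      With these toy timestamps, several activations fall inside a pause that has already been earned, so the activation count exceeds the pause count, which is the divergence between the two measures described above.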

      The purpose of the progressive ratio test is to examine whether photostimulation of SuMVGLUT2+::POA neurons continues to drive behavior as the effort required to terminate the negative stimulus increases. As seen in Figures 6G and H, the stimulation of SuMVGLUT2+::POA neurons remains highly motivating. In the 20-minute trial we did not find a break point even as the number of port activations required to pause the stimulation exceeded 50. We do not show the Cre- mice in Figures 6G and H because they did not perform the task, as seen in Figure 6F. For technical reasons in early trials, we have fully time-stamped data for rewards and port activations from only a subset of the Cre+ mice. Of note, this subset contains both the highest and lowest performing mice from the entire data set.

      Taken together, we interpret the results of the operant behavioral testing as demonstrating that SuMVGLUT2+::POA neuron activation is aversive, can drive performance of an operant task (as opposed to fixed escape behaviors), and is highly motivating.

      (15) In Fig 7, what does the GCaMP signal look like if aligned to the onset of immobility? It looks like since the hindpaw swimming is short and seems to precede immobility, and the increase in the signal is ramping up at the onset of hindpaw swimming, it may be that the calcium signal is aligned with the onset of immobility.

      What does it look like for swimming onset?

      In I, what is the temporal resolution for the decrease in immobility? Does it start prior to the termination of the stim, or does it require some elapsed time after the termination, etc?

      Thank you for the opportunity to address these points and improve the clarity of our interpretation of the data. Regarding aligning the Ca2+ signal from fiber photometry recordings to swimming onset and offset, it is important to note that the swimming bouts are not the same length. As a result, in the time prior to alignment to the offset of behaviors, animals will have been swimming for different lengths of time. In Figure 7C, we use the behavioral heat map to convey the behavioral average. Below we show the Ca2+ dependent signal aligned at the offset of hindpaw swimming for an individual mouse (A) and for the total cohort (B). This alignment shows that the Ca2+ dependent signal declines with the termination of hindpaw swimming. Because these bouts last less than the total window shown, the data are largely included in Figures 7C and D, which are aligned to onset. Due to the subtlety of the difference in the alignment and the partial redundancy, we elected to include the requested alignment to swimming offset in the reply rather than in the primary figure.

      Author response image 1.

      Turning to the question regarding swimming onset, the animals started swimming immediately when placed in the water and maintained swimming and climbing behaviors until shifting behaviors as illustrated in Figure 7A and B. During this time the Ca2+-dependent signal was elevated but there is only one trial per animal. This question can perhaps be better addressed in the dunk assay presented in Figure 3C, F and G and Supplemental Figure 4 H and I. Here swimming started with each dunk and the Ca2+ signal increased.

      Regarding the question about Figure 7I, we scored entire periods (2 min) in aggregate. We noted in videos of the behavior test that there was an abrupt decrease in immobility tightly corresponding to the end of stimulation. In a few animals this shift occurred approximately 15-20 s before the end of stimulation. This may relate to the depletion of neurotransmitter as suggested by the reviewer.

      Reviewer 3

      Major points

      (1) Results in Figure 1 suggested that SuM-Vglu2::POA projected not only POA but also to the diverse brain regions. We can think of two models which account for this. One is that homogeneous populations of neurons in SuM-Vglu2::POA have collaterals and innervated all the efferent targets shown in Figure 1. Another is to think of distinct subpopulations of neurons projecting subsets of efferent targets shown in Figure 1 as well as POA. It is suggested to address this by combining approaches taken in experiments for Figure 1 and Supplemental Figure 2.

      Thank you for raising this interesting point. We have attempted combining retroAAV injections into multiple areas that receive projections from SuMVGLUT2+::POA neurons. However, we have found the results unsatisfactory for separating the two models proposed. Using eYFP and tdTomato expression, we saw some overlapping expression in SuM. We are not able to conclude whether this indicates separate populations or partial labeling of a homogeneous population. A third option seems possible as well: there could be a mix of neurons projecting to different combinations of downstream targets. This seems particularly difficult to address using fluorophores. We are preparing to apply additional methodologies to this question, but it extends beyond the scope of this manuscript.

      (2) Since the authors drew a hypothetical model in which the diverse brain regions mediate the effect of SuM-Vglu2::POA activation in behavioral alterations at least in part, examination of the concurrent activation of those brain regions upon photoactivation of SuM-Vglu2::POA is suggested. This would help the readers to understand which neural circuits act upon the induction of active coping behavior under stress.

      Thank you for raising this important point. We agree that activating glutamatergic neurons should lead to activation of postsynaptic neurons in the target regions. Delineating this in vivo is less straightforward. Doing so requires much greater knowledge of the postsynaptic partners of SuMVGLUT2+::POA neurons. There are a number of issues that would need to be accounted for. Undertaking two-color photostimulation plus fiber photometry is possible but not technically trivial. Further, it is possible that we would measure Ca2+ signals in neurons that have no relevant input, or that local circuits in a region may shape the signal. We would also lack the temporal resolution to distinguish monosynaptic from polysynaptic connections. Thus, we would struggle to know if the change in signal was due to the excitatory input from SuM or from a second region. At present, we remain unclear on how to pursue this question experimentally in a manner that is likely to generate clearly interpretable results.

      (3) In Figure 4, "active coping behaviors" must be called "behaviors relevant to the active behaviors" or "active coping-like behaviors", since those behaviors were in the absence of stressors to cope with.

      Thank you for the suggestion on how to clarify our terminology. We have adopted the active coping-like term.

      (4) For the Dunk test, it is suggested to describe the results and methods more in detail, since the readers would be new to it. In particular, the mice could change their behavior between dunks under this test, although they still showed immobility across trials as in Supplemental Figure 4I. Since neural activity during the test was summarized across trials as in Figure 3, it is critical to examine whether the behavior changes according to time.

      Thank you for identifying this opportunity to improve our manuscript. We have expanded and added a detailed description of the dunk test in the methods section.

      As for Supplemental Figure 4I, we apologize for the confusion because the purpose of this figure is to show that mice remained mobile for the entire 30-second dunk trial. This did not appreciably change over the 10 trials. We have revised this figure to plot both immobile and mobile time to achieve greater clarity on this point.

      Minor points

      Typos

      In Figure 1, please add a serotype of AAVs to make it compatible with other figures and their legends.

      In the main text and Figure 2K, the authors used MHb/LHb and mHb/lHb in a mixed fashion. Please make them unified.

      In the figure legend of Figure 6, change "SuMVGLUT2+::POA neurons drive" to "SuMVGLUT2+::POA neurons " in the title.

      In line 86, please change "Retro-AAV2-Nuc-flox(mCherry)-eGFP" to "AAV5-Nuc-flox(mCherry)eGFP".

      In line 80, please change "Positive controls" to "As positive controls, ".

      Thank you for taking the time and making the effort to identify and call these out. We have corrected them.

    1. Author Response

      The following is the authors’ response to the previous reviews

      The revised manuscript is much improved - many unclear points are now better explained. However, in our opinion, some issues could still be significantly improved.

      1. Statistics: none of us are experts in statistics but several things remain questionable in our opinion and if it were our study, we would consult with an expert:

      a) while we understand the authors note about N-chasing and p-hacking, we wonder how the number of N's was premeditated before obtaining the results. Why in 4M an N of 3 is sufficient while in 3E the N is >20 (and not mentioned). At the very least, we think it would be wise to be cautious when stating something as not-significant when it is clear (as in 4M) that the likelihood of it actually being statistically significant is quite large.

      b) In most analyses, the data is not only normalized by actin or some other measure but also to the first (i.e left side on the graph) condition, resulting in identical data points that equal '1' (in Figure 4 alone - C; I; K; M; and O) - while this might be scientifically sound, it should be mentioned (the specific normalization) and also note that this technique shadows any real variance that exists in the original data in this condition. consider exploring techniques to overcome this issue.

      c) In 3C, - if we understand the experiment, you want to convince us that the DIFFERENCE between eB2-FC compared to FC is larger in the control compared to the experiment. We are not absolutely sure that the statistical tools employed here are sufficient - which is why we would consult an expert.

      A) We are aware that many studies do not consistently quantify such experiments. For example, there are essentially no published examples of the signalling timelines of EphB2 receptors as in Fig. 5. Because we strive to quantify such biochemical effects, an unquantified experiment stands out, and so perhaps we were too strict in trying to quantify as many experiments as possible, resulting in low n’s for some of them. We acknowledge that additional experiments on EPHB1 protein stability may reach significance. We have adjusted our text on lines 332-335 to point to this interesting trend, and slightly changed the conclusion to this section. Similarly, we commented on similar trends when describing Figs. 1E and 4G on lines 901 and 952.

      B) For the Western blot band intensity normalisation, we believe that our method is scientifically sound. Normally, when the replicate samples are loaded on one gel and blotted on the same membrane, the experimenter only needs to normalise the target band intensity to its cognate loading control band intensity for quantitation. However, we usually have a large number of samples from multiple experiments, carried out on different dates. For example, in Fig. 4B,C there are 7 biological replicates collected from 7 experiments and in Fig. 4D there are 10 protein samples. It is not possible for us to run all samples on the same gel. In addition, due to the combined effects of variance in transfer efficiency, the potency of antibodies, detection efficiency and the developing time for each blot, it is practically impossible to generate similar band intensity for each batch. Thus, we use normalisation of test bands to the loading control for individual experiments, and this analysis method is widely accepted by reputable journals with a focus on biochemical experiments (for example: PMID 37695914: Fig. 3 A,B,C; PMID 36282215: Fig. 3 B,C,D,E; PMID 33843588: Fig. 3 C,D,E,F,G,H). Since the value of the first sample on the plot is 1, which is a hypothetical value and does not meet the parametric test requirement, we performed a one-sample t-test when other samples are compared with the first sample (PMID 35243233 Fig. 6 A,B,C,D; https://www.graphpad.com/quickcalcs/oneSampleT1/, “A one sample t-test compares the mean with a hypothetical value. In most cases, the hypothetical value comes from theory. For example, if you express your data as 'percent of control', you can test whether the average differs significantly from 100.”). Thus, we believe that our normalisation and statistical methods are both correct, with a large number of precedents.
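
      The one-sample t-test described above can be sketched in a few lines. The fold-change values and the function name below are hypothetical and serve only to show the calculation: each replicate is expressed relative to its own control lane, so the control equals 1 by construction and the test asks whether the mean of the treated values differs from 1.

```python
import math
import statistics


def one_sample_t(values, hypothetical=1.0):
    """Return (t statistic, degrees of freedom) for H0: mean == hypothetical."""
    n = len(values)
    if n < 2:
        raise ValueError("need at least two replicates")
    mean = statistics.mean(values)
    sd = statistics.stdev(values)  # sample SD (n - 1 denominator)
    if sd == 0:
        raise ValueError("zero variance; t is undefined")
    t = (mean - hypothetical) / (sd / math.sqrt(n))
    return t, n - 1


# Hypothetical fold changes from 7 replicates: treated band / loading control,
# divided by the same blot's control-lane ratio.
fold_changes = [0.62, 0.71, 0.55, 0.80, 0.66, 0.59, 0.74]
t_stat, df = one_sample_t(fold_changes)
```

      The p value follows from the t distribution with the returned degrees of freedom. Note that only the treated values contribute variance in this test, since the control is fixed at 1.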

      C) This comment refers to the cell collapse experiment shown in Fig. 3C for which the data are plotted in Fig. 3D. We stand by the statistical method used. There are two groups of cells (CTRLCRISPR and MYCBP2 CRISPR) and two treatments for each cell group (Fc control and eB2), thus we should use two-way ANOVA. Since we compared the cell retraction effects of Fc and eB2 on the two groups of cells, Sidak post hoc comparison is the right method to avoid errors introduced by multiple comparisons. Here is an example of an eLife article that used the same statistical method for similar comparisons: PMID 37830910, Fig. 1 H,I. To make the comparison easier, we grouped the experiments by cell type (CTRLCRISPR and MYCBP2 CRISPR) as opposed to by treatment. Below, the old version is on the right, and the new version is on the left. The conclusion is that eB2 induces less cell collapse in cells depleted of MYCBP2, when compared to the control cells. However, eB2 is still able to collapse cells lacking MYCBP2.

      Author response image 1.
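
      For readers unfamiliar with the Sidak adjustment mentioned above, its core formula can be sketched as follows. This is only the adjustment step applied to already-computed p values; the p values and function name are hypothetical, and the sketch does not reproduce the two-way ANOVA itself.

```python
def sidak_adjust(p_values):
    """Sidak-adjusted p values for a family of m comparisons: 1 - (1 - p)^m."""
    m = len(p_values)
    return [1.0 - (1.0 - p) ** m for p in p_values]


# Hypothetical unadjusted p values for a family of two comparisons
# (for example, Fc vs eB2 within each of the two cell lines).
raw = [0.001, 0.02]
adjusted = sidak_adjust(raw)
```

      Each adjusted value is at least as large as its unadjusted counterpart, which is how the procedure controls the family-wise error rate across the comparisons.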

      Revisiting these data, we noticed an error introduced when CC compiled the data used to generate Fig. 3D. The data were acquired from nine biological replicates per condition. CC used a mix of two methods for cell collapse rate calculation: the first method involved the sum of collapsed cells and all cells from multiple regions of one coverslip (biological replicate). The second method involved computing a collapse rate in each region which then was used to calculate the average collapse rate for the entire coverslip (technical replicate). Given the small cell numbers due to sparse culture conditions, we believe that the first method is a more conservative approach. We hence re-plotted all replicate data using the first method. This resulted in slightly different % collapse and p values. These were changed accordingly in the text and plot and do not affect the conclusion of this experiment.
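
      The difference between the two calculation methods can be made concrete with a small sketch. The cell counts and function names are hypothetical; the example only shows how a sparse region can pull the averaged per-region rate away from the pooled rate.

```python
def pooled_rate(regions):
    """First method: total collapsed cells / total cells across a coverslip's regions."""
    collapsed = sum(c for c, _ in regions)
    total = sum(n for _, n in regions)
    return collapsed / total


def mean_of_region_rates(regions):
    """Second method: average of the per-region collapse rates."""
    return sum(c / n for c, n in regions) / len(regions)


# Hypothetical (collapsed, total) counts for three regions of one coverslip.
coverslip = [(1, 2), (3, 20), (4, 18)]
pooled = pooled_rate(coverslip)
averaged = mean_of_region_rates(coverslip)
```

      In this toy case the region with only two cells contributes a 50% rate that carries the same weight as the denser regions under the second method, whereas pooling weights every cell equally.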

      2) thanks for the clarification that the interaction between the extracellular domain of EPHB2 and MYCBP2 might not occur directly - however, unless we missed this it was not clearly stated in the text. It is an important point and also a cool direction for the future - to find the elusive co-receptor that actually helps EPHB2 and MYCBP2 form a complex.

      We now also refer to this in the results section on line 215.

      “Since EPHB2 is a transmembrane protein and MYCBP2 is localised in the cytosol, these experiments suggest that the interaction between the extracellular domain of EPHB2 and MYCBP2 might be indirect and mediated by other unknown transmembrane proteins.”

      3) The Hela CRISPR cell line is better explained in the response letter but still not sufficiently explained in the text for a non-expert reader. If the authors want any reader to comprehend this, we would strongly recommend adding a scheme.

      We now include a schematic outlining the CRISPR cell generation as Fig. 3A and its description on line 926.

      Author response image 2.

      4) To clarify some of our previous (and persisting) concerns about Figure 3D/E - it is true that a reduction in 25% of cell size is dramatic. But (if we understand correctly) your claim is that a reduction in 22% (this is a guess, as the actual numbers are not supplied) is significantly less than 25%. Even if it is, statistically speaking, significant, what is the physiological relevance of this very slight effect? In this experiment, the N was quite large, and we wonder if the images in D are representative - it would be nice to label the data points in E to highlight which images you used.

      We now mention the average cell area contraction measurements in the legend to Fig. 3F on line 935. We also tracked down the individual cells shown in Fig. 3E and they are now labelled as data points in blue in Fig. 3F. HeLa cell collapse is a simplified model of EPHB2 function and we do not know whether the difference between the behaviour of CTRLCRISPR and MYCBP2 CRISPR cells is physiologically significant and thus we prefer not to speculate on this.

      5) Figure 3F and other stripe assays - In the end, it is your choice how to quantify. We believe that quantifying area of overlap is a more informative and objective measurement that might actually benefit your analyses. That said, if you do keep the quantification as it is now, you have to define the threshold of what you mean by "cell/s (or an axon in 7A, where it is even more complicated: are you alluding to primary, secondary, or even smaller branches?) are RESIDING within the stripe". Is 1% overlap sufficient or do you need 10 or 50% overlap?

      We have now added this statement to the methods on line 745: “A cell was considered to be on an ephrin-B2 stripe when more than 50% of its nucleus was located on that stripe”. For the chick explant stripe assay, when measuring the length of an axon on a stripe, we only measured the main axons originating from the explants.

      For explant/stripe experiments in Fig. 7 AB, we now use the term “GFP-expressing neurite” rather than “branch”. This was already present in the results of the previous version, but the methods and legend needed to be brought up to date (lines 786 and 1008). We think that “branch” was a confusing term that was supposed to mean the same thing as “neurite” but came across as some indication of branching. We do not know whether the GFP+ neurites were primary or secondary extensions of explants, or in fact, whether some of them contained more than one axon. We also adjusted the method to reflect the fact that some stripes were used in conjunction with a single explant and added a reference to a previous study extensively using this method (Poliak et al., 2015) on line 778.

      6) We still don't get the link to the lysosomal degradation. Your data suggests that in your cells EPHB2 is primarily degraded by the lysosomal pathway and not proteasome. Any statement about MYCBP2 is not strongly supported by the data, in our opinion - Unless you develop some statistical measurement that shows that the effect of BafA1 is statistically different in MYCBP2 cells than in control cells. Currently, this is not the case and the link is therefore not warranted in our opinion.

      We generated a new version of Fig. 4K with average increase in EPHB2 levels in the presence of BafA1 and CoQ, compared to DMSO treated controls (see below). BafA1 and CoQ restored EPHB2 protein levels by 19% and 14% respectively in CtrlCRISPR cells, while the inhibitors restored EPHB2 protein levels by 40% and 35% respectively in MYCBP2 CRISPR cells.

      Author response image 3.

      For each of the 4 replicates, the increase in EPHB2 levels by BafA1 compared to DMSO is as follows:

      Author response table 1.

      These values are not significantly different between CtrlCRISPR cells and MYCBP2 CRISPR cells (p = 0.08, Student’s t test). The same holds for the CoQ experiment. We now temper our conclusion for this experiment: Although the difference in percentage increase between CTRLCRISPR cells and MYCBP2CRISPR cells is not significant, this trend raises the possibility that the loss of MYCBP2 promotes EPHB2 receptor degradation through the lysosomal pathway (line 319). We also adjusted the section title (line 306).

      7) While the C. elegans part is now MUCH better explained - we are not sure we understand the additional insight. The fact that vab-1 and glo4 double mutants are additive as are vab1 and fsn1, suggest they act in parallel (if the mutants are NULL, and not if they are hypomorphs, if one wants to be accurate) - how this relates to your story is unclear. The vab1/rpm1 double mutant is still uninformative and incomplete. rpm1 phenotype is so severe that nothing would make it more severe. We read the Jin paper that the authors directed to - nothing makes the rpm1 phenotype more severe. Yes, some DOWNSTREAM elements make the rpm1 phenotype LESS severe - this is not something you were testing, to the best of our knowledge. Rather, you wanted to see if rpm1 mutant resulted in stabilization of vab1 and thus suppression of vab1 phenotype - we are just not sure the system is amenable to test (actually reject) your hypothesis that Vab1 is degraded by rpm1. Also, assuming we are talking about NULLs, the fact that the rpm1 phenotype is WAY stronger than the vab1 mutant, suggests that rpm1 functions via multiple routes, adding even more complexity to the system. Given these results, despite the much improved clarity, we are still not sure that the worm data adds new insight, rather than potentially confusing the reader.

      We realise that the genetic interactions between vab-1 and the RPM-1/MYCBP2 signalling network are complicated. However, we insist on keeping the data for the sake of its availability for future studies and completeness. We also think it is important for readers and the community to see these data, even if the authors and reviewers are not entirely in agreement about the importance/interpretation of experimental outcomes. It is our hope that the community will examine the results and draw their own conclusions.

      A few points of clarification:

      The C. elegans experiments were designed to test genetically if the vertebrate interactions between EPHB2 and MYCBP2 and its signalling network are conserved. We studied two kinds of interactions: (1) between vab-1 and RPM-1/MYCBP2 downstream proteins (GLO-4 and FSN-1) and (2) between vab-1 and rpm-1. For these studies, we used null alleles for vab-1, glo-4 and fsn-1 which is now noted on lines 440, 453, 475 and 859. Our findings are consistent with the VAB-1 Ephrin receptor functioning in parallel to known RPM-1 binding proteins. This is further supported by new data: vab-1; fsn-1 double mutants showed enhanced incidence of axon overextension defects using a second transgenic background, zdIs5 (Pmec-4::GFP), to visualize axon termination (Fig. 8F).

      This second transgenic background also allowed us to generate new data to address your concerns about phenotypic saturation in rpm-1 mutants. To do this, we used the zdIs5 (Pmec-4::GFP) genetic background, in which axon termination defects are not saturated in rpm-1 mutants (Fig. 8F) because they can be enhanced by other mutants such as cdc-42 and unc-33 (Fig. 7C, D, in Borgen et al. Development 144, 4658–4672 (2017), PMID 29084805). In this new background, we found that vab-1 loss of function fails to enhance the incidence of severe “hook” defects in rpm-1 mutants, which is an indication that the two genes function in the same pathway. Importantly, prior studies in this background also showed that mutants in the RPM-1 signalling network (e.g. fsn-1, glo-4 and ppm-2) do not enhance the incidence of severe “hook” defects as double mutants with rpm-1 compared to rpm-1 single mutants (Fig. 7B, ibid.).

      To reflect these ideas more clearly, we revised the Results section pertaining to C. elegans genetics (starting on line 418) and tempered our discussion (line 517). Basically, this section now says that we studied genetic interactions between vab-1 and the RPM-1/MYCBP2 signalling network. From these experiments we conclude that: (1) The enhancement of overextension defects in vab-1; glo-4 and vab-1; fsn-1 double mutants compared to single mutants indicates that VAB-1/EPHR functions in parallel to known RPM-1 binding proteins to facilitate axon termination, and (2) Since the vab-1; rpm-1 double mutants do not display an increased frequency or severity of overextension defects compared to rpm-1 single mutants, VAB-1/EPHR functions in the same genetic pathway as RPM-1/MYCBP2.

      The new genetic data included in this version were generated by Karla J. Opperman who is now included as a co-author.

      Further corrections:

      Author response image 4.

      Because of the errors associated with quantifications in Fig. 3D (see above), we reviewed other quantification methodologies and noticed another discrepancy that required a correction. In the hippocampal neuron growth cone collapse assay shown in the previous version of Fig. 7 D (left), the growth cones were classified into three groups: 1, fully collapsed; 2, hard to tell, but not fully collapsed; 3, fan-shape cones. Two different quantifications were performed as follows: (1), number of fully collapsed cones divided by the numbers of all growth cones; (2), number of fully collapsed cones divided by [number of fully collapsed cones + fan-shape cones]. CC erroneously used the second method to generate Fig. 7D.

      We think that the first method is more appropriate. Furthermore, since n=5 for the Fc and eB1-Fc conditions, but n=3 for the eB2-Fc condition, we decided to omit the eB2-Fc condition. The final plot for figure 7D is the following:

      Author response image 5.

      Our conclusion still stands that exogenous FBD1 WT overexpression impaired the growth cone collapse mediated by EphB.

    1. Author response:

      The following is the authors’ response to the original reviews.

      Public Reviews:

      Reviewer #1 (Public Review):

      Summary:

      In this paper, Steinemann et al. characterized the nature of stochastic signals underlying the trial-averaged responses observed in the lateral intraparietal cortex (LIP) of non-human primates (NHPs), while the animals performed the widely used random dot direction discrimination task. Ramp-up dynamics in the trial-averaged LIP responses were reported in numerous papers before. However, the temporal dynamics of these signals at the single-trial level have been subject to debate. Using large-scale neuronal recordings with Neuropixels in NHPs allows the authors to settle this debate rather compellingly. They show that drift-diffusion-like computations account well for the observed dynamics in LIP.

      Strengths:

      This work uses innovative technical approaches (Neuropixel recordings in behaving macaque monkeys). The authors tackle a vexing question that requires measurements of simultaneous neuronal population activity and hence leverage this advanced recording technique in a convincing way.

      They use different population decoding strategies to help interpret the results.

      They also compare how decoders relying on the data-driven approach using dimensionality reduction of the full neural population space compare to decoders relying on more traditional ways to categorize neurons that are based on hypotheses about their function. Intriguingly, although the functionally identified neurons are a modest fraction of the population, decoders that only rely on this fraction achieve comparable decoding performance to those relying on the full population. Moreover, decoding weights for the full population did not allow the authors to reliably identify the functionally identified subpopulation.

      Weaknesses:

      No major weaknesses beyond a few, largely clarification issues, detailed below.

      We thank Reviewer 1 (R1) for this summary. The revised manuscript incorporates R1’s suggestions, as detailed below.

      Reviewer #2 (Public Review):

      Steinemann, Stine, and their co-authors studied the noisy accumulation of sensory evidence during perceptual decision-making using Neuropixels recordings in awake, behaving monkeys. Previous work has largely focused on describing the neural underpinnings through which sensory evidence accumulates to inform decisions, a process which on average resembles the systematic drift of a scalar decision variable toward an evidence threshold. The additional order of magnitude in recording throughput permitted by the methodology adopted in this work offers two opportunities to extend this understanding. First, larger-scale recordings allow for the study of relationships between the population activity state and behavior without averaging across trials. The authors’ observation here of covariation between the trial-to-trial fluctuations of activity and behavior (choice, reaction time) constitutes interesting new evidence for the claim that neural populations in LIP encode the behaviorally-relevant internal decision variable. Second, using Neuropixels allows the authors to sample LIP neurons with more diverse response properties (e.g. spatial RF location, motion direction selectivity), making the important question of how decision-related computations are structured in LIP amenable to study. For these reasons, the dataset collected in this study is unique and potentially quite valuable.

      However, the analyses at present do not convincingly support two of the manuscript’s key claims: (1) that “sophisticated analyses of the full neuronal state space” and “a simple average of $T_{\text{in}}^{\text{con}}$ neurons” yield roughly equivalent representations of the decision variable; and (2) that direction-selective units in LIP provide the samples of instantaneous evidence that these $T_{\text{in}}^{\text{con}}$ neurons integrate. Supporting claim (1) would require results from sophisticated population analyses leveraging the full neuronal state space; however, the current analyses instead focus almost exclusively on 1D projections of the data. Supporting claim (2) convincingly would require larger samples of units overlapping the motion stimulus, as well as additional control analyses.

      We thank the reviewer (R2) for their careful reading of our paper and the many useful suggestions.

      As detailed below, the revised manuscript incorporates new control analyses, improved quantification, and statistical rigor, which now provide compelling support for key claim #1. We do not regard claim #2 as a key claim of the paper. It is an intriguing finding with solid support, worthy of dissemination and further investigation. We have clarified the writing on this matter.

      Specific shortcomings are addressed in further detail below:

      (1) The key analysis-correlation between trial-by-trial activity fluctuations and behavior, presented in Figure 5 is opaque, and would be more convincing with negative controls. To strengthen the claim that the relationship between fluctuations in (a projection of) activity and fluctuations in behavior is significant/meaningful, some evidence should be brought that this relationship is specific - e.g. do all projections of activity give rise to this relationship (or not), or what level of leverage is achieved with respect to choice/RT when the trial-by-trial correspondence with activity is broken by shuffling.

      We do not understand why R2 finds the analysis opaque, but we are grateful for the lucid recommendations. The relationships between fluctuations in neural activity and behavior are indeed “specific” in the sense that R2 uses this term. In addition to the shuffle control, which destroys both relationships (Reviewer Figure 1), we performed additional control analyses that preserve the correspondence of neural signals and behavior on the same trial. We generated random coding directions (CDs) by either drawing weight vectors from a standard normal distribution or permuting the weights assigned to PC-1 in each session. The latter is the more conservative measure. Projections of the neural responses onto these random coding directions render 𝑆rand(𝑡). For these projections, the degree of leverage is effectively zero or greatly reduced. These analyses are summarized in a new Supplementary Figure S10. The bottom row of Figure S10 also addresses the question, “What degree of leverage and mediation would be expected for a theoretical decision variable?” This is accomplished by simulating decision variables using the drift-diffusion model fits in Figure 1c. The simulation is consistent with the leverage and (incomplete) mediation observed for the populations of $T_{\text{in}}^{\text{con}}$ neurons. For details see Methods, Simulated decision variables and Leverage of single-trial activity on behavior.

      (2) The choice to perform most analysis on 1D projections of population activity is not wholly appropriate for this unique type of dataset, limiting the novelty of the findings, and the interpretation of similarity between results across choices of projection appears circular:

      We disagree with the characterization of our argument as circular, but R2 raises several important points that will probably occur to other careful readers. We address them as subpoints 2.1–2.4, below. Importantly, we are neither claiming nor assuming that the LIP population activity is one-dimensional. We have revised the paper to avoid giving this impression. We are also not claiming that the average of Tin neurons (or the 1D projections) explains all features of the LIP population, nor would we expect it to, given the diversity of response fields across the population. Our objective is to identify the specific dimension within population activity that captures the decision variable (DV), which has been characterized successfully as a one-dimensional stochastic process—that is, a scalar function of time. We have endeavored to clarify our thinking on this point in the revised manuscript (e.g., lines 97–98, 103–104).

      (2.1) The bulk of the analyses (Figure 2, Figure 3, part of Figure 4, Figure 5, Figure 6) operate on one of several 1D projections of simultaneously recorded activity. Unless the embedding dimension of these datasets really does not exceed 1 (dimensionality using e.g. participation ratio in each session is not quantified), it is likely that these projections elide meaningful features of LIP population activity.

      We now report the participation ratio (4.4 ± 0.4, mean ± s.e. across sessions), and we state that the first 3 PCs explain 67.1±3.1% of the variance of time- and coherence-dependent signals used for the PCA. We agree that the 1D projections may elide meaningful features of LIP population activity. Indeed, we make this point through our analysis of the Min neurons. We do not claim that the 1D projections explain all of the meaningful features of LIP population activity. They do, however, reveal the decision variable, which is our main focus. These 1D signals contain features that correlate with events in the superior colliculus, summarized in Stine et al. (2023), attesting to their biological relevance.
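
      For readers unfamiliar with the participation ratio, it is conventionally defined from the eigenvalues $\lambda_i$ of the covariance matrix as $(\sum_i \lambda_i)^2 / \sum_i \lambda_i^2$, which equals $\mathrm{tr}(C)^2 / \mathrm{tr}(C^2)$ and so needs no explicit eigendecomposition. The sketch below is a minimal standard-library illustration of that definition, not the authors' code; the function names and the toy data are assumptions.

```python
def covariance_matrix(data):
    """Sample covariance of data given as a list of observations (rows) by units (columns)."""
    n_obs = len(data)
    n_units = len(data[0])
    means = [sum(row[j] for row in data) / n_obs for j in range(n_units)]
    centered = [[row[j] - means[j] for j in range(n_units)] for row in data]
    return [
        [sum(r[i] * r[j] for r in centered) / (n_obs - 1) for j in range(n_units)]
        for i in range(n_units)
    ]


def participation_ratio(cov):
    """Participation ratio tr(C)^2 / tr(C^2) of a symmetric covariance matrix."""
    trace = sum(cov[i][i] for i in range(len(cov)))
    # For a symmetric matrix, tr(C^2) is the sum of all squared entries.
    trace_of_square = sum(v * v for row in cov for v in row)
    return trace ** 2 / trace_of_square


# Toy data: two uncorrelated units with equal variance, so the ratio is 2.
toy = [[1.0, 0.0], [-1.0, 0.0], [0.0, 1.0], [0.0, -1.0]]
pr_toy = participation_ratio(covariance_matrix(toy))
```

      The ratio ranges from 1, when all variance lies along one dimension, to the number of units, when variance is spread equally across all dimensions.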

      (2.2) Further, the observed similarity of results across these 1D projections may not be meaningful/interpretable. First, the rationale behind deriving Sramp was based on the ramping historically observed in Tin neurons during this task, so should be expected to resemble Tin.

      The Reviewer is correct that we would expect 𝑆ramp to resemble the ramping observed in Tin neurons. We refer to this approach as hypothesis-driven. It captures the drift component of drift-diffusion. It is true that the $T_{\text{in}}^{\text{con}}$ neurons exhibit such ramps in their trial-averaged firing rates, but this does not guarantee that the single-trial population firing rates would manifest as drift-diffusion. Indeed, Latimer et al. (2015) concluded that the ramp-like averages comprise stepping from a low to a high firing rate on each trial at a random time. Therefore, while R2 is right to characterize the similarity of $T_{\text{in}}^{\text{con}}$ to the ramp direction in trial-averaged activity as unsurprising, their similarity on single trials is not guaranteed.

      (2.3) Second, Tin comprises the largest fraction of the neuron groups sampled during most sessions, so SPC1 should resemble Tin too. The finding that decision variables derived from the whole population’s activity reduce essentially to the average of Tin neurons is thus at least in part ’baked in’ to the approach used for deriving the decision variables.

      This is incorrect. The $T_{\text{in}}^{\text{con}}$ neurons constitute only 14.5% of the population, on average, across the sessions (see Table 1). This misunderstanding might contribute to R2’s concern about the importance of these neurons in shaping PC1. It is not simply because they are over-represented. Also, addressing R2’s concern about circularity, we would like to remind R2 that the selection of Tin neurons was based only on their spatial selectivity in the delayed saccade task. We do not see how it could be baked in or guaranteed that a simple average of these neurons (i.e. zero degrees of freedom) yields dynamics and behavioral correlations that match those produced by dimensionality-reduction techniques that (𝑖) have degrees of freedom equal to the number of neurons and (𝑖𝑖) are blind to the neurons’ spatial selectivity. We have additionally modified what is now Supplementary Figure S13 (old Supplementary Figure S8), which portrays the mean accuracy of choice decoders trained on the neural activity of all neurons, only Tin neurons, all but the Tin neurons, and all but Tin and Min neurons, respectively. Figure S13 now highlights how much more readily choice can be decoded from the small population of Tin neurons than the remainder of the population.

      (2.4) The analysis presented in Figure S6 looks like an attempt to demonstrate that this isn’t the case, but is opaque. Are the magnitudes of weights assigned to units in Tin larger than in the other groups of units with preselected response properties? What is their mean weighting magnitude, in comparison with the mean weight magnitude assigned to other groups? What is the null level of correspondence observed between weight magnitude and assignment to Tin (e.g. a negative control, where the identities of units are scrambled)?

      The revised Figure S6 (now Figure S9) displays more clearly that the weights assigned to $T_{\text{in}}^{\text{con}}$ and $T_{\text{in}}^{\text{ips}}$ neurons (purple & yellow, respectively) are larger in magnitude than those assigned to other neurons (gray). Author response table 1 shows a more detailed breakdown of the groups. Note that the length of the vector of weights is one. We are unsure what R2 means by “the null level of correspondence.” Perhaps it helps to know that the mean weight of the “other neurons” is close to zero for all four coding directions. However, it is the overlap of the weights and the relative abundance of non-Tin neurons that is more germane to the point we are making. To wit, knowing the weight (or percentile) of a neuron is a poor predictor that it belongs to the Tin category. This point is most clearly supported by the logistic regression (Fig. S9, bottom row). In other words, the large group of non-Tin neurons contribute substantially to all four coding directions examined in Figure S9. Thus, the similarity between Tin neurons and PC1 is not simply due to an over-representation of Tin neurons as suggested in item 2.3.

      Author response table 1.

      Mean weights assigned to neuron classes in four coding directions.

      (3) The principal components analysis normalization procedure is unclear, and potentially incorrect and misleading: Why use the chosen normalization window (±25ms around 100ms after motion stimulus onset) for standardizing activity for PCA, rather than the typical choice of mean/standard deviation of activity in the full data window? This choice would specifically squash responses for units with a strong visual response, which distorts the covariance matrix, and thus the principal components that result. This kind of departure from the standard procedure should be clearly justified: what do the principal components look like when a standard procedure is used, and why was this insufficient/incorrect/unsuitable for this setting?

      We used the early window because it is a robust measure of overall excitability, but we now use a more conventional window that spans the main epoch of our analyses, 200–600 ms after motion onset. This method yields results qualitatively similar to the original method. We are persuaded that this is the more sensible choice. We thank R2 for raising this concern.

      (4) Analysis conclusions would generally be stronger with estimates of variability and control analyses: This applies broadly to Figures 2-6.

      We have added estimates of variability and control analyses where appropriate.

      Figure 2 shows examples of single-trial signals. The variability is addressed in Figure 3a and the new Supplementary Figure S5.

      Figure 3 now contains error bars derived by bootstrapping (see Methods, Variance and autocorrelation of smoothed diffusion signals). We have also added Supplementary Figure S5, which substantiates the sublinearity claim using simulations.

      Figure 4: (i) We now indicate the s.e.m. of decoding accuracy (across sessions) by the shading in Figure 4a. (ii) The black symbols in new Supplementary Figure S8 show the mean ± s.e.m. for all pairwise comparisons shown in Figure 4d & e. (iii) Supplementary Figure S8 also summarizes two control analyses that deploy random coding directions (CDs) in neuronal state space. The upper row of Figure S8 compares the observed cosine similarity (CoSim), between the CD identified by the graph title and the other four CDs labeled along the abscissa, with values obtained with 1000 random CDs established by random permutations of the weight assignments. The brown symbols are the mean ± s.d. of the CoSim (N=1000). The error bars are smaller than the symbols. We use the cumulative distribution of CoSim under permutation to estimate p-values (p<0.001 for all comparisons). We used a similar approach to estimate the distribution of the analogous correlation statistics between signals rendered by random directions in state space (Figure S8, lower row). For additional details, please see Methods, Similarity of single-trial signals.
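
      The permutation control described for Figure 4 can be sketched as follows: shuffle the weight assignments of one coding direction many times, recompute the cosine similarity with a second direction, and read a p-value off the resulting null distribution. This is a generic illustration under stated assumptions, not the authors' implementation; the function names, the two-sided comparison, the add-one p-value estimate, and the example weights are all hypothetical.

```python
import math
import random


def cosine_similarity(u, v):
    """Cosine of the angle between two weight vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def permutation_p_value(cd_a, cd_b, n_perm=1000, seed=0):
    """Compare the observed cosine similarity with a null built by permuting cd_a's weights."""
    rng = random.Random(seed)
    observed = cosine_similarity(cd_a, cd_b)
    shuffled = list(cd_a)
    n_extreme = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)  # reassign weights to neurons at random
        if abs(cosine_similarity(shuffled, cd_b)) >= abs(observed):
            n_extreme += 1
    # Add-one estimate so the p-value is never exactly zero.
    return observed, (n_extreme + 1) / (n_perm + 1)


# Hypothetical weight vector over eight neurons, compared with itself.
weights = [0.9, -0.7, 0.5, -0.3, 0.2, -0.1, 0.05, 0.0]
observed_sim, p_value = permutation_p_value(weights, weights)
```

      Permuting weights, rather than drawing new ones, keeps the distribution of weight magnitudes fixed and changes only which neuron receives which weight.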

      Figure 5: The rigor of all claims associated with this figure is adduced from two control analyses and a simulation. The first control breaks the trial-by-trial correspondence between neural signals and behavior (Reviewer Figure 1). The second control shows that neural activity does not have substantial leverage on behavior when projected onto random directions in state space (Supplementary Figure S10, top). Simulations of decision variables using parameters derived from the fits to the behavioral data (Figure 1) support a degree of leverage and mediation comparable to the values observed for $S^{T_{\text{in}}^{\text{con}}}$ (Supplementary Figure S10, bottom). For additional details, please see Methods (Leverage of single-trial activity on behavior) and the reply to item 1, above.

      Figure 6: Panels c&d show estimates of variability across neurons and experimental sessions, respectively. The reported p-value is based on a permutation test (see Methods, Correlations between $M_{\text{in}}$ and $T_{\text{in}}^{\text{con}}$). The correlations shown in panel e (heatmap) are derived from pooled data across sessions. The reported p-value is based on a permutation test (see Methods, Correlations between $M_{\text{in}}$ and $T_{\text{in}}^{\text{con}}$).

      Reviewer #3 (Public Review):

      Summary:

      The paper investigates which aspects of neural activity in LIP of the macaque give rise to individual decisions (specificity of choice and reaction times) in single trials, by recording simultaneously from hundreds of neurons. Using a variety of dimensionality reduction and decoding techniques, they demonstrate that a population-based drift-diffusion signal, which relies on a small subset of neurons that overlap choice targets, is responsible for the choice and reaction time variability. Analysis of direction-selective neurons in LIP and their correlation with decision-related neurons ($T_{\text{in}}^{\text{con}}$ neurons) suggests that evidence integration occurs within area LIP.

      Strengths:

      This is an important and interesting paper, which resolves conflicting hypotheses regarding the mechanisms that underlie decision-making in single trials. This is made possible by exploiting novel technology (Primatepixels recordings), in conjunction with state-of-the-art analyses and well-established dynamic random dot motion discrimination tasks.

      General recommendations:

      (1) Please tone down causal language. You present compelling correlative evidence for the idea that LIP population activity encodes the drift-diffusion DV. We feel that claims beyond that (e.g., “Single-trial drift-diffusion signals control the choice and decision time”) would require direct interventions, and are only partially supported by the current evidence. Further examples are provided in point 1) of Reviewer 1 below.

      We have adopted the recommendation to “tone down the causal language.” Throughout the manuscript, we strive to avoid conveying the false impression that the present findings provide causal support for the decision mechanism. However, other causal studies of LIP support causality in the random dot motion task (Hanks et al., 2006; Jeurissen et al., 2022). It is therefore justifiable to use terms that imply causality in statements intended to convey hypotheses about mechanism. We agree that we should not give the false impression that the present support for said mechanism is adduced from causal perturbations in this study, as there were none.

      (2) Please provide a commonly used, data-driven quantification of the dimensionality of the population activity – for example, using participation ratio or the number of PCs explaining 90 % of the variance. This will help readers evaluate the conclusions about the dimensionality of the data.

      Principal component analysis reveals a participation ratio of 4.4 ± 0.4 (mean ±s.e., across sessions), and the first 3 PCs explain 67.1 ± 3.1 percent of the variance. The dimensionality of the data is low, but greater than one. We state this in Methods (Principal Component Analysis) and in Results (Single-trial drift-diffusion signals approximate the decision variable, lines 200–201).

      (3) Please justify the normalization procedure used for PCA: Why use the chosen normalization window (±25ms around 100ms after motion stimulus onset) for standardizing activity for PCA, rather than the more common quantification of mean/standard deviation across the full data window? What do the first principal components look like when the latter procedure is used?

      We now use a more conventional window that spans the main epoch of our analyses, 200–600 ms after motion onset. This method yields results qualitatively similar to the original method. We are persuaded that this is the more sensible choice.

      (4) Please provide estimates of variability for variance and autocorrelation in Fig. 3 (e.g., through bootstrapping). Further, simulations could substantiate the claim about the expected sub-linearity at later time points (Fig. 3a) due to the upper stopping bound and limited firing rate range.

      We thank the reviewers for these helpful recommendations. The revised Fig. 3 now contains error bars derived by bootstrapping (see Methods, Variance and autocorrelation of smoothed diffusion signals). We have also added Supplementary Figure S5, which substantiates the sub-linearity claim using simulations.
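      For readers unfamiliar with the approach, a bootstrap standard error can be sketched as below. The exact procedure is described in the Methods and is not reproduced here; this sketch assumes that trials are resampled with replacement and that the statistic (for example, the across-trial variance at one time point) is recomputed on each resample. The function name `bootstrap_se` and its defaults are hypothetical.

```python
import random
import statistics


def bootstrap_se(values, statistic, n_boot=1000, seed=0):
    """Standard error of `statistic` estimated by resampling `values` with replacement."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    n = len(values)
    estimates = []
    for _ in range(n_boot):
        resample = [values[rng.randrange(n)] for _ in range(n)]
        estimates.append(statistic(resample))
    # The spread of the bootstrap estimates approximates the standard error.
    return statistics.stdev(estimates)
```

      For example, `bootstrap_se(signal_at_t, statistics.variance)` would give an error bar for the variance at one time point, where `signal_at_t` is a placeholder for the single-trial signal values at that time.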

      (5) Please add controls and estimates of variability for decoding across sessions in Fig. 4: what are the levels of within-trial correlation/cosine similarity for random coding directions? What is the variability in the estimates of values shown in a/d/e?

      We have addressed each of these items. (1) Figure 4a now shows the s.e.m. of decoding accuracy (across sessions). (2) Regarding the variability of estimates shown in Figure 4d & e, the standard errors are displayed in the new supplementary Figure S8. It makes sense to show them there because there is no natural way to represent error on the heat maps in Figure 4, and Figure S8 concerns the comparison of the values in Figure 4d&e to values derived from random coding directions. (3) Random coding directions lead to values of cosine similarity and within-trial correlation that do not differ significantly from zero. We show this in several ways, summarized in our reply to Public Review item 4. Additional details are in the revised manuscript (Methods, Similarity of single-trial signals) and the new Supplementary Figure S8.

      (6) Please perform additional analysis to strengthen the claim from Fig. 6, that Min represents the integrand and not the integral. The analysis in Fig. 6d could be repeated with the integral (cumulative sum) of the single-trial Min signals. Does this yield an increase in leverage over time?

      The short answer is, yes in part. Reviewer Figure 2a provides support for leverage of the integral on choice, and this leverage, like $S_{\mathrm{Tin}}^{\mathrm{con}}(t)$, increases as a function of time. The effect is present in all seven sessions that have both $M_{\mathrm{in}}^{\mathrm{left}}$ and $M_{\mathrm{in}}^{\mathrm{right}}$ neurons (all 𝑝 < 1𝑒 − 10). However, as shown in panel b, the same integral fails to demonstrate more than a hint of leverage on RT. All correlations are barely negative, and the magnitude does not increase as a function of time. We suspect—but cannot prove—that this failure arises because of limited power and the expected weak effect. Recall that the mediation analysis of RT is restricted to longer trials. Moreover, the correlation between the Min difference and the Tin signal is less than 0.1 (heatmap, Fig. 6e), implying that the Min difference explains less than 1% of the variance of 𝑆Tin(𝑡). We considered including Reviewer Figure 2 in the paper, but we feel it would be disingenuous (cherry-picking) to report only the positive outcome of the leverage on choice. If the editors feel strongly about it, we would be open to including it, but leaving these analyses out of the revised manuscript seems more consistent with our effort to deëmphasize this finding. In the future, we plan to record simultaneously from populations of MT and LIP neurons (Min and Tin, of course) and optimize Min neuron yield by placing the RDM stimulus in the periphery.

      (7) Please describe the complete procedure for determining spatially-selective activity. E.g.: What response epoch was used, what was the spatial layout of the response targets, were responses to all ipsi- vs contralateral targets pooled, what was the spatial distribution of response fields relative to the choice targets across the population?

      We thank the reviewers for pointing out this oversight. We now explain this procedure in the Methods (lines 629–644):

      Neurons were classified post hoc as Tin by visual inspection of spatial heatmaps of neural activity acquired in the delayed saccade task. We inspected activity in the visual, delay, and perisaccadic epochs of the task. The distribution of target locations was guided by the spatial selectivity of simultaneously recorded neurons in the superior colliculus (see Stine 2023 for details). Briefly, after identifying the location of the SC response fields, we randomly presented saccade targets within this location and seven other, equally spaced locations at the same eccentricity. In monkey J we also included 1–3 additional eccentricities, spanning 5–16 degrees. Neurons were classified as Tin if they displayed a clear, spatially-selective response in at least one epoch to one of the two locations occupied by the choice targets in the main task. Neurons that switched their spatial selectivity in different epochs were not classified as Tin. The classification was conducted before the analyses of activity in the motion discrimination task. The procedure was meant to mimic those used in earlier single-neuron studies of LIP (e.g., Roitman & Shadlen 2002) in which the location of the choice targets was determined online by the qualitative spatial selectivity of the neuron under study. The $T_{\mathrm{in}}^{\mathrm{con}}$ neurons in the present study were highly selective for either the contralateral or ipsilateral choice target used in the RDM task (AUC = 0.89±0.01; 𝑝 < 0.05 for 97% of neurons, Wilcoxon rank sum test). Given the sparse sampling of saccade target locations, we are unable to supply a quantitative estimate of the center and spatial extent of the RFs.

      (8) Please clarify if a neuron could be classified as both Tin and Min. Or were these categories mutually exclusive?

      These categories are mutually exclusive. If a neuron has spatially-selective persistent activity, as defined by the method described above, it is classified as a Tin neuron and not as an Min neuron even if it also shows motion-selective activity during passive motion viewing. We now specify this in the Methods (lines 831–832).

      Reviewer #1 (Recommendations For The Authors):

      𝑅∗1.1a Causal language (Line 23-24): “population activity represents […] drift” and “we provide direct support for the hypothesis that drift-diffusion signal is the quantity responsible for the variability in choice and RT” reads at first sight as if the authors claim that they present evidence for a causal effect of LIP activity on choice. The authors are otherwise nuanced and careful to point out that their evidence is correlational. What seems to be meant is that the population activity/drift-diffusion signal ”approximates the DV that gives rise to the choices […]” (cf. line 399). I would recommend using such alternative phrasing to avoid confusion (and the typically strong reactions by readers against misleading causal statements).

      We have adopted the reviewer’s recommendation and have modified the text throughout to reduce causal language. See our response to General Recommendation 1.

      𝑅∗1.1b Relatedly, any discussion about the possibility of LIP being causally involved in evidence integration (e.g. lines 429-445 [Au: now 462–478]) should also comment on the possibility of a distributed representation of the decision variable given that neural correlates of the DV have been reported in several areas including PFC, caudate and FEF.

      We believe this is possible. However, we hope to avoid discussions about causality given that it is not a focus of the paper. Although it is somewhat tangential, we have shown elsewhere that LIP is causal in the sense that causal manipulations affect behavior, but it is also true that causality does not imply necessity, and similarly, lack of necessity does not imply “only correlation.” Regarding distributed representations, it is worth keeping in mind the cautionary counter-example furnished by the SC study (Stine et al., 2023). The firing rates measured by averaging over trials are similar in SC and LIP; both manifest as coherence and direction-dependent ramps, leading to the suggestion that they form a distributed representation of the decision variable. With single-trial resolution, we now know that LIP and SC exhibit distinct dynamics—drift-diffusion and bursting, respectively. It remains to be seen whether the single-trial resolution achievable by simultaneous Neuropixels recordings from prefrontal areas and LIP reveals shared or distinct dynamics.

      𝑅∗1.2 How was the spatially selective activity determined? The classification of Tin neurons is critical to this study - how was their spatial selectivity determined? Please describe this in similar detail as the description of direction selectivity on lines 681-690 [Au: now 824–832]. E.g.: what response epoch was used, what was the spatial layout of the response targets, were responses to all ipsi- vs contralateral targets pooled, and what was the spatial distribution of response fields relative to the choice targets across the population?

      We now explain the selection procedure in Methods (lines 629–644). Please see our reply to General Recommendation 7, above.

      𝑅∗1.3 Could a neuron be classified as both Tin and Min, or were these categories mutually exclusive? Please clarify. (This goes beyond the scope of the current study: but did the authors find evidence for topographic organization or clustering of these categories of neurons?)

      These categories are mutually exclusive. Please see our response to General Recommendation 8, above.

      𝑅∗1.4 Contrary to the statement on line 121, the trial averages in Fig. 2a, 2b show coherence dependency at the time of the saccade in saccade-aligned traces for the coding strategies, except for STin (fig. 2c). Is this a result of the choice for t1 (= 0.1s)? (The authors may want to change their statement on line 121.) Relatedly, do the population responses for the two coding strategies Sramp and SPC1 depend on the epoch used to derive weights for individual neurons?

      We have revised the description to accommodate R2’s observation. 𝑆ramp retains weak coherence-dependence before saccades towards the choice target contralateral to the recording site. This was true in four of the eight sessions. For 𝑆PC1, there is no longer a coherence dependency for the Tin choices, owing to the change in normalization method (see revised Figure 2b).

      We also corrected an error in the Methods section. Specifically, the ramp ends at 𝑡1 = 0.05 s before the time of the saccade, not 𝑡1 = 0.1 s. While we no longer emphasize the similarity of traces aligned to saccade, it is reasonable to find issue with the observation that they retain a dependency on coherence (𝑆ramp only) because, according to theory, traces associated with Tin choices should reach a common positive threshold at decision termination. That said, for the Ramp direction there may be a reason to expect this discrepancy from theory. The deterministic part of drift-diffusion includes an urgency signal that confers positive convexity to the deterministic drift. This accelerating nonlinearity is not captured by the ramp, and it is more prominent at longer decision times, thus low coherences. We do not share this interpretation in the revised manuscript, in part because retention of coherence dependency is present in only half the sessions (see Reviewer Figure 3). The correction to the definition of 𝑡1 also provides an opportunity to address R2’s final question (“Relatedly,…?”). This particular variation in 𝑡1 does not affect 𝑆ramp, and 𝑆PC1 no longer retains coherence dependency for Tin choices. Note that our choice of 𝑡0 and 𝑡1 is based on the empirical observation that the ramping activity in response averages of Tin neurons typically begins 200 ms after motion onset and ends 50–100 ms before initiation of the saccadic choice. The starting time (𝑡0) is also supported by the observation that the decoding accuracy of a choice-decoder begins to diverge from chance at this time (Figure 4a).

      𝑅∗1.5 It is intriguing that Sramp and SPC1 show dynamics that look so similar (fig. 2a, 2b). How do the weights assigned to each neuron in both strategies compare across the population?

      The weights assigned to each neuron are very similar across the two strategies, as indicated by their cosine similarity (0.65 ± 0.04, mean ± s.e.m. across sessions).
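      As a reminder of what this number measures, cosine similarity between two weight vectors can be computed as in the sketch below. The function name is hypothetical, and the vectors stand in for the per-neuron weights of two coding directions from the same session.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length weight vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)
```

      A value of 1 means the two coding directions are identical up to scale, 0 means they are orthogonal, and -1 means they point in opposite directions.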

      𝑅∗1.6 Tin neurons, which show dynamics closely resembling different coding directions (fig. 2) and the decoders do not have weights that can distinguish them from the rest of the population in each of these analyses (fig. S7). Is it fair to interpret these findings as evidence for broad decision-related co-variability in the recorded neural population in LIP?

      Yes, our results are consistent with this interpretation. However, it is worth reiterating that decoding performance drops considerably when Tin neurons are not included (see Supplementary Figure S13). Thus, this broad decision-related co-variability is present but weak.

      𝑅∗1.7 It is intriguing that the decoding weights of the different decoders did not allow the authors to reliably identify Tin neurons. Could this be, in part, due to the low dimensionality of the population activity and task that the animals are presumably overtrained on? Or do the authors expect this finding to hold up if the population activity and task were higher dimensional?

      Great question! We can only speculate, but it seems possible that a more complex, “higher dimensional” task could make it easier to identify Tin neurons. For example, a task with four choices instead of two may decrease correlations among groups of neurons with different response fields. We have added this caveat to the discussion (lines 459–461). One minor semantic objection: The animal has learned to perform a highly contrived task at low signal-to-noise. The animal is well-trained, not over-trained.

      𝑅∗1.8 Lines 135-137 [Au: now 141–142]: The similarity in the single trial traces from different coding strategies (fig. 2a-2c, left) is not as evident to me as the authors suggest. It might be worthwhile computing the correlation coefficients between individual traces for each pair of strategies and reporting the mean correlation to support the author’s point.

      We report the mean correlation between single-trial signals generated by the chosen dimensionality reduction methods in Figure 4e. We show the variability in this measure in Supplementary Figure S8. We have also adjusted the opacity of the single-trial traces in Figure 2, left.

      𝑅∗1.9 Minor/typos:

      -line 74: consider additionally citing Hyafil et al. 2023.

      -line 588: ”that were strongly correlated”?

      -line 615: ”were the actual drift-diffusion process were...”.

      -line 717: ”a causal influence” -> ”no causal influence”.

      Fig. 6: panel labels e vs d are swapped between the figure and caption.

      Fig. 3c: labels r1,3 & r2,3 are flipped.

      We have addressed all of these items. Thank you.

      Reviewer #2 (Recommendations For The Authors):

      𝑅∗2.1 (Figure 2) Determine whether restricting the analysis to 1D projections of the data is a suitable approach given the actual dimensionality of the datasets being analyzed:

      - Should show some quantification of the dimensionality of the recorded activity; could do this by quantifying the dimensionality of population activity in each session, e.g. with participation ratio or related measures (like # PCs to explain some high proportion of the variance, e.g. 90 %). If much of the variation is not described in 1 dimension, then the paper would benefit from some discussion/analysis of the signals that occupy the other dimensions.

      We now report the participation ratio (4.4 ± 0.4, mean ±s.e. across sessions), and we state that the first 3 PCs explain 67.1 ± 3.1% of the variance of the time- and coherence-dependent signals used for the PCA (mean ±s.e). We agree that the 1D projections may elide meaningful features of LIP population activity. Indeed, we make this point through our analysis of the Min neurons. To reiterate our response above, we do not claim that the 1D projections explain all of the meaningful features of LIP population activity. They do, however, reveal the decision variable, which is our main focus. These 1D signals contain features that correlate with events in the superior colliculus, summarized in Stine et al. (2023), attesting to their biological relevance.

      The Reviewer is correct that our approach presupposes a linear embedding of the 1D decision variable in the population activity. In other words, a nonlinear representation of the 1D decision variable in population activity could have an embedding dimensionality greater than 1, and there may well be a non-linear method that reveals this representation. To test this possibility, we decoded choice on each trial from population activity using (1) a linear decoder (logistic classifier) or (2) a multi-layer neural network, which can exploit non-linearities. We found that, for each session, the two decoders performed similarly: the neural network outperforms the logistic decoder (barely) in just one session. The analysis suggests that the assumption of linear embedding of the decision variable is justified. We hope this analysis convinces the reviewer that “sophisticated analyses of the full neuronal state space” and “a simple average of [$T_{\mathrm{in}}^{\mathrm{con}}$] neurons” do indeed yield roughly equivalent representations of the decision variable. We have included the results of this analysis in Supplementary Figure S12. See also item 2 of the Public response.

      𝑅∗2.2 (Figure 3) Add estimates of variability for variance and autocorrelation through time from single-trial signals:

      –   E.g. by bootstrapping. Would be helpful for making rigorous the discussion of when the deviation from the theory is outside what would be expected by chance, even if it doesn’t change the specific conclusions here.

      –   If possible, it would help (by simulations, or maybe an added reference if it exists) to substantiate the claim about the expected sub-linearity at later time-points (Figure 3a) due to the upper stopping bound and limited firing rate range.

      We thank the reviewer for this helpful comment. The revised Fig. 3 now contains error bars derived by bootstrapping (see Methods, §Variance and autocorrelation of smoothed diffusion signals). We have also added Supplementary Figure S5, which substantiates the sub-linearity claim using simulations.

      𝑅∗2.3 (Figure 4) Add controls and estimates of variability for decoding across sessions:

      –   As a baseline - what is the level of within-trial correlation/cosine similarity when random coding directions are used?

      –   What is the variability in the estimates of values shown in a/d/e?

      We have addressed each of these items. (1) Figure 4a now shows the s.e.m. of decoding accuracy (across sessions). (2) Regarding the variability of estimates shown in Figure 4d & e, the standard errors are displayed in the new Supplementary Figure S8. It makes sense to show them there because (i) there is no natural way to represent error on the heat maps in Figure 4, and (ii) S8 concerns the comparison of the values in Figure 4d & e to values derived from random coding directions. (3) Random coding directions lead to values of cosine similarity and within-trial correlation that do not differ significantly from zero. We show this in several ways, summarized in our reply to Public Review item 4. Additional details are in the revised manuscript (Methods: Similarity of single-trial signals) and the new Supplementary Figure S8. We also provide this information in response to Recommendation 5, above.

      𝑅∗2.4 (Figure 5) Add negative controls and significance tests to support claims about trends in leverage:

      –   What is the level of increase in leverage attained from random 1D projections of the data, or other projections where the prior would be no leverage?

      –   What is the range of leverage values fit for a simulated signal with a ground-truth of no trend?

      We have added two control analyses. In addition to a shuffle control, which destroys the relationship (Reviewer Figure 1), we performed additional analyses that preserve the correspondence of neural signals and behavior on the same trial. We generated random coding directions (CDs) by establishing weight-vectors that were either chosen from a Normal distribution or by permuting the weights assigned to PC-1 in each session. The latter is the more conservative measure. Projections of the neural responses onto these random coding directions render 𝑆rand(𝑡). For these signals, the degree of leverage is effectively zero or very much reduced. These analyses are summarized in a new Supplementary Figure S10. The distributions of our test statistics (e.g., leverage on choice and RT) under the variants of the null hypothesis also support traditional metrics of statistical significance. Figure S10 (bottom row) also provides an approximate answer to the question: What degree of leverage and mediation would be expected for a theoretical decision variable? Briefly, we simulated 60,000 trials using the race model that best fits the behavioral data of monkey M. For any noise-free representation of a Markovian integration process, the leverage of an early sample of the DV on behavior would be mediated completely by later activity as the latter sample—up to the time of commitment—subsumes all variability captured by the earlier sample. We, therefore, generated 𝑆sim(𝑡) by first subsampling the simulated data to match the trial numbers of each session. To evaluate a DV approximated from the activity of 𝑁 $T_{\mathrm{in}}^{\mathrm{con}}$ neurons per session rather than the true DV represented by the entire population, we generated 𝑁 noisy instantiations of the signal for each of the subsampled, simulated trials. The noisy decision variable, 𝑆sim(𝑡), is the mean activity of these 𝑁 noise-corrupted signals. The simulation is consistent with the leverage and incomplete mediation observed for the populations of $T_{\mathrm{in}}^{\mathrm{con}}$ neurons. For additional details, see Methods (§Leverage of single-trial activity on behavior) and the caption of Supplementary Figure S10. See also our response to item 1 of the Public Response.
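      The two ways of constructing a random coding direction described above can be sketched as follows. This is an illustration under stated assumptions rather than the authors' code: the function names are hypothetical, the Normal draw is assumed to be standard (zero mean, unit variance), and `activity` is assumed to be a list of time samples, each holding one standardized value per neuron.

```python
import random


def permuted_direction(weights, rng):
    """Random CD that keeps the PC-1 weight distribution but shuffles it across neurons."""
    shuffled = list(weights)  # copy so the original weights are untouched
    rng.shuffle(shuffled)
    return shuffled


def gaussian_direction(n_neurons, rng):
    """Random CD with weights drawn independently from a standard Normal (assumed)."""
    return [rng.gauss(0.0, 1.0) for _ in range(n_neurons)]


def project(activity, weights):
    """Weighted sum across neurons at each time sample, giving a 1D signal S_rand(t)."""
    return [sum(w * x for w, x in zip(weights, sample)) for sample in activity]
```

      Because neither construction reorders trials, the projected signal and the behavior still come from the same trial, which is the property that distinguishes these controls from the shuffle control.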

      𝑅∗2.5 The analysis is performed across several signed coherence levels, with data detrended for each signed coherence and choice to enable comparison of fluctuations relative to the relevant baseline; are results similar for the different coherences?

      The results are qualitatively similar for individual coherences. There is less power, of course, because there are fewer trials. The analyses cannot be performed for coherences ≥ 12.8% because there are not enough trials that satisfy the inclusion criteria (presence of left and right choice trials with RT ≤ 670 ms). Nonetheless, leverage on choice and RT is statistically significant for 27 of the 30 combinations of motion strengths < 12.8% × three signals (𝑆ramp, 𝑆PC1 and 𝑆Tin) × behavioral measures (RT and choice) (RT: all 𝑝 < 0.008, Fisher-z; choice: all 𝑝 < 0.05, t-test). The three exceptions are trials with 6.4% coherence rightward motion, which do not correlate significantly with RT on leftward choice trials. Reviewer Figure 4 shows the results of the leverage and mediation analyses, using only the 0% coherence trials.

      𝑅∗2.6 (Figure 6) Additional analysis to strengthen the claim that Min represents the integrand and not the integral:

      a. Repeating the analysis in Figure 6d with the integral (cumulative sum) of the single-trial Min signals and instead observing a significant increase in leverage over time would be strong evidence for this interpretation. If you again see no increase, then it suggests that the activity of these units (while direction selective) may not be strongly yoked to behavior. This scenario (no increasing leverage of the integral of Min on behavior through time) also raises an intriguing alternative possibility: that the noise driving the ’diffusion’ of drift-diffusion here may originate in the integrating circuit, rather than just reflecting the complete integration of noise in the stream of evidence itself.

      b. Repeating the analysis in Figure 6d with the projection of the M subspace onto its own first PC (e.g. take the union of units {$M_{\mathrm{in}}^{\mathrm{right}}$, $M_{\mathrm{in}}^{\mathrm{left}}$} [our ], do PCA just on those units’ single-trial activities, identify the first PC, and project those activities on that dimension to obtain SPC1-M).

      c. Ameliorating the sample-size limitation by relaxing the criteria for inclusion in Min - performing the same analyses shown, but including all units with visual RFs overlapping the motion stimulus, irrespective of their direction selectivity.

      a. Reviewer Figure 2a provides support for leverage of the integral on choice, and this leverage, like $S_{\mathrm{Tin}}^{\mathrm{con}}(t)$, increases as a function of time. The effect is present in all seven sessions that have both $M_{\mathrm{in}}^{\mathrm{left}}$ and $M_{\mathrm{in}}^{\mathrm{right}}$ neurons (all 𝑝 < 1𝑒 − 10). However, as shown in panel b, the same integral fails to demonstrate more than a hint of leverage on RT (all correlations are negative) and the magnitude does not vary as a function of time. We suspect—but cannot prove—that this failure arises because of limited power and the expected weak effect. Recall that the mediation analysis of RT is restricted to longer trials and that the correlation between the Min difference and the Tin signal is less than 0.1 over the heatmap in Fig. 6e, implying that the Min difference explains less than 1% of the variance of 𝑆Tin(𝑡). We considered including Reviewer Figure 2 in the paper, but we feel it would be disingenuous (cherry-picking) to report only the positive outcome of the leverage on choice. If the editors feel strongly about it, we would be open to including it, but leaving these analyses out of the revised manuscript seems more consistent with our effort to deëmphasize this finding. In the future, we plan to record simultaneously from populations of MT and LIP neurons (Min and Tin, of course) and optimize Min neuron yield by placing the RDM stimulus in the periphery. We also provide this information in response to Recommendation (6) above.

      b.  We tried the reviewer’s suggestion to apply PCA to the union of Min neurons ($M_{\mathrm{in}}^{\mathrm{right}}$, $M_{\mathrm{in}}^{\mathrm{left}}$), fully expecting PC1 to comprise weights of opposite sign for the right and left preferring neurons, but that is not what we observed. Instead, the direction selectivity is distributed over at least two PCs. We think this is a reflection of the prominence of other signals, such as the strong visual response and normalization signals (see Shushruth et al., 2018). In the spirit of the reviewer’s suggestion, we also established an “evidence coding direction” using a regression strategy similar to the Ramp CD applied to the union of Min neurons. The strategy produced a coding direction with opposite signed weights dominating the right and left subsets. The projection of the neural data on this evidence CD yields a signal similar to the difference variable used in Fig. 6e (i.e., signals that are approximately constant firing rates vs time and scale as a function of signed coherence). These unintegrated signals exhibit weak leverage on choice and RT, consistent with Figure 6d. However, the integrated signal has leverage on choice but not RT, similar to the integral of the difference signal in Reviewer Figure 2.

      c.   We do not understand the motivation for this analysis. We could apply PCA or dPCA (or the regression approach, described above) to the population of units with RFs that overlap the motion stimulus, but it is hard to see how this would test the hypothesis that direction-selective neurons similar to those in area MT supply the momentary evidence. As mentioned, we have very few Min neurons (as few as two in session 3). Future experiments that place the motion stimulus in the periphery would likely increase the yield of Min neurons and would be better suited to study this question. As such, we do not see the integrand-like responses of Min neurons as a major claim of the paper. Instead, we view it as an intriguing observation that deserves follow-up in future experiments, including simultaneous recordings from populations of MT and LIP neurons (Min and Tin, of course). We have softened the language considerably to make it clear that future work will be needed to make strong claims about the nature of Min neurons.

      𝑅∗2.7 Other questions: Figure 2c is described as showing the average firing rate of units in $T_{\mathrm{in}}^{\mathrm{con}}$ on single trials, but must also incorporate some baseline subtraction (as the shown traces dip into negative firing rates). What baseline is subtracted? Are these residual signals, as described for later figures, or is a different method used? (Presumably, a similar procedure is used also for Figure 2a/b, given that all single-trial traces begin at 0.) Is the baseline subtraction justified? If the dataset really does reflect the decision variable with single-trial resolution, eliminating the baseline subtraction when visualizing single-trial activity might actually help to make the point clearer: trials which (for any reason) begin with a higher projection on the particular direction that furnishes the DV would be predicted to reach the decision bound, at any fixed coherence, more quickly than trials with a smaller projection onto this direction.

      We thank the reviewer for this comment. For each trial, the mean activity between 175 ms and 225 ms after motion onset was subtracted when generating the single-trial traces. The baseline subtraction was only applied for visualization to better portray the diffusion component in the signal. Unless otherwise indicated, all analyses are computed on non-baseline corrected data. We now describe in the caption of Figure 2 that “For visualization, single-trial traces were baseline corrected by subtracting the activity in a 50 ms window around 200 ms.” Examples of the raw traces used for all follow-up analyses are displayed in Reviewer Figure 6.
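      The visualization-only baseline correction described above amounts to the following. This is a minimal sketch: the function name and argument layout are hypothetical, and times are assumed to be in seconds relative to motion onset.

```python
def baseline_correct(times, trace, start=0.175, stop=0.225):
    """Subtract the mean of `trace` within [start, stop] s after motion onset.

    `times` and `trace` are equal-length sequences for a single trial.
    """
    window = [y for t, y in zip(times, trace) if start <= t <= stop]
    if not window:
        raise ValueError("no samples fall inside the baseline window")
    baseline = sum(window) / len(window)
    return [y - baseline for y in trace]
```

      Applying this to each trial forces every trace to start near zero at about 200 ms, which is why it is suitable for display but would remove the trial-to-trial offsets that the reviewer points out could be informative.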

      Reviewer #3 (Recommendations For The Authors):

      I only have a few comments to make the paper more accessible:

      𝑅∗3.1 I struggle to understand how the linear fitting from -1 to 1 was done. More detail about how the single cell single-trial activity was generated to possibly go from -1 to 1 or do I completely misunderstand the approach? I assume the data standardization does that job?

      We have rephrased and added clarifying detail to the section describing the derivation of the ramp signal in the Methods (Ramp direction).

      We applied linear regression to generate a signal that best approximates a linear ramp, on each trial, 𝑖, that terminates with a saccade to the choice-target contralateral to the hemisphere of the LIP recordings. The ramps are defined in the epoch spanning the decision time: each ramp begins at 𝑓𝑖(𝑡0) = −1, where 𝑡0 = 0.2 s after motion onset, and ends at 𝑓𝑖(𝑡1) = 1, where 𝑡1 = 𝑡sac − 0.05 s (i.e., 50 ms before saccade initiation). The ramps are sampled every 25 ms and concatenated using all eligible trials to construct a long saw-tooth function (see Supplementary Figure S2). The regression solves for the weights assigned to each neuron such that the weighted sum of the activity of all neurons best approximates the saw-tooth. We constructed a time series of standardized neural activity, sampled identically to the saw-tooth. The spike times from each neuron are represented as delta functions (rasters) and convolved with a non-causal 25 ms boxcar filter. The mean and standard deviation of all sampled values of activity were used to standardize the activity for each neuron (i.e., Z-transform). The coefficients derived by the regression establish the vector of weights that define 𝑆ramp. The algorithm ensures that the population signal 𝑆ramp(𝑡), but not necessarily individual neurons, has amplitudes ranging from approximately −1 to 1.
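      The regression target described above can be sketched as follows. This is an illustration only: the function names are hypothetical, saccade times are assumed to be measured in seconds from motion onset, and the regression itself (solving for the neuron weights) is not shown.

```python
def ramp_target(t_sac, t0=0.2, pre_sac=0.05, dt=0.025):
    """Per-trial ramp rising linearly from -1 at t0 to +1 at t1 = t_sac - pre_sac.

    Returns a list of (time, value) pairs sampled every `dt` seconds.
    """
    t1 = t_sac - pre_sac
    if t1 <= t0:
        raise ValueError("trial too short to define a ramp")
    # Small tolerance guards against floating-point error in the division.
    n_steps = int((t1 - t0) / dt + 1e-9)
    samples = []
    for k in range(n_steps + 1):
        t = t0 + k * dt
        samples.append((t, -1.0 + 2.0 * (t - t0) / (t1 - t0)))
    return samples


def sawtooth(saccade_times):
    """Concatenate the per-trial ramp values into one long saw-tooth target."""
    values = []
    for t_sac in saccade_times:
        values.extend(v for _, v in ramp_target(t_sac))
    return values
```

      Because the target spans −1 to 1 on every trial regardless of its duration, the fitted population signal inherits that range, which is the point made in the final sentence above.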

      𝑅∗3.2 It is difficult to understand how the urgency signal is derived, to then generate fig S4.

      The urgency signal is estimated by averaging 𝑆𝑥(𝑡) at each time point relative to motion onset, using only the 0% coherence trials. We have clarified this in the caption of Supplementary Figure S4.
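      As a minimal sketch of this estimate, assuming the single-trial signals are stored as a trials-by-time array aligned to motion onset and padded with NaN after each trial ends (the padding convention and all names here are assumptions for illustration):

```python
import numpy as np


def urgency_signal(signal, coherence):
    """Average S_x(t) over 0%-coherence trials at each time point.

    signal    : array, shape (n_trials, n_timepoints), NaN where a trial
                has already ended (assumed convention)
    coherence : array, shape (n_trials,), motion strength of each trial
    """
    signal = np.asarray(signal, dtype=float)
    zero_coh = np.asarray(coherence) == 0
    if not zero_coh.any():
        raise ValueError("no 0%-coherence trials available")
    # nanmean ignores trials that have already terminated at a given time.
    return np.nanmean(signal[zero_coh], axis=0)
```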

      Author response image 1.

      Shuffle control for Fig. 5. Breaking the within-trial correspondence between neural signal, 𝑆(𝑡), and choice suppresses leverage to near zero.

      Author response image 2.

      Leverage of the integrated difference signal on choice and RT. Traces are the average leverage across seven sessions. Same conventions as in Figure 5.

      Author response image 3.

      Trial-averaged 𝑆ramp activity during individual sessions. Same as Figure 2b for individual sessions for Monkey M (left) and Monkey J (right). The figure is intended to illustrate the consistency and heterogeneity of the averaged signals. For example, the saccade-aligned averages lose their association with motion strength before left (contra) choices in sessions 1, 2, 5, and 6 but retain the association in sessions 3, 4, 7, and 8.

      Author response image 4.

      Drift-diffusion signals have measurable leverage on choice and RT even when only 0%-coherence trials are included in the analysis.

      Author response image 5.

      Raw single-trial activity for three types of population averages. Representative single-trial activity during the first 300 ms of evidence accumulation using two motion strengths: 0% and 25.6% coherence toward the left (contralateral) choice target. Unlike in Figure 2 in the paper, single-trial traces are not baseline corrected by subtracting the activity in a 50 ms window around 200 ms. We highlight a number of trials with thick traces and these are the same trials in each of the rows.

    1. Author response:

      The following is the authors’ response to the previous reviews.

      Public Reviews: 

      Reviewer #1 (Public review):

      Summary:

      The investigators in this study analyzed the dataset assembly from 540 Salmonella isolates, and those from 45 recent isolates from Zhejiang University of China. The analysis and comparison of the resistome and mobilome of these isolates identified a significantly higher rate of cross-region dissemination compared to localized propagation. This study highlights the key role of the resistome in driving the transition and evolutionary history of S. Gallinarum.

      Strengths:

      The isolates included in this study were from 16 countries in the past century (1920 to 2023). While the study uses S. Gallinarum as the prototype, the conclusion from this work will likely apply to other Salmonella serotypes and other pathogens.

      Thank you very much for your positive feedback. We recognize, as you noted, that emphasizing Salmonella enterica Serovar Gallinarum in the title may lead readers to perceive our methods and conclusions as overly restrictive. In light of your evaluation of our work, we have revised the title to: “Avian-specific Salmonella transition to endemicity is accompanied by localized resistome and mobilome interaction.” We believe this final version not only reflects the applicability of our conclusions, as you appreciated, but also addresses your previous suggestion to highlight the resistome and mobilome.

      Revisions in the manuscript Lines: 1-3

      Weaknesses:

      While the isolates came from 16 countries, most strains in this study were originally from China.

      We believe that this issue was discussed in detail in our previous response. Although potential bias exists, we have minimized its impact by constructing the largest global S. Gallinarum genome dataset to date. In addition, we have further emphasized these limitations in the manuscript.

      Comments on revisions:

      This reviewer is happy with the detailed responses from the authors regarding revising this manuscript. I do not have further comments.

      We greatly appreciate your positive feedback and are pleased that our responses have addressed your concerns.

      Reviewer #2 (Public review):

      Summary:

      The authors sequence 45 new samples of S. Gallinarum, a commensal Salmonella found in chickens, which can sometimes cause disease. They combine these sequences with around 500 from public databases, determine the population structure of the pathogen, and coarse relationships of lineages with geography. The authors further investigate known anti-microbial genes found in these genomes, how they associate with each other, whether they have been horizontally transferred, and date the emergence of clades.

      Strengths:

      - It doesn't seem that much is known about this serovar, so publicly available new sequences from a high burden region are a valuable addition to the literature.

      - Combining these sequences with publicly available sequences is a good way to better contextualise any findings.

      - The genomic analyses have been greatly improved since the first version of the manuscript, and appropriately analyse the population and date emergence of clades.

      - The SNP thresholds are contextualised in terms of evolutionary time.

      - The importance and context of the findings are fairly well described.

      Thank you so much for your thorough review and constructive comments on the manuscript.

      Weaknesses:

      -  There are still a few issues with the genomic analyses, although they no longer undermine the main conclusions:

      We are grateful for the valuable time and effort you have dedicated to improving our manuscript. In this revision, we have provided a point-by-point response to each of your concerns. Moreover, with the addition of new supplementary materials and modifications to the figures, we have re-examined and adjusted the numbering of figures and supplementary materials in the text to ensure they appear correctly in the manuscript.

      (1) Although the SNP distance is now considered in terms of time, the 5 SNP distance presented still represents ~7yrs evolution, so it is unlikely to be a transmission event, as described. It would be better to use a much lower threshold or describe the interpretation of these clusters more clearly. Bringing in epidemiological evidence or external references on the likely time interval between transmissions would be helpful.

      We sincerely thank you for highlighting this issue. We appreciate your concern regarding the use of a 5-SNP threshold to define a transmission event, especially given the approximate 7-year evolutionary timeframe. Considering our updated estimate for the evolutionary rate of S. Gallinarum (approximately 0.74 SNPs per year, with a 95% HPD range of 0.42 to 1.06), we have revised the manuscript to use a 2-SNP threshold (approximately representing less than two years of evolution) to better control the temporal span of transmission events. In addition, we have updated the manuscript to reflect this new threshold and demonstrated that the use of a more stringent SNP threshold does not affect the overall conclusions of the study.

      Specifically, we adopted the newly established 2-SNP threshold to update Figure 3a and corresponding Supplementary Figure 8. The heatmap on the far right of New Figure 3a illustrates the SNP distances among 45 newly isolated S. Gallinarum strains from two locations in Zhejiang Province (Taishun and Yueqing). New Supplementary Figure 8 simulates potential transmission events between the bvSP strains isolated from Zhejiang Province (n=95) and those from other regions of China with available provincial information (n=435). These analyses collectively demonstrate the localized transmission patterns of bvSP within China.

      For New Figure 3a, we found that even with the 2-SNP threshold, the number of potential transmission events among the 45 newly isolated S. Gallinarum strains from the two Zhejiang locations (Taishun and Yueqing) remains unchanged. In fact, we observed that the results from SNP tracing using an SNP threshold of less than 5 are consistent (see Author response image 1). 

      Author response image 1.

      Clustering results of 45 newly isolated S. Gallinarum strains using different SNP thresholds of 1, 2, 3, 4, and 5 SNPs. The five subplots represent the clustering results under each threshold. Each point corresponds to an individual strain, and lines connect strains with potential transmission relationships.

      For New Supplementary Figure 8, we employed the 2-SNP threshold and found that the number of transmission events between the bvSP strains isolated from Zhejiang Province (n=95) and those from other Chinese provinces (n=435) decreased from 91 to 53. The names of the strains involved in these potential transmission events are listed in Supplementary Table 5.
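      The effect of tightening the SNP threshold can be illustrated with a small single-linkage sketch. It assumes a symmetric pairwise SNP distance matrix and treats two isolates as linked when their distance is less than or equal to the threshold; the inclusive comparison, the function name, and the use of connected components are assumptions for illustration rather than a description of the tool used in the study.

```python
import numpy as np


def snp_clusters(dist, threshold):
    """Link isolates whose pairwise SNP distance is <= threshold.

    dist : symmetric array, shape (n_isolates, n_isolates)
    Returns (labels, n_links): a cluster label per isolate and the number
    of linked pairs (candidate transmission events).
    """
    dist = np.asarray(dist)
    n = dist.shape[0]
    parent = list(range(n))  # union-find forest over isolates

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    n_links = 0
    for i in range(n):
        for j in range(i + 1, n):
            if dist[i, j] <= threshold:
                n_links += 1
                parent[find(i)] = find(j)
    labels = [find(i) for i in range(n)]
    return labels, n_links
```

      Running the function once per threshold (1 to 5 SNPs) and comparing the returned link counts is one way to check whether the clustering is stable across thresholds.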

      Revisions in the manuscript

      Lines: 352-357

      Figures: Figure 3; Supplementary Figure 8

      Table: Supplementary Table 5

      (2) The HGT definition has not fundamentally been changed and therefore still has some issues, mainly that vertical evolution is still not systematically controlled for. 

      We sincerely thank you for highlighting this issue. We hope the following explanation will help clarify and improve our manuscript, as well as address your concerns.

      In bacteria, mobile genetic elements (MGEs) such as plasmids, transposons, integrons, and prophages, as mentioned in our manuscript, are segments of DNA that encode enzymes and proteins responsible for mediating the movement of genetic material between bacterial genomes (commonly referred to as “jumping genes”). These MGEs contribute to the mechanisms of horizontal gene transfer (HGT) in Salmonella, including transduction (via prophages), conjugation (via plasmids), and transposition (via integrons and transposons) (Nat Rev Microbiol. 2005 Sep;3(9):722-32). These “jumping genes” can enable Salmonella to acquire additional antimicrobial resistance genes (ARGs), which may not only originate from other Salmonella strains but also from distantly related species.

      To further address your concern regarding the systematic control of vertical evolution, we employed the HGTphyloDetect pipeline developed by Le Yuan et al. (Brief Bioinform. 2023 Mar 19;24(2):bbad035) to control for vertical evolution in the ARG sequences mentioned in our manuscript. We chose HGTphyloDetect because, as noted, "jumping genes" often occur among evolutionarily distant species, rendering the use of Gubbins potentially unsuitable for these distant HGT events.

      Using the HGTphyloDetect pipeline, we extracted base sequences for the eight ARGs shown in Figure 6b with an HGT frequency greater than zero (bla<sup>TEM-1B</sup>, sul1, dfrA17, aadA5, sul2, aph(3’’)-Ib, tet(A), aph(6)-Id). For bla<sup>TEM-1B</sup>, sul1, dfrA17, aadA5, and sul2, the HGT frequency reached 100% across different isolates, indicating that these ARG sequences have a unique sequence type. In contrast, due to the ResFinder settings requiring both similarity and coverage to meet a minimum value of 90%, the base sequences for aph(3’’)-Ib, tet(A), and aph(6)-Id are not unique. Consequently, we applied the HGTphyloDetect pipeline individually to each sequence type of ARGs to verify their association with HGT events. Specifically, among 436 bvSP isolates collected in China, we identified two sequence types of aph(3’’)-Ib, four sequence types of tet(A), and three sequence types of aph(6)-Id.

      Subsequently, to identify potential ARGs horizontally acquired from evolutionarily distant organisms, we queried the translated amino acid sequences of each ARG against the National Center for Biotechnology Information (NCBI) non-redundant protein database. We then evaluated whether these sequences were products of HGT by calculating Alien Index (AI) scores and out_perc values.

      The calculation of AI score is as follows:

      In this study, bbhG and bbhO represent the E-values of the best blast hit in ingroup and outgroup lineages, respectively. The outgroup lineage is defined as all species outside of the kingdom, while the ingroup lineage encompasses species within the kingdom but outside of the subphylum. An AI score ≥ 45 is considered a strong indicator that the gene in question is likely derived from an HGT event.
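      The equation itself is not reproduced in this response. For reference, the Alien Index is generally written as the difference of two log-transformed E-values; the form below is a sketch of that definition as it is understood to appear in the HGTphyloDetect publication. The small additive constant, which prevents taking the logarithm of an E-value of zero, is an assumption here and should be checked against that publication.

```latex
\mathrm{AI} = \ln\left(\mathrm{bbhG} + 10^{-200}\right) - \ln\left(\mathrm{bbhO} + 10^{-200}\right)
```

      Under this form, a positive AI means that the best outgroup hit has a smaller E-value than the best ingroup hit.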

      Regarding the calculation method for out_perc:

      Finally, according to the definition provided by the HGTphyloDetect pipeline, ARGs with AI score ≥ 45 and out_perc ≥ 90% are presumed to be potential candidates for HGT from evolutionarily distant species. We have compiled the calculation results for the aforementioned genes in New Supplementary Table 9. The results indicate that all ARGs presented in Figure 6b, which exhibited a HGT frequency greater than zero, were acquired horizontally by S. Gallinarum. Based on these findings, we have revised the manuscript accordingly.
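      The two criteria can be combined in a short sketch. The thresholds (AI ≥ 45 and out_perc ≥ 90%) come from the text above; everything else is an assumption made for illustration: the additive constant in the AI, the reading of out_perc as the percentage of BLAST hits that fall in the outgroup lineage, and all function names. None of this is the HGTphyloDetect code itself.

```python
import math

EPS = 1e-200  # assumed constant guarding against log(0)


def alien_index(bbh_ingroup, bbh_outgroup):
    """AI from the best-hit E-values in the ingroup and outgroup lineages."""
    return math.log(bbh_ingroup + EPS) - math.log(bbh_outgroup + EPS)


def out_perc(n_outgroup_hits, n_ingroup_hits):
    """Percentage of hits that belong to the outgroup lineage (assumed definition)."""
    total = n_outgroup_hits + n_ingroup_hits
    if total == 0:
        raise ValueError("no BLAST hits to evaluate")
    return 100.0 * n_outgroup_hits / total


def is_hgt_candidate(bbh_ingroup, bbh_outgroup, n_outgroup_hits, n_ingroup_hits,
                     ai_min=45.0, out_perc_min=90.0):
    """Apply both thresholds quoted in the text."""
    ai = alien_index(bbh_ingroup, bbh_outgroup)
    return ai >= ai_min and out_perc(n_outgroup_hits, n_ingroup_hits) >= out_perc_min
```

      A gene whose best outgroup hit has an E-value far smaller than its best ingroup hit, and whose hits fall almost entirely in the outgroup, passes both tests.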

      Revisions in the manuscript

      Lines: 302-307; 616-650; 955-957

      Table: Supplementary Table 9

      Using a 5kb window is not sufficient, as LD may extend across the entire genome.

      We agree with your point that linkage disequilibrium (LD) could influence the transmission of genes within chromosomal regions. LD can lead to the non-random co-occurrence of alleles at different loci within a population. Considering that horizontal gene transfer (HGT) events involving more distantly related ARGs may be accompanied by vertical propagation on chromosomes, and to simultaneously assess the impact of LD, we conducted two evaluations.

      It is important to note that the following assessments are based on the assumption that plasmid replicons detected by PlasmidFinder are part of self-replicating, extrachromosomal DNA.

      (1) In the revised pipeline used to calculate ARG HGT frequencies, we categorized a total of 621 ARGs carried by 436 bvSP isolates collected in China and found that 415 of these ARGs were located on MGEs. We further investigated the distribution of these 415 ARGs across different MGEs, taking into account the complex nesting relationships among them. We observed that 90% of the ARGs (372/415) were located on plasmid contigs. It is important to clarify that this finding does not contradict our statement in the manuscript regarding plasmids and transposons as the primary reservoirs for resistome geo-temporal dissemination. This is because transposons, integrons, and prophages carrying ARGs can also be found on plasmids. Additionally, only 25 bvSG isolates from China contained ARGs, which were likely acquired via transposons or integrons located on the chromosome.  

      (2) In our manuscript, we searched for ARGs within a 5kb upstream and downstream region (a total of 10kb) of transposons and integrons (the BLASTn parameters used in the Bacant pipeline to identify transposons and integrons were set to a coverage threshold of 60%, rather than 100%). However, in light of the potential impact of LD on vertical transmission, we expanded our search to a 10kb upstream and downstream range (a total of 20kb) for these 25 isolates. This decision rests on two considerations: 1) Based on the literature, we determined the overall lengths of the integrons and transposons carried by the 25 isolates (Tn801, Tn6205, Tn1721, In498, In1440, In473, and In282) and found that the maximum length of these elements is ~13.5 kb, so a 10kb upstream and downstream threshold effectively covers them. 2) Genomic fragmentation in next-generation sequencing assemblies restricts how far the search range can be extended. We present the results of this expanded search for colocalization of ARGs with transposons and integrons at Figshare: https://doi.org/10.6084/m9.figshare.28129130.v1

      We found that these results were consistent with those obtained using the previous search range.
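      The window-based co-location test described above can be sketched as an interval check on each contig. This is an illustration only: the `Feature` class, the function name, the overlap rule (an ARG counts if any part of it falls inside the extended window), and the coordinates in the usage example are hypothetical and are not taken from the authors' pipeline.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Feature:
    """A named interval on a contig (1-based, inclusive coordinates assumed)."""
    contig: str
    start: int
    end: int
    name: str


def colocated_args(args, mges, flank=5_000):
    """Return (ARG, MGE) name pairs where the ARG overlaps the MGE
    extended by `flank` bp upstream and downstream on the same contig."""
    pairs = []
    for mge in mges:
        lo, hi = mge.start - flank, mge.end + flank
        for arg in args:
            same_contig = arg.contig == mge.contig
            if same_contig and arg.end >= lo and arg.start <= hi:
                pairs.append((arg.name, mge.name))
    return pairs


# Hypothetical coordinates, for illustration only.
mges = [Feature("contig_1", 20_000, 26_000, "Tn1721")]
args = [
    Feature("contig_1", 33_000, 34_200, "tet(A)"),
    Feature("contig_2", 1_000, 1_800, "sul2"),
]
narrow = colocated_args(args, mges, flank=5_000)
wide = colocated_args(args, mges, flank=10_000)
```

      Comparing `narrow` and `wide` shows which associations appear only after the window is widened; if the two lists agree across a dataset, the conclusion does not depend on the window size.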

      Taken together, these results suggest that although linkage disequilibrium may influence genetic processes within chromosomal regions, particularly for the few chromosome-associated antibiotic resistance genes linked to integrons and transposons, the overall impact in our study is likely minimal. This conclusion is supported by the observation that 90% of the ARGs in our dataset are located on plasmids, and even an expanded search range does not alter this outcome. Additionally, by incorporating Alien Index scores and calculating out_perc, we can further confirm the occurrence of horizontal gene transfer events.

      However, it is undeniable that other studies using our current pipeline may be affected. As a temporary remedial measure, we have included the following note in the "README" file (https://github.com/tjiaa/Cal_HGT_Frequency):

      “Note: Considering that ARGs located on the chromosome and carried by mobile genetic elements—such as integrons and transposons—may introduce potential computational errors, we recommend evaluating the number of ARGs associated with these elements on the chromosome during your analysis. If a majority of ARGs in your dataset fall into this category, we suggest using additional methods to evaluate the potential impact of linkage disequilibrium. Additionally, by modifying the “MGE_start” and “MGE_end” parameters in the “eLife_MGE_ARG_Co_location.ipynb” script, you can assess the distance between different ARGs and integrons or transposons on the chromosome. This approach will further aid in evaluating the impact of linkage disequilibrium on the genetic process.”

      We believe this approach will assist researchers in further assessing the potential impact of vertical evolution and help other users determine whether additional methods are necessary to account for such effects.

      As the authors have now run gubbins correctly, they could use the results from this existing analysis to find recent HGT.

      We sincerely thank you for your valuable suggestion. Utilizing additional methods to predict potential horizontal gene transfer (HGT) events could indeed enhance the robustness of the results. However, "jumping genes" often occur among evolutionarily distant species, rendering the use of Gubbins potentially unsuitable for these distant HGT events.

      Furthermore, the primary focus of our study is to identify HGT of antimicrobial resistance genes (ARGs) in the Salmonella genome driven by mobile genetic elements. Therefore, we employed the HGTphyloDetect pipeline developed by Le Yuan et al. (Brief Bioinform. 2023 Mar 19;24(2):bbad035) to control for vertical evolution in the ARG sequences. The specific computational methods and conclusions have been detailed above.

      To define mobilisation, perhaps a standard pipeline (e.g. https://github.com/EBI-Metagenomics/mobilome-annotation-pipeline) would be more convincing.

      Thank you for your valuable suggestion. We agree that defining mobilization using a standardized pipeline can add rigor and clarity to our analysis. The pipeline you referenced (https://github.com/EBI-Metagenomics/mobilome-annotation-pipeline) is an excellent resource and provides a robust approach to the identification and annotation of mobile genetic elements.

      We have examined and run this pipeline, which uses “IntegronFinder” and “ICEfinder” to detect integrons, “geNomad” to identify plasmids, and “geNomad” and “VIRify” to detect prophages. Our initial checks revealed that the numbers of integrons, plasmids, and prophages identified using this pipeline were consistent with those detected in our study. However, due to the significantly different output formats, the results from this pipeline could not be integrated with the pipeline we used for calculating HGT frequency.

      We will incorporate the standardized pipeline you suggested in future studies to further improve the reliability of our findings.

      (3) The invasiveness index is better described, but the authors still did not provide convincing evidence that the small difference is actually biologically meaningful (there was no statistical difference between the two strains provided in response Figure 6). What do other Salmonella papers using this approach find, and can their links be brought in? If there is still no good evidence, a better description of this difference would help make the conclusions better supported.

      We sincerely appreciate your thoughtful feedback. The initial introduction of the invasiveness index in our manuscript aimed to quantitatively assess the differences in invasiveness between two geographically distinct strains of S. Gallinarum (isolated from Taishun and Yueqing) by comparing the degradation of 196 top predicted genes associated with invasiveness in their genomes. We found a highly significant statistical difference (P < 0.0001) in the invasiveness index between them.

      Several studies have also employed the invasiveness index to predict biological relevance in Salmonella strains, and we believe these examples provide further context for our approach:

      (1) Caisey V. Pulford et al, Nat Microbiol, 2021, used the same method to calculate the invasiveness index for Salmonella Typhimurium and employed it to characterize the invasiveness of different lineage strains. They found that Salmonella in Lineage-3 exhibited the highest invasiveness index, suggesting an adaptation from an intestinal to a systemic lifestyle. The authors noted, "Although the invasiveness index cannot yet be experimentally validated, Salmonella isolates with different invasiveness indices produce distinct clinical symptoms in a human population (BMC Med. 2020 Jul 17; 18(1):212)". They emphasized the necessity of developing more robust methods to measure Salmonella invasiveness.

      (2) Sandra Van Puyvelde et al, Nat Commun, 2019, reported that Salmonella Typhimurium sequence type 313 (ST313) lineage II.1 exhibited a higher invasiveness index compared to lineage II, suggesting that the two lineages might have distinct adaptations to an invasive lifestyle. Further experiments demonstrated significant differences between these lineages in terms of biofilm formation (A red dry and rough (RDAR) assay) and metabolic capacity for carbon compounds.

      (3) Wim L. Cuypers et al, Nat Commun, 2023, calculated the invasiveness index for 284 global Salmonella Concord strains across different lineages and found that Lineage-4 potentially exhibited the highest invasiveness.

      We acknowledge that no significant difference in mortality was observed between the L2b and L3b S. Gallinarum strains in 16-day-old SPF chicken embryos. Nevertheless, the literature cited above suggests that strains with higher invasiveness indices may still differ in biofilm formation and metabolic capacity, reflecting their adaptation to different host environments. As such, we maintain that the invasiveness index remains a valuable metric for evaluating the genomic differences between S. Gallinarum strains from Taishun and Yueqing. We plan to investigate these differences further through phenotypic experiments in future work.

      In the revised manuscript, we have added the following discussion along with additional references:

      Lines 358-365: “Moreover, the invasiveness index of bvSP from Taishun and Yueqing suggests that different lineages of S. Gallinarum recovered from distinct regions may exhibit biological differences. Previous studies have shown that strains with higher invasiveness indexes tend to be more virulent in hosts (30, 31), potentially causing neurological or arthritic symptoms in S. Gallinarum infections. Furthermore, strains with varying invasiveness indexes have been confirmed to differ in their biofilm formation abilities and metabolic capacities for carbon compounds (32).”

      Revisions in the manuscript:

      Lines: 358-365, 806-827.

      In summary, the analysis is broadly well described and feels appropriate. Some of the conclusions are still not fully supported, although the main points and context of the paper now appear sound.

      Thank you so much for your positive evaluation of our work. We hope that the revised manuscript meets your expectations and offers a more accurate interpretation of our findings.

      Recommendations for the authors:

      Reviewer #2 (Recommendations for the authors):

      This is a great improvement over the first version and I thank the authors for a thorough response, as well as changing their conclusions in response to their improvements.

      Other small remaining issues:

      Figure 3: Heatmap of SNPs is hard to read in grayscale. It also just represents the between clade distances already shown by the tree. It would be more useful to present intraclade distances only to see the SNP resolution _within_ each lineage. Using a better colour scheme would also help.

      Thank you for your insightful comments and suggestions regarding Figure 3. We agree that the grayscale heatmap may present challenges in terms of visual clarity. To address this, we have updated the heatmap with a more distinct color gradient, ensuring better contrast and easier interpretation (New Figure 3). 

      Regarding your second suggestion: "It would be more useful to present intraclade distances only to see the SNP resolution within each lineage," we believe it is already addressed in the current version of New Figure 3. Specifically, the heatmap on the right side of New Figure 3 illustrates the SNP distances between S. Gallinarum isolates from Taishun and Yueqing, with the goal of demonstrating that genomic variation within isolates from a single region is generally smaller compared to those from different regions. In this figure, 45 newly isolated S. Gallinarum strains are categorized into two lineages: L2b and L3b. The heatmap on the right side of Figure 3 displays the SNP distances between all pairwise combinations of these 45 strains, where the intraclade distances are represented by the red regions (highlighting the pairwise distances within each lineage, specifically L3b and L2b, which are indicated by two triangles). The between-clade distances are shown by the blue regions.

      We also see value in exploring the intraclade distances across the entire dataset of 580 S. Gallinarum strains, as this could provide additional insights. However, this analysis would extend beyond the scope of the current section.

      Revisions in the manuscript Line: 998

      Figure: Figure 3

      Please remove Figure 6c, it does not add anything to the paper and raises questions about performing this regression.

      Thank you for pointing out this issue. We have removed Figure 6c and the corresponding description in the "Results" section from the manuscript (New Figure 6).

      Revisions in the manuscript Lines: 316, 319, 1035-1041.

      Figure: Figure 6

      Again, thank you all for your time and efforts in reviewing our work. We believe the improved manuscript meets the high standards of the journal.

    1. Author response:

      The following is the authors’ response to the original reviews.

      Public Reviews:

      Reviewer #1 (Public Review):

      Weaknesses & incompletely supported claims:

      (1) A central mechanistic claim of the paper is that "DCP1a can regulate DCP2's cellular decapping activity by enhancing DCP2's affinity to RNA, in addition to bridging the interactions of DCP2 with other decapping factors. This represents a pivotal molecular mechanism by which DCP1a exerts its regulatory control over the mRNA decapping process." Similar versions of this claim are repeated in the abstract and discussion sections. However, this appears to be entirely at odds with the observation from in vitro decapping assays with immunoprecipitated DCP2 that showed DCP1 knockout does not significantly affect the enzymatic activity of DCP2 (Figures 2B-D; I note that there may be a very small change in DCP2 activity shown in panel C, but this may be due to slightly different amounts of immunoprecipitated DCP2 used in the assay, as suggested by panel D). If DCP1 pivotally regulates decapping activity by enhancing RNA binding to DCP2, why is no difference in decapping activity observed in the absence of DCP1?

      Furthermore, the authors show only weak changes in relative RNA levels immunoprecipitated by DCP2 with versus without DCP1 (~2-3 fold change; consistent with the Valkov 2016 NSMB paper, which shows what looks like only modest changes in RNA binding affinity for yeast Dcp2 +/- Dcp1). Is the argument that only a 2-3 fold change in RNA binding affinity is responsible for the sizable decapping defects and significant accumulation of deadenylated intermediates observed in cells upon Dcp1 depletion? (and if so, why is this the case for in-cell data, but not the immunoprecipitated in vitro data?)

      We appreciate the reviewer's thoughtful comments on our paper. The reviewer points out an apparent contradiction between the claim that DCP1a regulates DCP2's cellular decapping activity and the observation that knocking out DCP1a does not significantly affect DCP2's enzymatic activity in vitro. However, it is important to underscore the challenge of reconciling differences between in vitro and in vivo experiments in scientific research. Although in vitro systems provide a controlled environment, they have inherent limitations that often fail to capture the complexities of cellular processes. Our in vitro experiments used immunoprecipitated proteins to ensure the presence of relevant factors, but these experiments cannot fully replicate the precise stoichiometry and dynamic interactions present in a cellular environment. Furthermore, the limited volume in vitro can actually facilitate reactions that may not occur as readily in the complex and heterogeneous environment of a cell. Therefore, the lack of a significant difference in decapping activity observed in vitro does not necessarily negate the regulatory role of DCP1 in the cellular context. Rather, it underscores our previous oversight of DCP1's importance in the decapping process under in vitro conditions. The conclusions regarding DCP1's regulatory mechanisms remain valid and supported by the presented evidence, especially when considering the inherent differences between in vitro and in vivo experimental conditions. It is precisely because of these differences that we recognized our previous underestimation of DCP1's significance. Therefore, our subsequent experiments focused on elucidating DCP1's regulatory mechanisms in the decapping process.

      The authors acknowledge this apparent discrepancy between the in vitro DCP2 decapping assays and in-cell decapping data, writing: "this observation could be attributed to the inherent constraints of in vitro assays, which often fall short of faithfully replicating the complexity of the cellular environment where multiple factors and cofactors are at play. To determine the underlying cause, we postulated that the observed cellular decapping defect in DCP1a/b knockout cells might be attributed to DCP1 functioning as a scaffold." This is fair. They next show that DCP1 acts as a scaffold to recruit multiple factors to DCP2 in cells (EDC3, DDX6, PatL1, and PNRC1 and 2). However, while DCP1 is shown to recruit multiple cofactors to DCP2 (consistent with other studies in the decapping field, and primarily through motifs in the Dcp1 C-terminal tail), the authors ultimately show that *none* of these cofactors are actually essential for DCP2-mediated decapping in cells (Figures 3A-F). More specifically, the authors showed that the EVH1 domain was sufficient to rescue decapping defects in DCP1a/b knockout cells, that PNRC1 and PNRC2 were the only cofactors that interact with the EVH1 domain, and finally that shRNA-mediated PNRC1 or PNRC2 knockdown has no effect on in-cell decapping (Figures 3E and F). Therefore, based on the presented data, while DCP1 certainly does act as a scaffold, it doesn't seem to be the case that the major cellular decapping defect observed in DCP1a/b knockout is due to DCP1's ability to recruit specific cofactors to DCP2.

      The findings that none of the decapping cofactors recruited by DCP1 to DCP2 are essential for decapping in cells further underscore the complexity of the decapping process in vivo. This observation suggests that while DCP1's scaffolding function is crucial for recruiting cofactors, the decapping process likely involves additional layers of regulation that are not fully captured by our current understanding of DCP1. Furthermore, the reviewer mentions that the observed changes in RNA binding affinity (approximately 2-3 fold) in our in vitro experiments seem relatively modest. While these changes may appear insignificant in vitro, their cumulative impact in the dynamic cellular environment could be substantial. Even minor perturbations in RNA binding affinity can trigger cascading effects, leading to significant changes in decapping activity and the accumulation of deadenylated intermediates upon Dcp1 depletion. Cellular processes involve complex networks of interrelated events, and small molecular changes can result in amplified biological outcomes. The subtle molecular variations observed in vitro may translate into significant phenotypic outcomes within the complex cellular environment, underscoring the importance of DCP1a's regulatory role in the cellular decapping process.

      So as far as I can tell, the discrepancy between the in vitro (DCP1 not required) and in-cell (DCP1 required) decapping data, remains entirely unresolved. Therefore, I don't think that the conclusions that DCP1 regulates decapping by (a) changing RNA binding affinity (authors show this doesn't matter in vitro, and that the change in RNA binding affinity is very small) or (b) by bridging interactions of cofactors with DCP2 (authors show all tested cofactors are dispensable for robust in-cell decapping activity), are supported by the evidence presented in the paper (or convincingly supported by previous structural and functional studies of the decapping complex).

      We have addressed the reconciliation of differences between in vitro and in vivo experiments in the revised manuscript and emphasized the importance of considering cellular interactions when interpreting our findings.

      (2) Related to the RNA binding claims mentioned above, are the differences shown in Figure 3H statistically significant? Why are there no error bars shown for the MBP control? (I understand this was normalized to 1, but presumably, there were 3 biological replicates here that have some spread of values?). The individual data points for each replicate should be displayed for each bar so that readers can better assess the spread of data and the significance of the observed differences. I've listed these points as major because of the key mechanistic claim that DCP1 enhances RNA binding to DCP2 hinges in large part on this data.

      Thank you for your feedback. Regarding your comments on the statistical significance of the differences shown in Figure 3H and the absence of error bars for the MBP control, we will address these concerns in the revised manuscript. We’ll include individual data points for the three biological replicates and corresponding statistical analysis to more clearly demonstrate the data spread and significance of the observed differences.

      (3) Also related to point (1) above, the kinetic analysis presented in Figure 2C shows that the large majority of transcript is mostly decapped at the first 5-minute timepoint; it may be that DCP2-mediated decapping activity is actually different in vitro with or without DCP1, but that this is being missed because the reaction is basically done in less than 5 minutes under the conditions being assayed (i.e. these are basically endpoint assays under these conditions). It may be that if kinetics were done under conditions to slow down the reaction somewhat (e.g. lower Dcp2 concentration, lower temperatures), so that more of the kinetic behavior is captured, the apparent discrepancy between in vitro and in-cell data would be much less. Indeed, previous studies have shown that in yeast, Dcp1 strongly activates the catalytic step (kcat) of decapping by ~10-fold, and reduces the KM by only ~2 fold (Floor et al, NSMB 2010). It might be beneficial to use purified proteins here (only a Western blot is used in Figure 2D to show the presence of DCP2 and/or DCP1, but do these complexes have other, and different, components immunoprecipitated along with them?), if possible, to better control reaction conditions.

      This contradiction between the in vitro and in-cell decapping data undercuts one of the main mechanistic takeaways from the first half of the paper. This needs to be addressed/resolved with further experiments to better define the role of DCP1-mediated activation, or the mechanistic conclusions significantly changed or removed.
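
      The reviewer's concern about endpoint assays in point (3) can be illustrated with a small calculation. The sketch below assumes simple first-order (single-exponential) decapping and uses two purely hypothetical observed rate constants that differ 10-fold; neither value comes from the manuscript or from Floor et al. It only shows that a 10-fold rate difference can be nearly invisible at a late timepoint while being obvious at an early one.

```python
import math


def fraction_decapped(k_obs_per_min: float, t_min: float) -> float:
    """Fraction of substrate decapped at time t, assuming first-order kinetics."""
    return 1.0 - math.exp(-k_obs_per_min * t_min)


# Hypothetical observed rate constants (per minute), chosen only for illustration.
K_WITHOUT_DCP1 = 1.0   # assumed slower rate
K_WITH_DCP1 = 10.0     # assumed 10-fold faster rate

for t in (0.25, 1.0, 5.0):
    slow = fraction_decapped(K_WITHOUT_DCP1, t)
    fast = fraction_decapped(K_WITH_DCP1, t)
    # Report both fractions and their difference at each timepoint.
    print(f"t = {t:>4} min: slow = {slow:.3f}, fast = {fast:.3f}, "
          f"difference = {fast - slow:.3f}")
```

      Under these assumed values both reactions are essentially complete by 5 minutes, so the two conditions look alike, whereas sampling at 15 seconds separates them clearly. Real decapping kinetics may not be single-exponential, so this is a qualitative illustration only.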

      We genuinely appreciate the reviewer’s insightful comments on the kinetic analysis presented in Figure 2C. Your astute observation regarding the potential influence of reaction duration on the interpretation of in vitro decapping activity, especially in the absence of DCP1, is well-received. The time-sensitive nature of our experiments, as you rightly pointed out, might not fully capture the nuanced kinetic behaviors. In addition, the DCP2 complex purified from cells could not be precisely quantified. In response to your suggestion, we attempted to purify human DCP2 protein from E. coli; however, regrettably, the purified protein failed to exhibit any enzymatic activity. This disparity may be attributed to species differences.

      Considering the reviewer’s valuable insights, our revised manuscript now emphasizes that DCP2 purified from cells exhibits activity regardless of the presence of DCP1. This adjustment aims to provide a clearer perspective on our findings and to better align our conclusions with the limits of our experimental design.

      (4) The second half of the paper compares the transcriptomic and metabolic profiles of DCP1a versus DCP1b knockouts to reveal that these target a different subset of mRNAs for degradation and have different levels of cellular metabolites. This is a great application of the DCP1a/b KO cells developed in this paper and provides new information about DCP1a vs b function in metazoans, which to my knowledge has not really been explored at all. However, the analysis of DCP1 function/expression levels in human cancer seems superficial and inconclusive: for example, the authors conclude that "...these findings indicate that DCP1a and DCP1b likely have distinct and non-redundant roles in the development and progression of cancer", but what is the evidence for this? I see that DCP1a and b levels vary in different cancer cell types, but is there any evidence that these changes are actually linked to cancer development, progression, or tumorigenesis? If not, these broader conclusions should be removed.

      Thank you to the reviewer for pointing out that such a description may be misleading. We have removed our previous broader conclusion and revised our sentences. To further explore the potential impact of DCP1a and DCP1b on cancer progression, we examined the association between the expression levels of DCP1a and DCP1b and progression-free interval (PFI). We have incorporated this information into our revised manuscript.

      (5) The authors used CRISPR-Cas9 to introduce frameshift mutations that result in premature termination codons in DCP1a/b knockout cells (verified by Sanger sequencing). They then use Western blotting with DCP1a or DCP1b antibodies to confirm the absence of DCP1 in the knockout cell lines. However, the DCP1a antibody used in this study (Sigma D5444) is targeted to the C-terminal end of DCP1a. Can the authors conclusively rule out that the CRISPR/Cas-generated mutations do not result in the production of truncated DCP1a that is just unable to be detected by the C-terminally targeted antibody? While it is likely the introduced premature termination codon in the DCP1a gene results in nonsense-mediated decay of the resulting transcript, this outcome is indeed supported by the knockout results showing large defects in cellular decapping which can be rescued by the addition of the EVH1 domain, it would be better to carefully validate the success of the DCP1a knockout and conclusively show no truncated DCP1a is produced by using N-terminally targeted DCP1a antibodies (as was the case for DCP1b).

      Thank you for your insightful comment regarding the validation of our DCP1a/b knockout cell line. We acknowledge your point that the Sigma D5444 antibody used in our Western blot analysis targets the C-terminus of DCP1a. We agree that we cannot definitively rule out the production of truncated DCP1a protein based solely on the lack of full-length protein detection. To address this limitation, we used a commercially available N-terminally targeted DCP1a antibody (Aviva ARP39353_T100) in a Western blot analysis. This allows us to detect any truncated protein fragments remaining after the CRISPR-Cas9-generated frameshift mutation.

      Some additional minor comments:

      • More information would be helpful on the choice of DCP1 truncation boundaries; why was 1-254 chosen as one of the truncations?

      Thank you for the reviewer's comment and suggestion. The DCP1 1-254 truncation boundaries were chosen based on the predicted structure from AlphaFoldDB (A0A087WT55). We will include this information in the revised manuscript.

      • Figure S2D is a pretty important experiment because it suggests that the observed deadenylated intermediates are in fact still capped; can a positive control be added to these experiments to show that removal of cap results in rapid terminator-mediated degradation?

      Unfortunately, due to our institution's current laboratory safety policies, we are unable to perform experiments involving the use of radioactive isotopes such as 32P. Therefore, while adding the suggested positive control experiment to demonstrate rapid RNA degradation upon decapping would further validate our interpretation, we regret that we cannot carry out this experiment at the moment. However, the observed deadenylated intermediates in Figure S2D match the predicted size of capped RNA fragments, and not the expected sizes of degradation products after decapping. Furthermore, previous literature has well-established that for these types of RNAs, decapping leads directly to rapid 5' to 3' exonuclease-mediated degradation, without producing stable deadenylated intermediates. Thus, we believe that the current data is sufficient to support our conclusion that the deadenylated intermediates retain the 5' cap structure.

      Reviewer #2 (Public Review):

      Weaknesses:

      The direct targets of DCP1a and/or DCP1b were not determined as the analysis was restricted to RNA-seq to assess RNA abundance, which can be a result of direct or indirect regulation by DCP1a/b.

      Thank you for raising this important point. In our study, we acknowledge that the use of RNA-seq to assess RNA abundance provides a broad overview of the regulatory impacts of DCP1a and DCP1b. This method captures changes in RNA levels that may arise from both direct and indirect regulatory actions of these proteins. While we did not directly determine the targets of DCP1a and DCP1b, the data obtained from our RNA-seq analysis serve as a foundational step for future targeted experiments, which could include techniques such as RIP-seq, to delineate the direct targets of DCP1a and DCP1b more precisely. We believe that our current findings contribute valuable information to the field and pave the way for these subsequent analyses.

      P-bodies appear to be larger in human cells lacking DCP1a and DCP1b but a lack of image quantification prevents this conclusion from being drawn.

      Thank you for the reviewer’s valuable feedback. We have addressed the reviewer’s concern regarding P-bodies' size in human cells lacking DCP1a and DCP1b. We have now performed image quantification and can confirm that P-bodies are indeed larger in these cells.

      The lack of details in the methodology and figure legends limit reader understanding.

      We acknowledge the reviewer's concerns regarding the level of detail provided in the methodology and figure legends. To address this, we are committed to enhancing both sections with additional details and clarifications in our revised manuscript. Thank you for bringing this to our attention.

      Recommendations for the authors:

      Reviewer #1 (Recommendations For The Authors):

      (1) To me, the second half of the paper comparing DCP1a and DCP1b is in many ways distinct from the first half and could stand on its own as an interesting paper if this comparative analysis is explored a little deeper (maybe by validating some of the differences in decay observed for individual mRNAs targeted by DCP1a versus DCP1b, by measuring and comparing the decay rates of some individual transcripts under differential control by DCP1a vs b?), and revising the conclusions about links to cancer as mentioned above. I think these later comparative results in the paper present the most new and interesting data concerning DCP1 function in humans (especially since I think the mechanistic conclusions from the first half aren't well supported yet or are at least inconsistent), but when I read these later sections of the paper I struggle to understand the key takeaways from the transcriptomic and metabolomic data.

      Thank you for the reviewer's suggestions. Estimating the decay rates of individual transcripts within the transcriptomes of DCP1a_KO, DCP1b_KO, and wild type can provide insight into the direct targets of DCP1a or DCP1b. However, this requires either time-series RNA-seq or specialized sequencing technologies such as Precision Run-On sequencing (PRO-seq) or RNA Approach to Equilibrium Sequencing (RATE-Seq). Unfortunately, we lack the necessary dataset in our project to estimate the decay rates for the potential targets identified in our RNA-seq data. Despite this limitation, we acknowledge the potential of this approach in identifying the true targets of DCP1a and DCP1b and have included this idea in our discussion.

      (2) I think it would be helpful to add a little more descriptive or narrative language to the figure legends (I know some of them are already quite long!) so that readers can follow the general idea of the experiment through the figure legend as well as the main text; as written, the figure legends are mostly exclusively technical details, so it can be hard to parse what experiment is being carried out in some cases.

      Thank you for the reviewer’s suggestion. We will revise the figure legends so that they retain the technical details while clearly conveying the main idea of each experiment, making it easier for readers to follow which experiment is being carried out.

      Reviewer #2 (Recommendations For The Authors):

      Suggestions for improved or additional experiments, data, or analyses:

      The use of RNA-seq to measure RNA abundance in DCP1a and/or b knockout cells can give some insight into both the indirect and direct effects of DCP1a/b on gene expression but cannot identify the direct targets of these genes. Rather, global analysis of RNA stability or capturing uncapped RNA decay intermediates would allow the authors to conclude they have identified direct targets of DCP1a and/or b. Without such analyses, the interpretation of these data should be scaled back to clearly state that RNA levels can be altered through indirect effects of DCP1a/b absence throughout the text.

      We appreciate the reviewer's suggestion. We have modified our sentences to emphasize that the dysregulated genes could be caused by both direct and indirect effects.

      A control/randomly generated gene list should be analyzed for GO terms to determine whether the enrichment of cancer-related pathways in the differentially expressed genes in the DCP1a/b knockout cells is meaningful.

      Thank you for the reviewer's comment. We shuffled our gene list and repeated the pathway enrichment analysis in Figure 4C and 4D 1,000 times. We focused on the following cancer-related pathways: E2F targets, MTORC1 signaling, G2M checkpoint, MYC target V1, EMT transition, KRAS signaling DN, P53 pathway, and NOTCH signaling. We then counted how many times the q-value obtained from a shuffled gene list was more significant than the q-value obtained from our real data. In four of the eight pathways (E2F targets, MTORC1 signaling, G2M checkpoint, and MYC target V1), none of the shuffled gene lists resulted in a q-value smaller than the real one. In the other four pathways (EMT transition, KRAS signaling DN, P53 pathway, and NOTCH signaling), the shuffled q-values were smaller than the real q-value 2, 11, 4, and 4 times out of the 1,000 shuffles, respectively. Based on the shuffled results, we conclude that the transcriptome of DCP1a/b knockout cells is significantly enriched in these cancer-related pathways.

      Author response image 1.

      Distribution of q-values resulting from the Gene Set Enrichment Analysis (GSEA) conducted on 1,000 shuffled gene lists for eight cancer-related pathways. The q-values derived from Figure 4C and 4D are indicated by red (DCP1a_KO) and blue (DCP1b_KO) dashed lines, respectively. Some q-values derived from Figure 4C are too small to be labeled on the plots, such as in E2F targets (q value: 5.87E-07), MTORC1 signaling (q values: 6.59E-07 and 1.58E-06 for DCP1a_KO and DCP1b_KO, respectively), MYC target V1 (q value: 0.004644174 for DCP1a_KO), etc. The numbers x/1000 indicate how often the shuffled q-values were smaller than the real q-value out of 1,000 permutations.
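
      The shuffling counts reported above can be turned into empirical permutation p-values. The sketch below is an illustration rather than the authors' analysis: the function name `empirical_p` is hypothetical, and the add-one estimator (k + 1) / (n + 1) is an assumption introduced here as a common conservative choice. The counts themselves are the ones stated in the response.

```python
def empirical_p(n_more_extreme: int, n_permutations: int, add_one: bool = True) -> float:
    """Empirical p-value from the number of shuffles that beat the real result.

    With add_one=True the conservative estimator (k + 1) / (n + 1) is used,
    which never returns exactly zero. This estimator is an assumption of this
    sketch, not something stated in the response.
    """
    if n_permutations <= 0:
        raise ValueError("n_permutations must be positive")
    if not 0 <= n_more_extreme <= n_permutations:
        raise ValueError("n_more_extreme must lie between 0 and n_permutations")
    if add_one:
        return (n_more_extreme + 1) / (n_permutations + 1)
    return n_more_extreme / n_permutations


# Number of shuffled gene lists (out of 1,000) whose q-value was smaller
# than the real q-value, as reported in the response.
N_SHUFFLES = 1000
counts = {
    "E2F targets": 0,
    "MTORC1 signaling": 0,
    "G2M checkpoint": 0,
    "MYC target V1": 0,
    "EMT transition": 2,
    "KRAS signaling DN": 11,
    "P53 pathway": 4,
    "NOTCH signaling": 4,
}

p_values = {name: empirical_p(k, N_SHUFFLES) for name, k in counts.items()}
for name, p in p_values.items():
    print(f"{name}: empirical p = {p:.4f}")
```

      Note that the response counts shuffles with a strictly smaller q-value, whereas the add-one estimator is usually defined for results at least as extreme, so the values printed here are approximate.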

      Comparisons of the DCP1a and/or b knockout RNA-seq results should be done to published datasets such as those published by Luo et al., Cell Chemical Biology (2021) to determine whether there are common targets with DCP2 and validate the reported findings.

      Thank you for the reviewer’s suggestion. We compared the upregulated genes from the DCP1a_KO, DCP1b_KO, and DCP1a/b_KO cell lines with the 91 targets of DCP2 identified by Luo et al. in Cell Chemical Biology (2021). Only EPPK1 overlapped between the potential DCP1b_KO targets and the targets of DCP2. No genes overlapped between the potential DCP1a_KO targets and the targets of DCP2. However, three genes, TES, PAX6, and C18orf21, overlapped between the significantly upregulated DEGs of DCP1a/b_KO and the targets of DCP2. We have included this information in the discussion section.
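
      A comparison of this kind amounts to intersecting gene lists. The following is a minimal sketch, assuming gene symbols are compared as exact, case-sensitive strings. The helper name `shared_genes` and every `PLACEHOLDER_*` entry are hypothetical stand-ins for the full gene lists, which are not reproduced in this response; only EPPK1, TES, PAX6, and C18orf21 are taken from the text.

```python
def shared_genes(upregulated, reference_targets):
    """Return the sorted gene symbols present in both collections."""
    return sorted(set(upregulated) & set(reference_targets))


# Toy stand-in for the published DCP2 target list (the real list has 91 genes).
dcp2_targets = ["EPPK1", "TES", "PAX6", "C18orf21", "PLACEHOLDER_TARGET_1"]

# Toy stand-ins for the upregulated genes in each knockout line.
dcp1a_ko_up = ["PLACEHOLDER_GENE_1"]
dcp1b_ko_up = ["EPPK1", "PLACEHOLDER_GENE_2"]
dcp1ab_ko_up = ["TES", "PAX6", "C18orf21", "PLACEHOLDER_GENE_3"]

for label, genes in [
    ("DCP1a_KO", dcp1a_ko_up),
    ("DCP1b_KO", dcp1b_ko_up),
    ("DCP1a/b_KO", dcp1ab_ko_up),
]:
    print(label, shared_genes(genes, dcp2_targets))
```

      In practice, gene identifiers from different datasets often need to be harmonized (for example, mapped to a common annotation release) before such an intersection is meaningful.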

      The RNA tethering assays are not clear and are difficult to interpret without further controls to delineate the polyadenylated and deadenylated species.

      Thank you for the reviewer’s feedback. We acknowledge that the reviewer might harbor some doubts regarding the outcomes of the RNA tethering assays. Nonetheless, this methodology is well-established and has also found extensive application across many studies. We are committed to enhancing the clarity of our experiment’s details and results within the figure legends and textual descriptions.

      The representative images of p-bodies clearly show that DCP1a/b KO cells have larger p-bodies than the wild-type cells. The authors should quantify p-body size in each image set as the current interpretation of the data is that there is no difference in size or number of p-bodies, but the data suggest otherwise.

      Thank you very much for the reviewer’s insightful comments and for drawing our attention to the need to quantify p-body sizes in DCP1a/b KO and wild-type cells. We agree with the reviewer’s assessment that the representative images suggest a difference in p-body size between DCP1a/b KO cells and wild-type cells, which we initially overlooked. We will revise our manuscript accordingly to include these findings, ensuring that our interpretation of the data aligns with the observed differences.

      Statistical analysis of the Figure 2C results should be included because the difference between the wild-type and Dcp1a/b KO cells with GFP-DCP2 looks significantly different but is interpreted in the text as not significant.

      Thank you for pointing out the need for a statistical analysis of the results shown in Figure 2C. We acknowledge that the visual difference between the wild-type and DCP1a/b KO cells with GFP-DCP2 suggests a significant variation, which may not have been clearly communicated in our text. We will conduct the necessary statistical analysis to substantiate the observations made in Figure 2C. Furthermore, we would like to emphasize that our primary focus was to demonstrate that DCP2 purified from cells retains its activity even in the absence of DCP1. This critical point will be highlighted and clarified in the revised version of our manuscript to prevent any misunderstanding.

      Recommendations for improving the writing and presentation:

      Additional context including what is known about the role of dcp1 in decapping from the decades of work in yeast and other model organisms should be incorporated into the introduction and discussion sections.

      Thank you for the reviewer’s suggestion. We will incorporate additional context about the function and significance of DCP1 in decapping processes within our revised manuscript's introduction and discussion sections.

      Details should be provided within the figure legends and methods section on experimental approaches and the number of replicates and statistical analyses used throughout the manuscript. For example, it is not clear whether western blots or RNA-IP experiments were performed more than once as representative images are shown.

      Thank you for the reviewer’s suggestion. In the figure legends and methods section, we will provide more details about the experimental methods, number of replicates, and statistical analyses. Regarding the Western blots and RNA-IP experiments the reviewer mentioned, we performed multiple experiments and presented representative images in the manuscript. We will clarify this in the revised manuscript to eliminate potential confusion.

      The rationale for performing metabolic profiling is not clear.

      We appreciate the reviewer's thoughtful feedback. The rationale behind conducting metabolic profiling in our study is rooted in its efficacy as a valuable tool for deciphering the consequences of specific gene mutations, particularly those closely associated with phenotypic changes or final metabolic pathways. Our objective is to utilize metabolic profiling to unravel the distinct biofunctions of DCP1a and DCP1b. By employing this approach, we aim to gain insights into the intricate metabolic alterations that result from the absence of these genes, thereby enhancing our understanding of their roles in cellular processes. We recognize the necessity of clearly presenting this rationale and promise to bolster the articulation of these points in the revised version of our manuscript to ensure the clarity and transparency of our research motivation.

      Details in the methods section should be included for the CRISPR/Cas9-mediated gene editing validation. The Sanger sequencing results presented in Figure S1b should be explained. The entire western blot(s) should be shown in Figure S1A to give confidence the Dcp1a/b KO cells are not expressing truncated proteins and the epitopes of the antibodies used to detect Dcp1a/b should be described. The northern blot probes should be described and sequences included. The transcriptomics method should be detailed.

      Thank you for your feedback. In the revised manuscript we will detail the CRISPR/Cas9 gene editing validation, explain the Sanger sequencing results in Figure S1b, show the full Western blot in Figure S1A to confirm that the Dcp1a/b knockout cells are not expressing truncated proteins, describe the Northern blot probes used, and detail the transcriptomics method, all to ensure clarity and comprehensiveness in our experimental procedures and results.

      A diagram showing the RNA tethering assays with labels corresponding to all blots/gels should be provided.

      Thank you for your suggestion. We will provide a diagram showing the RNA tethering assays with labels corresponding to all blots/gels in our revised manuscript. This will help readers better understand our experimental design and results.

      The statement, "This suggests that the disruption of the decapping process in DCP1a/b-knockout cells results in the accumulation of unprocessed mRNA intermediates" regarding the results of the RNA-seq assay is not supported by the evidence as RNA-seq does not measure RNA decay intermediates or RNA decay rates.

      Thank you for the reviewer’s comment. We agree that RNA-seq experiments do not directly measure RNA decay intermediates or RNA decay rates. Our statement could have caused confusion, and we have therefore removed this sentence from the manuscript.

      Minor corrections to the text and figures:

      Figure S6A is uninterpretable as presented.

      Thank you for the reviewer’s valuable feedback. We have taken note and made improvements. We have simplified Figure S6A to enhance its interpretability, hoping that the current version will make it easier for the readers to understand.

    1. Author response:

      The following is the authors’ response to the previous reviews.

      Reviewer #1 (Public Review):

      Original comment: There is no explanation for how this work could be a breakthrough in simulating gregarious feeding as is stated in the manuscript.

      Reviewer response: I think I understand where the authors are trying to take this next step. If the authors were to follow up on this study with the proposed implementation of inhalant/exhalent velocities profiles (or more preferably velocity/pressure fields), then that study would be a breakthrough in simulating such gregarious feeding. Based on what has been done within the present study, I think the term "breakthrough" is instead overly emphatic. An additional note on this. The authors are correct that incorporating additional models could be used to simulate a population (as has been successfully done for several Ediacaran taxa despite computational limitations), but it's not the only way. The authors might explore using periodic boundary conditions on the external faces of the flow domain. This could require only a single Olivooid model to assess gregarious impacts - see the abundant literature of modeling flow through solar array fields.

      We thank Reviewer 1 for the suggestion. Modeling gregarious feeding via periodic boundary conditions is certainly a practical approach when computational resources are limited, and the modeling of flow through solar array fields is an inspiring precedent. However, to realistically simulate gregarious feeding behavior on an uneven seabed with an irregular spatial distribution of organisms, periodic boundary conditions alone may not be sufficient (see Author response image 1 for a simple example). We will continue to explore ways of simulating large-scale gregarious feeding.

      Author response image 1.

      An example of modeling gregarious feeding behavior on an uneven seabed.

      Original comment: The claim that olivooid-type feeding was most likely a prerequisite transitional form to jet-propelled swimming needs much more support or needs to be tailored to olivooids. This suggests that such behavior is absent (or must be convergent) before olivooids, which is at odds with the increasing quantities of pelagic life (whose modes of swimming are admittedly unconstrained) documented from Cambrian and Neoproterozoic deposits. Even among just medusozoans, ancestral state reconstruction suggests that they would have been swimming during the Neoproterozoic (Kayal et al., 2018; BMC Evolutionary Biology) with no knowledge of the mechanics due to absent preservation. Author response: Thanks for your suggestions. Yes, we agree with you that the ancestral swimming medusae may appear before the early Cambrian, even at the Neoproterozoic deposits. However, discussions on the affinities of Ediacaran cnidarians are severely limited because of the lack of information concerning their soft anatomy. So, it is hard to detect the mechanics due to absent preservation. Olivooids found from the basal Cambrian Kuanchuanpu Formation can be reasonably considered as cnidarians based on their radial symmetry, external features, and especially the internal anatomies (Bengtson and Yue 1997; Dong et al. 2013; 2016; Han et al. 2013; 2016; Liu et al. 2014; Wang et al. 2017; 2020; 2022). The valid simulation experiment here was based on the soft tissue preserved in olivooids.

      Reviewer response: This response does not sufficiently address my earlier comment. While the authors are correct that individual Ediacaran affinities are an area of active research and that Olivooids can reasonably be considered cnidarians, this doesn't address the actual critique in my comment. Most (not all) Ediacaran soft-bodied fossils are considered to have been benthic, but pelagic cnidarian life is widely acknowledged to at least be present during later White Sea and Nama assemblages (and earlier depending on molecular clock interpretations). The authors have certainly provided support for the mechanics of this type of feeding being co-opted for eventual jet propulsion swimming in Olivooids. They have not provided sufficient justifications within the manuscript for this to be broadened beyond this group.

      Thank you for your candid comments. We agree that swimming cnidarians may have emerged before the lowermost Cambrian Fortunian Stage. See lines 16-129: “Ediacaran fossil assemblages with complex ecosystems consist of exceptionally preserved soft-bodied eukaryotes of enigmatic morphology, which their affinities are mostly unresolved (Tarhan et al., 2018, Integrative and Comparative Biology, 58 (4), 688–702; Evans et al., 2022, PNAS, 11(46), e220747511).” Olivooids undoubtedly belong to the cnidarians, as characterized by their external and internal biological structures. Limited by the fossil record, we can only speculate on the transition of ancestral cnidarians from a benthic to a swimming lifestyle using well-preserved fossils such as olivooids. The transition may have required processes such as increasing body size, thickening of the mesoglea, and degeneration of the periderm. These processes may have evolved independently or in concert. Moreover, the ecological behaviors of ancestral cnidarians may have evolved independently at different stages from the Ediacaran to the Cambrian. We therefore cannot provide further justification beyond olivooids.

      Original comment: L446: two layers of hexahedral elements is a very low number for meshing boundary layer flow

      Reviewer response: As the authors point out in the main text, these organisms are small (millimeters in scale) and certainly lived within the boundary layer range of the ocean. While the boundary layer is not the main point, it still needs to be accurately resolved as it should certainly affect the flow further towards the far field at this scale. I'm not suggesting the authors need to perfectly resolve the boundary layer or focus on using turbulence models more tailored to boundary layer flows (such as k-w), but the flow field still needs sufficient realism for a boundary bounded flow. The authors really should consider quantitatively assessing the number of hexahedral elements within their mesh refinement study.

      To address this concern, we ran another four simulations based on mesh4 within our mesh refinement study to assess the effect of the number of hexahedral element layers (five and eight layers, each with different boundary layer mesh thicknesses controlled by the thickness adjustment factor). The results have been added to Table supplement 2. As the results show, the number of layers of hexahedral elements does not appear to significantly influence the results, but the thickness of the boundary layer mesh can influence the maximum flow velocity of the contraction phase. Nevertheless, the results of all the simulations were generally consistent, as shown in Author response image 2. A description of these results has been added to the section “Mesh sensitivity analysis”.

      Author response image 2.

      Results of mesh refinement study of different boundary layer mesh parameters.

    1. Author response:

      The following is the authors’ response to the previous reviews

      Overview of reviewer's concerns after peer review: 

      As for the initial submission, the reviewers' unanimous opinion is that the authors should perform additional controls to show that their key findings may not be affected by experimental or analysis artefacts, and clarify key aspects of their core methods, chiefly:  

      (1) The fact that their extremely high decoding accuracy is driven by frequency bands that would reflect the key press movements and that these are located bilaterally in frontal brain regions (with the task being unilateral) are seen as key concerns, 

      The above statement that decoding was driven by bilateral frontal brain regions is not entirely consistent with our results. The confusion was likely caused by the way we originally presented our data in Figure 2. We have revised that figure to make it clearer that decoding performance at both the parcel- (Figure 2B) and voxel-space (Figure 2C) level is predominantly driven by contralateral (as opposed to ipsilateral) sensorimotor regions. Figure 2D, which highlights bilateral sensorimotor and premotor regions, displays accuracy of individual regional voxel-space decoders assessed independently. This was the criterion used to determine which regional voxel-spaces were included in the hybrid-space decoder. This result is not surprising given that motor and premotor regions are known to display adaptive interhemispheric interactions during motor sequence learning [1, 2], and particularly so when the skill is performed with the non-dominant hand [3-5]. We now discuss this important detail in the revised manuscript:

      Discussion (lines 348-353)

      “The whole-brain parcel-space decoder likely emphasized more stable activity patterns in contralateral frontoparietal regions that differed between individual finger movements [21,35], while the regional voxel-space decoder likely incorporated information related to adaptive interhemispheric interactions operating during motor sequence learning [32,36,37], particularly pertinent when the skill is performed with the non-dominant hand [38-40].”

      We now also include new control analyses that directly address the potential contribution of movement-related artefact to the results.  These changes are reported in the revised manuscript as follows:

      Results (lines 207-211):

      “An alternate decoder trained on ICA components labeled as movement or physiological artefacts (e.g. – head movement, ECG, eye movements and blinks; Figure 3 – figure supplement 3A, D) and removed from the original input feature set during the pre-processing stage approached chance-level performance (Figure 4 – figure supplement 3), indicating that the 4-class hybrid decoder results were not driven by task-related artefacts.”

      Results (lines 261-268):

      “As expected, the 5-class hybrid-space decoder performance approached chance levels when tested with randomly shuffled keypress labels (18.41% ± SD 7.4% for Day 1 data; Figure 4 – figure supplement 3C). Task-related eye movements did not explain these results since an alternate 5-class hybrid decoder constructed from three eye movement features (gaze position at the KeyDown event, gaze position 200ms later, and peak eye movement velocity within this window; Figure 4 – figure supplement 3A) performed at chance levels (cross-validated test accuracy = 0.2181; Figure 4 – figure supplement 3B, C).”
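
      The logic of the shuffled-label control can be illustrated with a minimal sketch. It is not the decoding pipeline used in the study: the nearest-centroid classifier, the synthetic two-feature data, and all function names are assumptions made purely for illustration. The sketch shows only that a decoder scored against permuted test labels should fall to roughly 1/k for k balanced classes.

```python
import random
from statistics import mean


def nearest_centroid_fit(X, y):
    """Return {label: centroid} for feature vectors X with labels y."""
    groups = {}
    for features, label in zip(X, y):
        groups.setdefault(label, []).append(features)
    return {label: [mean(col) for col in zip(*rows)] for label, rows in groups.items()}


def nearest_centroid_predict(centroids, X):
    """Assign each vector in X to the label of the closest centroid."""
    def sqdist(a, b):
        return sum((ai - bi) ** 2 for ai, bi in zip(a, b))
    return [min(centroids, key=lambda lab: sqdist(centroids[lab], x)) for x in X]


def accuracy(y_true, y_pred):
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def shuffled_label_accuracy(centroids, X_test, y_test, n_permutations=200, seed=0):
    """Mean accuracy when the held-out labels are randomly permuted."""
    rng = random.Random(seed)
    predictions = nearest_centroid_predict(centroids, X_test)
    scores = []
    for _ in range(n_permutations):
        shuffled = y_test[:]
        rng.shuffle(shuffled)
        scores.append(accuracy(shuffled, predictions))
    return mean(scores)


def make_toy_data(n_per_class, n_classes=5, seed=1):
    """Hypothetical, well-separated synthetic classes (not MEG data)."""
    rng = random.Random(seed)
    X, y = [], []
    for label in range(n_classes):
        for _ in range(n_per_class):
            X.append([rng.gauss(3.0 * label, 0.5), rng.gauss(-2.0 * label, 0.5)])
            y.append(label)
    return X, y


X_train, y_train = make_toy_data(40, seed=1)
X_test, y_test = make_toy_data(40, seed=2)
centroids = nearest_centroid_fit(X_train, y_train)
true_acc = accuracy(y_test, nearest_centroid_predict(centroids, X_test))
chance_acc = shuffled_label_accuracy(centroids, X_test, y_test)
print(f"true labels: {true_acc:.3f}, shuffled labels: {chance_acc:.3f}")
```

      With five balanced classes the shuffled-label score in this toy setting sits near 0.2, whereas the score against the true labels stays high.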

      Discussion (Lines 362-368):

      “Task-related movements—which also express in lower frequency ranges—did not explain these results given the near chance-level performance of alternative decoders trained on (a) artefact-related ICA components removed during MEG preprocessing (Figure 3 – figure supplement 3A-C) and on (b) task-related eye movement features (Figure 4 – figure supplement 3B, C). This explanation is also inconsistent with the minimal average head motion of 1.159 mm (± 1.077 SD) across the MEG recording (Figure 3 – figure supplement 3D).”

      (2) Relatedly, the use of a wide time window (~200 ms) for a 250-330 ms typing speed makes it hard to pinpoint the changes underpinning learning, 

      The revised manuscript now includes analyses carried out with decoding time windows ranging from 50 to 250ms in duration. These additional results are now reported in:

      Results (lines 258-261):

      “The improved decoding accuracy is supported by greater differentiation in neural representations of the index finger keypresses performed at positions 1 and 5 of the sequence (Figure 4A), and by the trial-by-trial increase in 2-class decoding accuracy over early learning (Figure 4C) across different decoder window durations (Figure 4 – figure supplement 2).”

      Results (lines 310-312):

      “Offline contextualization strongly correlated with cumulative micro-offline gains (r = 0.903, R<sup>2</sup> = 0.816, p < 0.001; Figure 5 – figure supplement 1A, inset) across decoder window durations ranging from 50 to 250ms (Figure 5 – figure supplement 1B, C).”
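
      For a simple linear relationship, the coefficient of determination is the square of the Pearson correlation: 0.903 squared is approximately 0.815, which agrees with the reported R<sup>2</sup> = 0.816 to within rounding of r. The sketch below shows that relationship on hypothetical numbers; the function name and the toy values are illustrative assumptions, not data from the study.

```python
import math
from statistics import mean


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


# Hypothetical per-trial values (placeholders, not measurements).
contextualization = [1, 2, 3, 4, 5]
cumulative_gain = [2, 4, 5, 4, 5]

r = pearson_r(contextualization, cumulative_gain)
r_squared = r ** 2  # share of variance explained by a simple linear fit
print(f"r = {r:.3f}, R^2 = {r_squared:.3f}")
```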

      Discussion (lines 382-385):

      “This was further supported by the progressive differentiation of neural representations of the index finger keypress (Figure 4A) and by the robust trial-by-trial increase in 2-class decoding accuracy across time windows ranging between 50 and 250ms (Figure 4C; Figure 4 – figure supplement 2).”

      Discussion (lines 408-9):

      “Offline contextualization consistently correlated with early learning gains across a range of decoding windows (50–250ms; Figure 5 – figure supplement 1).”

      (3) These concerns make it hard to conclude from their data that learning is mediated by "contextualisation" ---a key claim in the manuscript; 

      We believe the revised manuscript now addresses all concerns raised in Editor points 1 and 2.

      (4) The hybrid voxel + parcel space decoder ---a key contribution of the paper--- is not clearly explained; 

      We now provide additional details regarding the hybrid-space decoder approach in the following sections of the revised manuscript:

      Results (lines 158-172):

      “Next, given that the brain simultaneously processes information more efficiently across multiple spatial and temporal scales [28, 32, 33], we asked if the combination of lower resolution whole-brain and higher resolution regional brain activity patterns further improve keypress prediction accuracy. We constructed hybrid-space decoders (N = 1295 ± 20 features; Figure 3A) combining whole-brain parcel-space activity (n = 148 features; Figure 2B) with regional voxel-space activity from a data-driven subset of brain areas (n = 1147 ± 20 features; Figure 2D). This subset covers brain regions showing the highest regional voxel-space decoding performances (top regions across all subjects shown in Figure 2D; Methods – Hybrid Spatial Approach).

      […]

      Note that while features from contralateral brain regions were more important for whole-brain decoding (in both parcel- and voxel-spaces), regional voxel-space decoders performed best for bilateral sensorimotor areas on average across the group. Thus, a multi-scale hybrid-space representation best characterizes the keypress action manifolds.”

      Results (lines 275-282):

      “We used a Euclidian distance measure to evaluate the differentiation of the neural representation manifold of the same action (i.e. - an index-finger keypress) executed within different local sequence contexts (i.e. - ordinal position 1 vs. ordinal position 5; Figure 5). To make these distance measures comparable across participants, a new set of classifiers was then trained with group-optimal parameters (i.e. – broadband hybrid-space MEG data with subsequent manifold extraction (Figure 3 – figure supplement 2) and LDA classifiers (Figure 3 – figure supplement 7) trained on 200ms duration windows aligned to the KeyDown event (see Methods, Figure 3 – figure supplement 5)).”
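
      The distance measure itself can be sketched in a few lines. This is a simplified illustration under stated assumptions: the two-dimensional feature vectors, the use of class centroids, and the function names are hypothetical, and the manifold extraction and LDA steps of the actual analysis are omitted.

```python
import math
from statistics import mean


def centroid(vectors):
    """Component-wise mean of a list of equal-length feature vectors."""
    return [mean(col) for col in zip(*vectors)]


def contextualization(op1_vectors, op5_vectors):
    """Euclidean distance between the mean representations of the same
    keypress performed at ordinal positions 1 and 5."""
    return math.dist(centroid(op1_vectors), centroid(op5_vectors))


# Hypothetical low-dimensional representations of the index-finger keypress.
index_op1 = [[0.0, 0.0], [0.2, 0.0]]
index_op5 = [[3.1, 4.0], [3.1, 4.0]]

distance = contextualization(index_op1, index_op5)
print(f"contextualization distance: {distance:.3f}")
```

      A larger distance indicates greater differentiation between the two contexts; identical representations give a distance of zero.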

      Discussion (lines 341-360):

      “The initial phase of the study focused on optimizing the accuracy of decoding individual finger keypresses from MEG brain activity. Recent work showed that the brain simultaneously processes information more efficiently across multiple—rather than a single—spatial scale(s) [28, 32]. To this effect, we developed a novel hybrid-space approach designed to integrate neural representation dynamics over two different spatial scales: (1) whole-brain parcel-space (i.e. – spatial activity patterns across all cortical brain regions) and (2) regional voxel-space (i.e. – spatial activity patterns within select brain regions) activity. We found consistent spatial differences between whole-brain parcel-space feature importance (predominantly contralateral frontoparietal, Figure 2B) and regional voxel-space decoder accuracy (bilateral sensorimotor regions, Figure 2D). The whole-brain parcel-space decoder likely emphasized more stable activity patterns in contralateral frontoparietal regions that differed between individual finger movements [21, 35], while the regional voxel-space decoder likely incorporated information related to adaptive interhemispheric interactions operating during motor sequence learning [32, 36, 37], particularly pertinent when the skill is performed with the non-dominant hand [38-40]. The observation of increased cross-validated test accuracy (as shown in Figure 3 – Figure Supplement 6) indicates that the spatially overlapping information in parcel- and voxel-space time-series in the hybrid decoder was complementary, rather than redundant [41]. The hybrid-space decoder which achieved an accuracy exceeding 90%—and robustly generalized to Day 2 across trained and untrained sequences— surpassed the performance of both parcel-space and voxel-space decoders and compared favorably to other neuroimaging-based finger movement decoding strategies [6, 24, 42-44].”

      Methods (lines 636-647):

      “Hybrid Spatial Approach. First, we evaluated the decoding performance of each individual brain region in accurately labeling finger keypresses from regional voxel-space (i.e. - all voxels within a brain region as defined by the Desikan-Killiany Atlas) activity. Brain regions were then ranked from 1 to 148 based on their decoding accuracy at the group level. In a stepwise manner, we then constructed a “hybrid-space” decoder by incrementally concatenating regional voxel-space activity of brain regions—starting with the top-ranked region—with whole-brain parcel-level features and assessed decoding accuracy. Subsequently, we added the regional voxel-space features of the second-ranked brain region and continued this process until decoding accuracy reached saturation. The optimal “hybrid-space” input feature set over the group included the 148 parcel-space features and regional voxel-space features from a total of 8 brain regions (bilateral superior frontal, middle frontal, pre-central and post-central; N = 1295 ± 20 features).”
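
      The stepwise construction described above can be sketched as a greedy forward selection. The sketch is schematic and rests on assumptions: the `score` callable stands in for cross-validated decoding accuracy, the region names and toy accuracy gains are hypothetical, and the stopping rule (`min_gain`) is one possible way to operationalize "saturation" rather than the criterion used in the study.

```python
from typing import Callable, List, Sequence


def build_hybrid_feature_set(
    ranked_regions: Sequence[str],
    score: Callable[[List[str]], float],
    min_gain: float = 0.005,
) -> List[str]:
    """Add voxel-space features of ranked regions to the parcel-space
    baseline, one region at a time, until accuracy stops improving."""
    selected: List[str] = []
    best = score(selected)  # accuracy with parcel-space features only
    for region in ranked_regions:
        candidate = selected + [region]
        acc = score(candidate)
        if acc - best < min_gain:  # assumed saturation criterion
            break
        selected, best = candidate, acc
    return selected


# Hypothetical accuracy gains contributed by each region's voxel features.
toy_gain = {"region_a": 0.05, "region_b": 0.03, "region_c": 0.001, "region_d": 0.02}


def toy_score(regions: List[str]) -> float:
    """Stand-in for cross-validated decoding accuracy."""
    return 0.80 + sum(toy_gain[r] for r in regions)


chosen = build_hybrid_feature_set(
    ["region_a", "region_b", "region_c", "region_d"], toy_score
)
print(chosen)
```

      Because regions are visited in rank order and the loop stops at the first region that adds less than `min_gain`, later regions are not evaluated once accuracy has saturated.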

      (5) More controls are needed to show that their decoder approach is capturing a neural representation dedicated to context rather than independent representations of consecutive keypresses; 

      These controls have been implemented and are now reported in the manuscript:

      Results (lines 318-328):

      “Within-subject correlations were consistent with these group-level findings. The average correlation between offline contextualization and micro-offline gains within individuals was significantly greater than zero (Figure 5 – figure supplement 4, left; t = 3.87, p = 0.00035, df = 25, Cohen's d = 0.76) and stronger than correlations between online contextualization and either micro-online (Figure 5 – figure supplement 4, middle; t = 3.28, p = 0.0015, df = 25, Cohen's d = 1.2) or micro-offline gains (Figure 5 – figure supplement 4, right; t = 3.7021, p = 5.3013e-04, df = 25, Cohen's d = 0.69). These findings were not explained by behavioral changes of typing rhythm (t = -0.03, p = 0.976; Figure 5 – figure supplement 5), adjacent keypress transition times (R<sup>2</sup> = 0.00507, F[1,3202] = 16.3; Figure 5 – figure supplement 6), or overall typing speed (between-subject; R<sup>2</sup> = 0.028, p = 0.41; Figure 5 – figure supplement 7).”
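
      For a one-sample test against zero, the t statistic and Cohen's d are linked by t = d·√n. Assuming the first test above is of that form, df = 25 implies n = 26, and t = 3.87 gives d = 3.87/√26 ≈ 0.76. The sketch below computes both quantities for hypothetical within-subject correlation values; the function name and the numbers are illustrative assumptions.

```python
import math
from statistics import mean, stdev


def one_sample_t_and_d(values, mu0=0.0):
    """Return (t, Cohen's d, degrees of freedom) for a one-sample test
    of the mean of `values` against `mu0`."""
    n = len(values)
    d = (mean(values) - mu0) / stdev(values)  # standardized effect size
    t = d * math.sqrt(n)                      # t = d * sqrt(n)
    return t, d, n - 1


# Hypothetical within-subject correlation coefficients (placeholders).
correlations = [0.2, 0.5, 0.1, 0.4, 0.3]
t_stat, cohens_d, df = one_sample_t_and_d(correlations)
print(f"t({df}) = {t_stat:.2f}, d = {cohens_d:.2f}")
```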

      Results (lines 385-390):

      “Further, the 5-class classifier—which directly incorporated information about the sequence location context of each keypress into the decoding pipeline—improved decoding accuracy relative to the 4-class classifier (Figure 4C). Importantly, testing on Day 2 revealed specificity of this representational differentiation for the trained skill but not for the same keypresses performed during various unpracticed control sequences (Figure 5C).”

      Discussion (lines 408-423):

      “Offline contextualization consistently correlated with early learning gains across a range of decoding windows (50–250ms; Figure 5 – figure supplement 1). This result remained unchanged when measuring offline contextualization between the last and second sequence of consecutive trials, inconsistent with a possible confounding effect of pre-planning [30] (Figure 5 – figure supplement 2A). On the other hand, online contextualization did not predict learning (Figure 5 – figure supplement 3). Consistent with these results, the average within-subject correlation between offline contextualization and micro-offline gains was significantly stronger than within-subject correlations between online contextualization and either micro-online or micro-offline gains (Figure 5 – figure supplement 4).

      Offline contextualization was not driven by trial-by-trial behavioral differences, including typing rhythm (Figure 5 – figure supplement 5) and adjacent keypress transition times (Figure 5 – figure supplement 6) nor by between-subject differences in overall typing speed (Figure 5 – figure supplement 7)—ruling out a reliance on differences in the temporal overlap of keypresses. Importantly, offline contextualization documented on Day 1 stabilized once a performance plateau was reached (trials 11-36), and was retained on Day 2, documenting overnight consolidation of the differentiated neural representations.”

      (6) The need to show more convincingly that their data is not affected by head movements, e.g., by regressing out signal components that are correlated with the fiducial signal;  

      We now include data in Figure 3 – figure supplement 3D showing that head movement was minimal in all participants (mean of 1.159 mm ± 1.077 SD).  Further, the requested additional control analyses have been carried out and are reported in the revised manuscript:

      Results (lines 204-211):

      “Testing the keypress state (4-class) hybrid decoder performance on Day 1 after randomly shuffling keypress labels for held-out test data resulted in a performance drop approaching expected chance levels (22.12% ± SD 9.1%; Figure 3 – figure supplement 3C). An alternate decoder trained on ICA components labeled as movement or physiological artefacts (e.g. – head movement, ECG, eye movements and blinks; Figure 3 – figure supplement 3A, D) and removed from the original input feature set during the pre-processing stage approached chance-level performance (Figure 4 – figure supplement 3), indicating that the 4-class hybrid decoder results were not driven by task-related artefacts.”

      Results (lines 261-268):

      “As expected, the 5-class hybrid-space decoder performance approached chance levels when tested with randomly shuffled keypress labels (18.41% ± SD 7.4% for Day 1 data; Figure 4 – figure supplement 3C). Task-related eye movements did not explain these results since an alternate 5-class hybrid decoder constructed from three eye movement features (gaze position at the KeyDown event, gaze position 200ms later, and peak eye movement velocity within this window; Figure 4 – figure supplement 3A) performed at chance levels (cross-validated test accuracy = 0.2181; Figure 4 – figure supplement 3B, C).”

      Discussion (Lines 362-368):

      “Task-related movements—which also express in lower frequency ranges—did not explain these results given the near chance-level performance of alternative decoders trained on (a) artefact-related ICA components removed during MEG preprocessing (Figure 3 – figure supplement 3A-C) and on (b) task-related eye movement features (Figure 4 – figure supplement 3B, C). This explanation is also inconsistent with the minimal average head motion of 1.159 mm (± 1.077 SD) across the MEG recording (Figure 3 – figure supplement 3D).”

      (7) The offline neural representation analysis as executed is a bit odd, since it seems to be based on comparing the last key press to the first key press of the next sequence, rather than focus on the inter-sequence interval

      While we previously evaluated replay of skill sequences during rest intervals, identification of how offline reactivation patterns of a single keypress state representation evolve with learning presents non-trivial challenges. First, replay events tend to occur in clusters with irregular temporal spacing, as previously shown by our group and others. Second, replay of experienced sequences is intermixed with replay of sequences that have never been experienced but are possible. Finally, and perhaps the most significant issue, replay is temporally compressed up to 20x with respect to the behavior [6]. That means our decoders would need to accurately evaluate spatial pattern changes related to individual keypresses over much smaller time windows (i.e. - less than 10 ms) than evaluated here. This future work, which is undoubtedly of great interest to our research group, will require more substantial tool development before it can be applied to this question. We now articulate this future direction in the Discussion:

      Discussion (lines 423-427):

      “A possible neural mechanism supporting contextualization could be the emergence and stabilization of conjunctive “what–where” representations of procedural memories [64] with the corresponding modulation of neuronal population dynamics [65, 66] during early learning. Exploring the link between contextualization and neural replay could provide additional insights into this issue [6, 12, 13, 15].”

      (8) And this analysis could be confounded by the fact that they are comparing the last element in a sequence vs the first movement in a new one. 

      We have now addressed this control analysis in the revised manuscript:

      Results (Lines 310-316)

      “Offline contextualization strongly correlated with cumulative micro-offline gains (r = 0.903, R<sup>2</sup> = 0.816, p < 0.001; Figure 5 – figure supplement 1A, inset) across decoder window durations ranging from 50 to 250ms (Figure 5 – figure supplement 1B, C). The offline contextualization between the final sequence of each trial and the second sequence of the subsequent trial (excluding the first sequence) yielded comparable results. This indicates that pre-planning at the start of each practice trial did not directly influence the offline contextualization measure [30] (Figure 5 – figure supplement 2A, 1st vs. 2nd Sequence approaches).”

      Discussion (lines 408-416):

      “Offline contextualization consistently correlated with early learning gains across a range of decoding windows (50–250ms; Figure 5 – figure supplement 1). This result remained unchanged when measuring offline contextualization between the last and second sequence of consecutive trials, inconsistent with a possible confounding effect of pre-planning [30] (Figure 5 – figure supplement 2A). On the other hand, online contextualization did not predict learning (Figure 5 – figure supplement 3). Consistent with these results the average within-subject correlation between offline contextualization and micro-offline gains was significantly stronger than within-subject correlations between online contextualization and either micro-online or micro-offline gains (Figure 5 – figure supplement 4).”

      It also seems to be the case that many analyses suggested by the reviewers in the first round of revisions that could have helped strengthen the manuscript have not been included (they are only in the rebuttal). Moreover, some of the control analyses mentioned in the rebuttal seem not to be described anywhere, neither in the manuscript, nor in the rebuttal itself; please double check that. 

      All suggested analyses carried out and mentioned are now in the revised manuscript.

      eLife Assessment 

      This valuable study investigates how the neural representation of individual finger movements changes during the early period of sequence learning. By combining a new method for extracting features from human magnetoencephalography data and decoding analyses, the authors provide incomplete evidence of an early, swift change in the brain regions correlated with sequence learning…

      We have now included all the requested control analyses supporting “an early, swift change in the brain regions correlated with sequence learning”:

      The addition of more control analyses to rule out that head movement artefacts influence the findings, 

      We now include data in Figure 3 – figure supplement 3D showing that head movement was minimal in all participants (mean of 1.159 mm ± 1.077 SD).  Further, we have implemented the requested additional control analyses addressing this issue:

      Results (lines 207-211):

      “An alternate decoder trained on ICA components labeled as movement or physiological artefacts (e.g. – head movement, ECG, eye movements and blinks; Figure 3 – figure supplement 3A, D) and removed from the original input feature set during the pre-processing stage approached chance-level performance (Figure 4 – figure supplement 3), indicating that the 4-class hybrid decoder results were not driven by task-related artefacts.”

      Results (lines 261-268):

      “As expected, the 5-class hybrid-space decoder performance approached chance levels when tested with randomly shuffled keypress labels (18.41% ± SD 7.4% for Day 1 data; Figure 4 – figure supplement 3C). Task-related eye movements did not explain these results since an alternate 5-class hybrid decoder constructed from three eye movement features (gaze position at the KeyDown event, gaze position 200ms later, and peak eye movement velocity within this window; Figure 4 – figure supplement 3A) performed at chance levels (cross-validated test accuracy = 0.2181; Figure 4 – figure supplement 3B, C).”

      Discussion (Lines 362-368):

      “Task-related movements—which also express in lower frequency ranges—did not explain these results given the near chance-level performance of alternative decoders trained on (a) artefact-related ICA components removed during MEG preprocessing (Figure 3 – figure supplement 3A-C) and on (b) task-related eye movement features (Figure 4 – figure supplement 3B, C). This explanation is also inconsistent with the minimal average head motion of 1.159 mm (± 1.077 SD) across the MEG recording (Figure 3 – figure supplement 3D).”

      and to further explain the proposal of offline contextualization during short rest periods as the basis for improvement performance would strengthen the manuscript. 

      We have edited the manuscript to clarify that the degree of representational differentiation (contextualization) parallels skill learning.  We have no evidence at this point to indicate that “offline contextualization during short rest periods is the basis for improvement in performance”.  The following areas of the revised manuscript now clarify this point:  

      Summary (Lines 455-458):

      “In summary, individual sequence action representations contextualize during early learning of a new skill and the degree of differentiation parallels skill gains. Differentiation of the neural representations developed during rest intervals of early learning to a larger extent than during practice in parallel with rapid consolidation of skill.”

      Additional control analyses are also provided supporting a link between offline contextualization and early learning:

      Results (lines 302-318):

      “The Euclidian distance between neural representations of Index<sub>OP1</sub> (i.e. - index finger keypress at ordinal position 1 of the sequence) and Index<sub>OP5</sub> (i.e. - index finger keypress at ordinal position 5 of the sequence) increased progressively during early learning (Figure 5A)—predominantly during rest intervals (offline contextualization) rather than during practice (online) (t = 4.84, p < 0.001, df = 25, Cohen's d = 1.2; Figure 5B; Figure 5 – figure supplement 1A). An alternative online contextualization determination equaling the time interval between online and offline comparisons (Trial-based; 10 seconds between Index<sub>OP1</sub> and Index<sub>OP5</sub> observations in both cases) rendered a similar result (Figure 5 – figure supplement 2B).

      Offline contextualization strongly correlated with cumulative micro-offline gains (r = 0.903, R<sup>2</sup> = 0.816, p < 0.001; Figure 5 – figure supplement 1A, inset) across decoder window durations ranging from 50 to 250ms (Figure 5 – figure supplement 1B, C). The offline contextualization between the final sequence of each trial and the second sequence of the subsequent trial (excluding the first sequence) yielded comparable results. This indicates that pre-planning at the start of each practice trial did not directly influence the offline contextualization measure [30] (Figure 5 – figure supplement 2A, 1st vs. 2nd Sequence approaches). Conversely, online contextualization (using either measurement approach) did not explain early online learning gains (i.e. – Figure 5 – figure supplement 3).”  

      Public Reviews: 

      Reviewer #1 (Public review): 

      Summary: 

      This study addresses the issue of rapid skill learning and whether individual sequence elements (here: finger presses) are differentially represented in human MEG data. The authors use a decoding approach to classify individual finger elements and accomplish an accuracy of around 94%. A relevant finding is that the neural representations of individual finger elements dynamically change over the course of learning. This would be highly relevant for any attempts to develop better brain machine interfaces - one now can decode individual elements within a sequence with high precision, but these representations are not static but develop over the course of learning. 

      Strengths: 

      The work follows a large body of work from the same group on the behavioural and neural foundations of sequence learning. The behavioural task is well established and neatly designed to allow for tracking learning and how individual sequence elements contribute. The inclusion of short offline rest periods between learning epochs has been influential because it has revealed that a lot, if not most of the gains in behaviour (ie speed of finger movements) occur in these so-called micro-offline rest periods. 

      The authors use a range of new decoding techniques, and exhaustively interrogate their data in different ways, using different decoding approaches. Regardless of the approach, impressively high decoding accuracies are observed, but when using a hybrid approach that combines the MEG data in different ways, the authors observe decoding accuracies of individual sequence elements from the MEG data of up to 94%. 

      Weaknesses:  

      A formal analysis and quantification of how head movement may have contributed to the results should be included in the paper or supplemental material. The type of correlated head movements coming from vigorous key presses aren't necessarily visible to the naked eye, and even if arms etc are restricted, this will not preclude shoulder, neck or head movement necessarily; if ICA was conducted, for example, the authors are in the position to show the components that relate to such movement; but eye-balling the data would not seem sufficient. The related issue of eye movements is addressed via classifier analysis. A formal analysis which directly accounts for finger/eye movements in the same analysis as the main result (ie any variance related to these factors) should be presented.

      We now present additional data related to head (Figure 3 – figure supplement 3; note that average measured head movement across participants was 1.159 mm ± 1.077 SD) and eye movements (Figure 4 – figure supplement 3) and have implemented the requested control analyses addressing this issue. They are reported in the revised manuscript in the following locations: Results (lines 207-211), Results (lines 261-268), Discussion (Lines 362-368).

      This reviewer recommends inclusion of a formal analysis that the intra-vs inter parcels are indeed completely independent. For example, the authors state that the inter-parcel features reflect "lower spatially resolved whole-brain activity patterns or global brain dynamics". A formal quantitative demonstration that the signals indeed show "complete independence" (as claimed by the authors) and are orthogonal would be helpful.

      Please note that we never claim in the manuscript that the parcel-space and regional voxel-space features show “complete independence”. More importantly, input feature orthogonality is not a requirement for the machine learning-based decoding methods utilized in the present study while non-redundancy is [7] (a requirement satisfied by our data, see below). Finally, our results show that the hybrid-space decoder out-performed all other methods even after input features were fully orthogonalized with LDA (the procedure used in all contextualization analyses) or PCA dimensionality reduction procedures prior to the classification step (Figure 3 – figure supplement 2).

      Relevant to this issue, please note that if spatially overlapping parcel- and voxel-space time-series only provided redundant information, inclusion of both as input features should increase model over-fitting to the training dataset and decrease overall cross-validated test accuracy [8]. In the present study, however, we see the opposite effect on decoder performance. First, Figure 3 – figure supplements 1 and 2 clearly show that decoders constructed from hybrid-space features outperform the other input feature (sensor-, whole-brain parcel- and whole-brain voxel-) spaces in every case (e.g. – wideband, all narrowband frequency ranges, and even after the input space is fully orthogonalized through dimensionality reduction procedures prior to the decoding step). Furthermore, Figure 3 – figure supplement 6 shows that hybrid-space decoder performance suffers when parcel time-series that spatially overlap with the included regional voxel-spaces are removed from the input feature set. 

      We state in the Discussion (lines 353-356)

      “The observation of increased cross-validated test accuracy (as shown in Figure 3 – Figure Supplement 6) indicates that the spatially overlapping information in parcel- and voxel-space time-series in the hybrid decoder was complementary, rather than redundant [41].”

      To gain insight into the complementary information contributed by the two spatial scales to the hybrid-space decoder, we first independently computed the matrix rank for whole-brain parcel- and voxel-space input features for each participant (shown in Author response image 1). The results indicate that whole-brain parcel-space input features are full rank (rank = 148) for all participants (i.e. - MEG activity is orthogonal between all parcels). The matrix rank of voxel-space input features (rank = 267 ± 17 SD) exceeded the parcel-space rank for all participants and approached the number of useable MEG sensor channels (n = 272). Thus, voxel-space features provide both additional and complementary information to representations at the parcel-space scale.

      Author response image 1.

      Matrix rank computed for whole-brain parcel- and voxel-space time-series in individual subjects across the training run. The results indicate that whole-brain parcel-space input features are full rank (rank = 148) for all participants (i.e. - MEG activity is orthogonal between all parcels). The matrix rank of voxel-space input features (rank = 267 ± 17 SD), on the other hand, approached the number of useable MEG sensor channels (n = 272). Although not full rank, the voxel-space rank exceeded the parcel-space rank for all participants. Thus, some voxel-space features provide additional orthogonal information to representations at the parcel-space scale.  An expression of this is shown in the correlation distribution between parcel and constituent voxel time-series in Figure 2—figure Supplement 2.
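
      As a rough illustration of the rank comparison described above, the sketch below computes the rank of a small feature-by-time matrix using Gaussian elimination. It is a minimal sketch under stated assumptions, not the analysis code used in the study: the function name `matrix_rank`, the absolute tolerance, and the toy matrices are all hypothetical, and a real pipeline would more likely rely on an SVD-based routine such as `numpy.linalg.matrix_rank`. Note that full rank means no feature time-series is a linear combination of the others.

```python
def matrix_rank(rows, tol=1e-9):
    """Rank of a matrix (given as a list of rows) via Gaussian elimination.

    Hypothetical helper for illustration; `tol` is an assumed absolute
    threshold below which a pivot is treated as zero.
    """
    m = [list(map(float, r)) for r in rows]
    n_rows = len(m)
    n_cols = len(m[0]) if m else 0
    rank = 0
    for col in range(n_cols):
        if rank == n_rows:
            break
        # Partial pivoting: pick the remaining row with the largest entry.
        pivot = max(range(rank, n_rows), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) <= tol:
            continue  # no usable pivot: this column adds no new dimension
        m[rank], m[pivot] = m[pivot], m[rank]
        for r in range(rank + 1, n_rows):
            factor = m[r][col] / m[rank][col]
            for c in range(col, n_cols):
                m[r][c] -= factor * m[rank][c]
        rank += 1
    return rank


# Toy "feature x time" matrices (made-up values, not study data).
independent = [
    [1.0, 0.0, 2.0, 1.0],
    [0.0, 1.0, 1.0, 3.0],
    [1.0, 1.0, 0.0, 2.0],
]
# Third row is the sum of the first two, so it is linearly dependent.
redundant = [
    [1.0, 0.0, 2.0, 1.0],
    [0.0, 1.0, 1.0, 3.0],
    [1.0, 1.0, 3.0, 4.0],
]
rank_full = matrix_rank(independent)
rank_deficient = matrix_rank(redundant)
```

      In this toy case the first matrix is full rank, while the second loses one dimension because one feature is fully predictable from the others.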

      Figure 2—figure Supplement 2 in the revised manuscript now shows that the degree of dependence between the two spatial scales varies over the regional voxel-space. That is, some voxels within a given parcel correlate strongly with the time-series of the parcel they belong to, while others do not. This finding is consistent with a documented increase in correlational structure of neural activity across spatial scales that does not reflect perfect dependency or orthogonality [9]. Notably, the regional voxel-spaces included in the hybrid-space decoder are significantly less correlated with the averaged parcel-space time-series than excluded voxels. We now point readers to this new figure in the results.
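
      The parcel-to-voxel dependence described above can be sketched as a set of Pearson correlations between each voxel time-series and the parcel average. This is a minimal sketch under stated assumptions: the helper names and toy values are hypothetical, and the parcel time-series is assumed here to be the plain mean of its constituent voxels (so each voxel contributes to the average it is compared against, which slightly inflates the correlations).

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson correlation between two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def parcel_voxel_correlations(voxels):
    """Correlate each voxel time-series with the mean across voxels.

    `voxels` is a list of time-series, one per voxel (hypothetical layout).
    """
    n_time = len(voxels[0])
    parcel = [sum(v[t] for v in voxels) / len(voxels) for t in range(n_time)]
    return [pearson_r(v, parcel) for v in voxels]


# Toy voxel time-series within one parcel (made-up values, not study data).
voxels = [
    [1.0, 2.0, 3.0, 4.0, 5.0],    # follows the dominant trend
    [2.0, 4.0, 6.0, 8.0, 10.0],   # scaled copy of the first voxel
    [3.0, 1.0, 4.0, 1.0, 5.0],    # only partly follows the trend
]
corrs = parcel_voxel_correlations(voxels)
```

      In this toy parcel the first two voxels track the parcel average closely while the third does so less well, mirroring the point that dependence on the parcel signal varies from voxel to voxel.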

      Taken together, these results indicate that the multi-scale information in the hybrid feature set is complementary rather than orthogonal. This is consistent with the idea that hybrid-space features better represent multi-scale temporospatial dynamics reported to be a fundamental characteristic of how the brain stores and adapts memories, and generates behavior across species [9].

      Reviewer #2 (Public review): 

      Summary: 

      The current paper consists of two parts. The first part is the rigorous feature optimization of the MEG signal to decode individual finger identity performed in a sequence (4-1-3-2-4; 1~4 corresponds to little~index fingers of the left hand). By optimizing various parameters for the MEG signal, in terms of (i) reconstructed source activity in voxel- and parcel-level resolution and their combination, (ii) frequency bands, and (iii) time window relative to press onset for each finger movement, as well as the choice of decoders, the resultant "hybrid decoder" achieved extremely high decoding accuracy (~95%). This part seems driven almost by pure engineering interest in gaining as high decoding accuracy as possible. 

      In the second part of the paper, armed with the successful 'hybrid decoder,' the authors asked more scientific questions about how neural representation of individual finger movement that is embedded in a sequence, changes during a very early period of skill learning and whether and how such representational change can predict skill learning. They assessed the difference in MEG feature patterns between the first and the last press 4 in sequence 41324 at each training trial and found that the pattern differentiation progressively increased over the course of early learning trials. Additionally, they found that this pattern differentiation specifically occurred during the rest period rather than during the practice trial. With a significant correlation between the trial-by-trial profile of this pattern differentiation and that for accumulation of offline learning, the authors argue that such "contextualization" of finger movement in a sequence (e.g., what-where association) underlies the early improvement of sequential skill. This is an important and timely topic for the field of motor learning and beyond. 

      Strengths: 

      Each part has its own strength. For the first part, the use of temporally rich neural information (MEG signal) has a significant advantage over previous studies testing sequential representations using fMRI. This allowed the authors to examine the earliest period (= the first few minutes of training) of skill learning with finer temporal resolution. Through the optimization of MEG feature extraction, the current study achieved extremely high decoding accuracy (approx. 94%) compared to previous works. For the second part, the finding of the early "contextualization" of the finger movement in a sequence and its correlation to early (offline) skill improvement is interesting and important. The comparison between "online" and "offline" pattern distance is a neat idea. 

      Weaknesses: 

      Despite the strengths raised, the specific goal for each part of the current paper, i.e., achieving high decoding accuracy and answering the scientific question of early skill learning, seems not to harmonize with each other very well. In short, the current approach, which is solely optimized for achieving high decoding accuracy, does not provide enough support and interpretability for the paper's interesting scientific claim. This reminds me of the accuracy-explainability tradeoff in machine learning studies (e.g., Linardatos et al., 2020). More details follow. 

      There are a number of different neural processes occurring before and after a key press, such as planning of upcoming movement and ahead around premotor/parietal cortices, motor command generation in primary motor cortex, sensory feedback related processes in sensory cortices, and performance monitoring/evaluation around the prefrontal area. Some of these may show learning-dependent change and others may not.  

      In this paper, the focus as stated in the Introduction was to evaluate “the millisecond-level differentiation of discrete action representations during learning”, a proposal that first required the development of more accurate computational tools.  Our first step, reported here, was to develop that tool. With that in hand, we then proceeded to test if neural representations differentiated during early skill learning. Our results showed they did.  Addressing the question the Reviewer asks is part of exciting future work, now possible based on the results presented in this paper.  We acknowledge this issue in the revised Discussion:  

      Discussion (Lines 428-434):

      “In this study, classifiers were trained on MEG activity recorded during or immediately after each keypress, emphasizing neural representations related to action execution, memory consolidation and recall over those related to planning. An important direction for future research is determining whether separate decoders can be developed to distinguish the representations or networks separately supporting these processes. Ongoing work in our lab is addressing this question. The present accuracy results across varied decoding window durations and alignment with each keypress action support the feasibility of this approach (Figure 3—figure supplement 5).”

      Given the use of whole-brain MEG features with a wide time window (up to ~200 ms after each key press) under the situation of 3~4 Hz (i.e., 250~330 ms press interval) typing speed, these different processes in different brain regions could have contributed to the expression of the "contextualization," making it difficult to interpret what really contributed to the "contextualization" and whether it is learning related. Critically, the majority of data used for decoder training has the chance of such potential overlap of signal, as the typing speed almost reached a plateau already at the end of the 11th trial and stayed until the 36th trial. Thus, the decoder could have relied on such overlapping features related to the future presses. If that is the case, a gradual increase in "contextualization" (pattern separation) during earlier trials makes sense, simply because the temporal overlap of the MEG feature was insufficient for the earlier trials due to slower typing speed. Two direct ways to address the above concern, at some cost to decoding accuracy, would be to use a shorter temporal window for the MEG features or to train the model with the early learning period data only (trials 1 through 11), and then check whether the main results are unaffected.

      We now include additional analyses carried out with decoding time windows ranging from 50 to 250ms in duration, which have been added to the revised manuscript as follows: 

      Results (lines 258-261):

      “The improved decoding accuracy is supported by greater differentiation in neural representations of the index finger keypresses performed at positions 1 and 5 of the sequence (Figure 4A), and by the trial-by-trial increase in 2-class decoding accuracy over early learning (Figure 4C) across different decoder window durations (Figure 4 – figure supplement 2).”

      Results (lines 310-312):

      “Offline contextualization strongly correlated with cumulative micro-offline gains (r = 0.903, R<sup>2</sup> = 0.816, p < 0.001; Figure 5 – figure supplement 1A, inset) across decoder window durations ranging from 50 to 250ms (Figure 5 – figure supplement 1B, C).”

      Discussion (lines 382-385):

      “This was further supported by the progressive differentiation of neural representations of the index finger keypress (Figure 4A) and by the robust trial-by-trial increase in 2-class decoding accuracy across time windows ranging between 50 and 250ms (Figure 4C; Figure 4 – figure supplement 2).”

      Discussion (lines 408-9):

      “Offline contextualization consistently correlated with early learning gains across a range of decoding windows (50–250ms; Figure 5 – figure supplement 1).”

      Several new control analyses are also provided addressing the question of overlapping keypresses:

      Reviewer #3 (Public review):

      Summary: 

      One goal of this paper is to introduce a new approach for highly accurate decoding of finger movements from human magnetoencephalography data via dimension reduction of a "multi-scale, hybrid" feature space. Following this decoding approach, the authors aim to show that early skill learning involves "contextualization" of the neural coding of individual movements, relative to their position in a sequence of consecutive movements.

      Furthermore, they aim to show that this "contextualization" develops primarily during short rest periods interspersed with skill training and correlates with a performance metric which the authors interpret as an indicator of offline learning. 

      Strengths: 

      A strength of the paper is the innovative decoding approach, which achieves impressive decoding accuracies via dimension reduction of a "multi-scale, hybrid space". This hybrid-space approach follows the neurobiologically plausible idea of concurrent distribution of neural coding across local circuits as well as large-scale networks. A further strength of the study is the large number of tested dimension reduction techniques and classifiers.

      Weaknesses: 

      A clear weakness of the paper lies in the authors' conclusions regarding "contextualization". Several potential confounds, which partly arise from the experimental design (mainly the use of a single sequence) and which are described below, question the neurobiological implications proposed by the authors and provide a simpler explanation of the results. Furthermore, the paper follows the assumption that short breaks result in offline skill learning, while recent evidence, described below, casts doubt on this assumption.  

      Please, see below for detailed response to each of these points.

      Specifically: The authors interpret the ordinal position information captured by their decoding approach as a reflection of neural coding dedicated to the local context of a movement (Figure 4). One way to dissociate ordinal position information from information about the moving effectors is to train a classifier on one sequence and test the classifier on other sequences that require the same movements, but in different positions (Kornysheva et al., Neuron 2019). In the present study, however, participants trained to repeat a single sequence (4-1-3-2-4).

      A crucial difference between our present study and the elegant study from Kornysheva et al. (2019) in Neuron highlighted by the Reviewer is that while ours is a learning study, the Kornysheva et al. study is not. Kornysheva et al. included an initial separate behavioral training session (i.e. – performed outside of the MEG) during which participants learned associations between fractal image patterns and different keypress sequences. Then in a separate, later MEG session—after the stimulus-response associations had been already learned in the first session—participants were tasked with recalling the learned sequences in response to a presented visual cue (i.e. – the paired fractal pattern). 

      Our rationale for not including multiple sequences in the same Day 1 training session of our study design was that it would lead to prominent interference effects, as widely reported in the literature [10-12].  Thus, while we had to take the issue of interference into consideration for our design, the Kornysheva et al. study did not. While Kornysheva et al. aimed to “dissociate ordinal position information from information about the moving effectors”, we tested various untrained sequences on Day 2 allowing us to determine that the contextualization result was specific to the trained sequence. By using this approach, we avoided interference effects on the learning of the primary skill caused by simultaneous acquisition of a second skill.

      The revised manuscript states our findings related to the Day 2 Control data in the following locations:

      Results (lines 117-122):

      “On the following day, participants were retested on performance of the same sequence (4-1-3-2-4) over 9 trials (Day 2 Retest), as well as on the single-trial performance of 9 different untrained control sequences (Day 2 Controls: 2-1-3-4-2, 4-2-4-3-1, 3-4-2-3-1, 1-4-3-4-2, 3-2-4-3-1, 1-4-2-3-1, 3-2-4-2-1, 3-2-1-4-2, and 4-2-3-1-4). As expected, an upward shift in performance of the trained sequence (0.68 ± SD 0.56 keypresses/s; t = 7.21, p < 0.001) was observed during Day 2 Retest, indicative of an overnight skill consolidation effect (Figure 1 – figure supplement 1A).”

      Results (lines 212-219):

      “Utilizing the highest performing decoders that included LDA-based manifold extraction, we assessed the robustness of hybrid-space decoding over multiple sessions by applying it to data collected on the following day during the Day 2 Retest (9-trial retest of the trained sequence) and Day 2 Control (single-trial performance of 9 different untrained sequences) blocks. The decoding accuracy for Day 2 MEG data remained high (87.11% ± SD 8.54% for the trained sequence during Retest, and 79.44% ± SD 5.54% for the untrained Control sequences; Figure 3 – figure supplement 4). Thus, index finger classifiers constructed using the hybrid decoding approach robustly generalized from Day 1 to Day 2 across trained and untrained keypress sequences.”

      Results (lines 269-273):

      “On Day 2, incorporating contextual information into the hybrid-space decoder enhanced classification accuracy for the trained sequence only (improving from 87.11% for 4-class to 90.22% for 5-class), while performing at or below-chance levels for the Control sequences (≤ 30.22% ± SD 0.44%). Thus, the accuracy improvements resulting from inclusion of contextual information in the decoding framework were specific to the trained skill sequence.”

      As a result, ordinal position information is potentially confounded by the fixed finger transitions around each of the two critical positions (first and fifth press). Across consecutive correct sequences, the first keypress in a given sequence was always preceded by a movement of the index finger (=last movement of the preceding sequence), and followed by a little finger movement. The last keypress, on the other hand, was always preceded by a ring finger movement, and followed by an index finger movement (=first movement of the next sequence). Figure 4 - supplement 2 shows that finger identity can be decoded with high accuracy (>70%) across a large time window around the time of the keypress, up to at least +/-100 ms (and likely beyond, given that decoding accuracy is still high at the boundaries of the window depicted in that figure). This time window approaches the keypress transition times in this study. Given that distinct finger transitions characterized the first and fifth keypress, the classifier could thus rely on persistent (or "lingering") information from the preceding finger movement, and/or "preparatory" information about the subsequent finger movement, in order to dissociate the first and fifth keypress. 

      Currently, the manuscript provides little evidence that the context information captured by the decoding approach is more than a by-product of temporally extended, and therefore overlapping, but independent neural representations of consecutive keypresses that are executed in close temporal proximity - rather than a neural representation dedicated to context. 

      During the review process, the authors pointed out that a "mixing" of temporally overlapping information from consecutive keypresses, as described above, should result in systematic misclassifications and therefore be detectable in the confusion matrices in Figures 3C and 4B, which indeed do not provide any evidence that consecutive keypresses are systematically confused. However, such absence of evidence (of systematic misclassification) should be interpreted with caution, and, of course, provides no evidence of absence. The authors also pointed out that such "mixing" would hamper the discriminability of the two ordinal positions of the index finger, given that "ordinal position 5" is systematically followed by "ordinal position 1". This is a valid point which, however, cannot rule out that "contextualization" nevertheless reflects the described "mixing".

      The revised manuscript contains several control analyses which rule out this potential confound.

      Results (lines 318-328):

      “Within-subject correlations were consistent with these group-level findings. The average correlation between offline contextualization and micro-offline gains within individuals was significantly greater than zero (Figure 5 – figure supplement 4, left; t = 3.87, p = 0.00035, df = 25, Cohen's d = 0.76) and stronger than correlations between online contextualization and either micro-online (Figure 5 – figure supplement 4, middle; t = 3.28, p = 0.0015, df = 25, Cohen's d = 1.2) or micro-offline gains (Figure 5 – figure supplement 4, right; t = 3.7021, p = 5.3013e-04, df = 25, Cohen's d = 0.69). These findings were not explained by behavioral changes of typing rhythm (t = -0.03, p = 0.976; Figure 5 – figure supplement 5), adjacent keypress transition times (R<sup>2</sup> = 0.00507, F[1,3202] = 16.3; Figure 5 – figure supplement 6), or overall typing speed (between-subject; R<sup>2</sup> = 0.028, p = 0.41; Figure 5 – figure supplement 7).”

      Results (lines 385-390):

      “Further, the 5-class classifier—which directly incorporated information about the sequence location context of each keypress into the decoding pipeline—improved decoding accuracy relative to the 4-class classifier (Figure 4C). Importantly, testing on Day 2 revealed specificity of this representational differentiation for the trained skill but not for the same keypresses performed during various unpracticed control sequences (Figure 5C).”

      Discussion (lines 408-423):

      “Offline contextualization consistently correlated with early learning gains across a range of decoding windows (50–250ms; Figure 5 – figure supplement 1). This result remained unchanged when measuring offline contextualization between the last and second sequence of consecutive trials, inconsistent with a possible confounding effect of pre-planning [30] (Figure 5 – figure supplement 2A). On the other hand, online contextualization did not predict learning (Figure 5 – figure supplement 3). Consistent with these results the average within-subject correlation between offline contextualization and micro-offline gains was significantly stronger than within subject correlations between online contextualization and either micro-online or micro-offline gains (Figure 5 – figure supplement 4). 

      Offline contextualization was not driven by trial-by-trial behavioral differences, including typing rhythm (Figure 5 – figure supplement 5) and adjacent keypress transition times (Figure 5 – figure supplement 6) nor by between-subject differences in overall typing speed (Figure 5 – figure supplement 7)—ruling out a reliance on differences in the temporal overlap of keypresses. Importantly, offline contextualization documented on Day 1 stabilized once a performance plateau was reached (trials 11-36), and was retained on Day 2, documenting overnight consolidation of the differentiated neural representations.”

      During the review process, the authors responded to my concern that training of a single sequence introduces the potential confound of "mixing" described above, which could have been avoided by training on several sequences, as in Kornysheva et al. (Neuron 2019), by arguing that Day 2 in their study did include control sequences. However, the authors' findings regarding these control sequences are fundamentally different from the findings in Kornysheva et al. (2019), and do not provide any indication of effector-independent ordinal information in the described contextualization - but, actually, the contrary. In Kornysheva et al. (Neuron 2019), ordinal, or positional, information refers purely to the rank of a movement in a sequence. In line with the idea of competitive queuing, Kornysheva et al. (2019) have shown that humans prepare for a motor sequence via a simultaneous representation of several of the upcoming movements, weighted by their rank in the sequence. Importantly, they could show that this gradient carries information that is largely devoid of information about the order of specific effectors involved in a sequence, or their timing, in line with competitive queuing. They showed this by training a classifier to discriminate between the five consecutive movements that constituted one specific sequence of finger movements (five classes: 1st, 2nd, 3rd, 4th, 5th movement in the sequence) and then testing whether that classifier could identify the rank (1st, 2nd, 3rd, etc) of movements in another sequence, in which the fingers moved in a different order, and with different timings. Importantly, this approach demonstrated that the graded representations observed during preparation were largely maintained after this cross decoding, indicating that the sequence was represented via ordinal position information that was largely devoid of information about the specific effectors or timings involved in sequence execution. 
This result differs completely from the findings in the current manuscript. Dash et al. report a drop in detected ordinal position information (degree of contextualization in figure 5C) when testing for contextualization in their novel, untrained sequences on Day 2, indicating that context and ordinal information as defined in Dash et al. is not at all devoid of information about the specific effectors involved in a sequence. In this regard, a main concern in my public review, as well as the second reviewer's public review, is that Dash et al. cannot tell apart, by design, whether there is truly contextualization in the neural representation of a sequence (which they claim), or whether their results regarding "contextualization" are explained by what they call "mixing" in their author response, i.e., an overlap of representations of consecutive movements, as suggested as an alternative explanation by Reviewer 2 and myself.

      Again, as stated in response to a related comment by the Reviewer above, it is not surprising that our results differ from the study by Kornysheva et al. (2019). A crucial difference between the studies that the Reviewer fails to recognize is that while ours is a learning study, the Kornysheva et al. study is not. Our rationale for not including multiple sequences in the same Day 1 training session of our study design was that it would lead to prominent interference effects, as widely reported in the literature [10-12]. Thus, while we had to take the issue of interference into consideration for our design, the Kornysheva et al. study did not, since it was not concerned with learning dynamics. The strength of the elegant Kornysheva study highlighted by the Reviewer (that the pre-planned sequence queuing gradient of sequence actions was independent of the effectors or timings used) is precisely due to the fact that participants were selecting between sequence options that had been previously, and equivalently, learned. The decoders in the Kornysheva study were trained, by design, to classify effector- and timing-independent sequence position information, so it is not surprising that this is the information they reflect.

      The questions asked in our study were different: 1) Do the neural representations of the same sequence action executed in different skill (ordinal sequence) locations differentiate (contextualize) during early learning?  and 2) Is the observed contextualization specific to the learned sequence? Thus, while Kornysheva et al. aimed to “dissociate ordinal position information from information about the moving effectors”, we tested various untrained sequences on Day 2 allowing us to determine that the contextualization result was specific to the trained sequence. By using this approach, we avoided interference effects on the learning of the primary skill caused by simultaneous acquisition of a second skill.

      Such temporal overlap of consecutive, independent finger representations may also account for the dynamics of "ordinal coding"/"contextualization", i.e., the increase in 2-class decoding accuracy, across Day 1 (Figure 4C). As learning progresses, both tapping speed and the consistency of keypress transition times increase (Figure 1), i.e., consecutive keypresses are closer in time, and more consistently so. As a result, information related to a given keypress is increasingly overlapping in time with information related to the preceding and subsequent keypresses. The authors seem to argue that their regression analysis in Figure 5 - figure supplement 3 speaks against any influence of tapping speed on "ordinal coding" (even though that argument is not made explicitly in the manuscript). However, Figure 5 - figure supplement 3 shows inter-individual differences in a between-subject analysis (across trials, as in panel A, or separately for each trial, as in panel B), and, therefore, says little about the within-subject dynamics of "ordinal coding" across the experiment. A regression of trial-by-trial "ordinal coding" on trial-by-trial tapping speed (either within-subject, or at a group-level, after averaging across subjects) could address this issue. Given the highly similar dynamics of "ordinal coding" on the one hand (Figure 4C), and tapping speed on the other hand (Figure 1B), I would expect a strong relationship between the two in the suggested within-subject (or group-level) regression.

      The aim of the between-subject regression analysis presented in the Results (see below) and in Figure 5—figure supplement 7 (previously Figure 5—figure supplement 3) of the revised manuscript, was to rule out a general effect of tapping speed on the magnitude of contextualization observed. If temporal overlap of neural representations was driving their differentiation, then participants typing at higher speeds should also show greater contextualization scores. We made the decision to use a between-subject analysis to address this issue since within-subject skill speed variance was rather small over most of the training session. 

      The Reviewer’s request that we additionally carry out a “regression of trial-by-trial "ordinal coding" on trial-by-trial tapping speed (either within-subject, or at a group-level, after averaging across subjects)” is essentially the same as the request of Reviewer 2 above. That request was to perform a modified simple linear regression analysis where the predictor is the sum of the 4-4 and 4-1 transition times, since these transitions are where any temporal overlaps of neural representations would occur. A new Figure 5 – figure supplement 6 in the revised manuscript includes a scatter plot showing the sum of adjacent index finger keypress transition times (i.e. – the 4-4 transition at the conclusion of one sequence iteration and the 4-1 transition at the beginning of the next sequence iteration) versus online contextualization distances measured during practice trials. Both the keypress transition times and online contextualization scores were z-score normalized within individual subjects, and then concatenated into a single data superset. As is clear in the figure data, results of the regression analysis showed a very weak linear relationship between the two (R<sup>2</sup> = 0.00507, F[1,3202] = 16.3). Thus, contextualization score magnitudes do not reflect the amount of overlap between adjacent keypresses when assessed either within- or between-subject.
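
      The pooled regression described above (z-scoring within each subject, concatenating, then fitting a simple linear regression) can be sketched as follows. This is a minimal sketch under stated assumptions, not the study's analysis code: the subject labels, variable names and toy values are hypothetical, and the sample standard deviation (n − 1) is an assumed choice for the z-score. The last line only checks that the reported R<sup>2</sup> and F statistic are mutually consistent, using the standard identity F = R<sup>2</sup> / (1 − R<sup>2</sup>) × residual df for a single-predictor regression.

```python
from math import sqrt


def zscore(values):
    """Z-score a sequence using the sample standard deviation (n - 1)."""
    n = len(values)
    mean = sum(values) / n
    sd = sqrt(sum((v - mean) ** 2 for v in values) / (n - 1))
    return [(v - mean) / sd for v in values]


def simple_regression(x, y):
    """Ordinary least squares with one predictor.

    Returns slope, R^2, F statistic and residual degrees of freedom.
    """
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    slope = sxy / sxx
    r2 = sxy ** 2 / (sxx * syy)
    df_resid = n - 2
    f_stat = r2 / (1.0 - r2) * df_resid
    return slope, r2, f_stat, df_resid


# Toy per-subject data (made-up values, not study data):
# summed transition times (s) and online contextualization distances.
subjects = {
    "s01": {"transition": [0.30, 0.28, 0.25, 0.26, 0.24],
            "context": [1.1, 1.4, 1.2, 1.6, 1.5]},
    "s02": {"transition": [0.41, 0.35, 0.33, 0.30, 0.31],
            "context": [0.9, 0.8, 1.3, 1.0, 1.2]},
}

# Normalize within each subject, then pool into one data superset.
x_all, y_all = [], []
for data in subjects.values():
    x_all += zscore(data["transition"])
    y_all += zscore(data["context"])

slope, r2, f_stat, df_resid = simple_regression(x_all, y_all)

# Consistency check on the statistics reported in the response.
f_from_reported_r2 = 0.00507 / (1.0 - 0.00507) * 3202
```

      Z-scoring within subjects removes between-subject differences in mean and scale, so the pooled fit reflects only trial-level covariation. With several thousand pooled observations, a significant F statistic can accompany a negligible R<sup>2</sup>, which is why the effect size rather than the test statistic carries the argument here.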

      The revised manuscript now states:

      Results (lines 318-328):

      “Within-subject correlations were consistent with these group-level findings. The average correlation between offline contextualization and micro-offline gains within individuals was significantly greater than zero (Figure 5 – figure supplement 4, left; t = 3.87, p = 0.00035, df = 25, Cohen's d = 0.76) and stronger than correlations between online contextualization and either micro-online (Figure 5 – figure supplement 4, middle; t = 3.28, p = 0.0015, df = 25, Cohen's d = 1.2) or micro-offline gains (Figure 5 – figure supplement 4, right; t = 3.7021, p = 5.3013e-04, df = 25, Cohen's d = 0.69). These findings were not explained by behavioral changes of typing rhythm (t = -0.03, p = 0.976; Figure 5 – figure supplement 5), adjacent keypress transition times (R<sup>2</sup> = 0.00507, F[1,3202] = 16.3; Figure 5 – figure supplement 6), or overall typing speed (between-subject; R<sup>2</sup> = 0.028, p = 0.41; Figure 5 – figure supplement 7).”

      Furthermore, learning should increase the number of (consecutively) correct sequences, and, thus, the consistency of finger transitions. Therefore, the increase in 2-class decoding accuracy may simply reflect an increasing overlap in time of increasingly consistent information from consecutive keypresses, which allows the classifier to dissociate the first and fifth keypress more reliably as learning progresses, simply based on the characteristic finger transitions associated with each. In other words, given that the physical context of a given keypress changes as learning progresses - keypresses move closer together in time and are more consistently correct - it seems problematic to conclude that the mental representation of that context changes. To draw that conclusion, the physical context should remain stable (or any changes to the physical context should be controlled for). 

      The revised manuscript now addresses specifically the question of mixing of temporally overlapping information:

      Results (Lines 310-328)

      “Offline contextualization strongly correlated with cumulative micro-offline gains (r = 0.903, R<sup>2</sup> = 0.816, p < 0.001; Figure 5 – figure supplement 1A, inset) across decoder window durations ranging from 50 to 250ms (Figure 5 – figure supplement 1B, C). The offline contextualization between the final sequence of each trial and the second sequence of the subsequent trial (excluding the first sequence) yielded comparable results. This indicates that pre-planning at the start of each practice trial did not directly influence the offline contextualization measure [30] (Figure 5 – figure supplement 2A, 1st vs. 2nd Sequence approaches). Conversely, online contextualization (using either measurement approach) did not explain early online learning gains (i.e. – Figure 5 – figure supplement 3). Within-subject correlations were consistent with these group-level findings. The average correlation between offline contextualization and micro-offline gains within individuals was significantly greater than zero (Figure 5 – figure supplement 4, left; t = 3.87, p = 0.00035, df = 25, Cohen's d = 0.76) and stronger than correlations between online contextualization and either micro-online (Figure 5 – figure supplement 4, middle; t = 3.28, p = 0.0015, df = 25, Cohen's d = 1.2) or micro-offline gains (Figure 5 – figure supplement 4, right; t = 3.7021, p = 5.3013e-04, df = 25, Cohen's d = 0.69). These findings were not explained by behavioral changes of typing rhythm (t = -0.03, p = 0.976; Figure 5 – figure supplement 5), adjacent keypress transition times (R<sup>2</sup> = 0.00507, F[1,3202] = 16.3; Figure 5 – figure supplement 6), or overall typing speed (between-subject; R<sup>2</sup> = 0.028, p = 0.41; Figure 5 – figure supplement 7).”

      Discussion (Lines 417-423)

      “Offline contextualization was not driven by trial-by-trial behavioral differences, including typing rhythm (Figure 5 – figure supplement 5) and adjacent keypress transition times (Figure 5 – figure supplement 6) nor by between-subject differences in overall typing speed (Figure 5 – figure supplement 7)—ruling out a reliance on differences in the temporal overlap of keypresses. Importantly, offline contextualization documented on Day 1 stabilized once a performance plateau was reached (trials 11-36), and was retained on Day 2, documenting overnight consolidation of the differentiated neural representations.”

      A similar difference in physical context may explain why neural representation distances ("differentiation") differ between rest and practice (Figure 5). The authors define "offline differentiation" by comparing the hybrid space features of the last index finger movement of a trial (ordinal position 5) and the first index finger movement of the next trial (ordinal position 1). However, the latter is not only the first movement in the sequence but also the very first movement in that trial (at least in trials that started with a correct sequence), i.e., not preceded by any recent movement. In contrast, the last index finger of the last correct sequence in the preceding trial includes the characteristic finger transition from the fourth to the fifth movement. Thus, there is more overlapping information arising from the consistent, neighbouring keypresses for the last index finger movement, compared to the first index finger movement of the next trial. A strong difference (larger neural representation distance) between these two movements is, therefore, not surprising, given the task design, and this difference is also expected to increase with learning, given the increase in tapping speed, and the consequent stronger overlap in representations for consecutive keypresses. Furthermore, initiating a new sequence involves pre-planning, while ongoing practice relies on online planning (Ariani et al., eNeuro 2021), i.e., two mental operations that are dissociable at the level of neural representation (Ariani et al., bioRxiv 2023).  

      The revised manuscript now addresses specifically the question of pre-planning:

      Results (lines 310-318):

      “Offline contextualization strongly correlated with cumulative micro-offline gains (r = 0.903, R<sup>2</sup> = 0.816, p < 0.001; Figure 5 – figure supplement 1A, inset) across decoder window durations ranging from 50 to 250ms (Figure 5 – figure supplement 1B, C). The offline contextualization between the final sequence of each trial and the second sequence of the subsequent trial (excluding the first sequence) yielded comparable results. This indicates that pre-planning at the start of each practice trial did not directly influence the offline contextualization measure [30] (Figure 5 – figure supplement 2A, 1st vs. 2nd Sequence approaches). Conversely, online contextualization (using either measurement approach) did not explain early online learning gains (i.e. – Figure 5 – figure supplement 3).”

      Discussion (lines 408-416):

      “Offline contextualization consistently correlated with early learning gains across a range of decoding windows (50–250ms; Figure 5 – figure supplement 1). This result remained unchanged when measuring offline contextualization between the last and second sequence of consecutive trials, inconsistent with a possible confounding effect of pre-planning [30] (Figure 5 – figure supplement 2A). On the other hand, online contextualization did not predict learning (Figure 5 – figure supplement 3). Consistent with these results the average within-subject correlation between offline contextualization and micro-offline gains was significantly stronger than within-subject correlations between online contextualization and either micro-online or micro-offline gains (Figure 5 – figure supplement 4).”

      A further complication in interpreting the results stems from the visual feedback that participants received during the task. Each keypress generated an asterisk shown above the string on the screen. It is not clear why the authors introduced this complicating visual feedback in their task, besides consistency with their previous studies. The resulting systematic link between the pattern of visual stimulation (the number of asterisks on the screen) and the ordinal position of a keypress makes the interpretation of "contextual information" that differentiates between ordinal positions difficult. During the review process, the authors reported a confusion matrix from a classification of asterisks position based on eye tracking data recorded during the task and concluded that the classifier performed at chance level and gaze was, thus, apparently not biased by the visual stimulation. However, the confusion matrix showed a huge bias that was difficult to interpret (a very strong tendency to predict one of the five asterisk positions, despite chance-level performance). Without including additional information for this analysis (or simply the gaze position as a function of the number of asterisks on the screen) in the manuscript, this important control analysis cannot be properly assessed, and is not available to the public.

      We now include the gaze position data requested by the Reviewer alongside the confusion matrix results in Figure 4 – figure supplement 3.

      Results (lines 207-211):

      “An alternate decoder trained on ICA components labeled as movement or physiological artefacts (e.g. – head movement, ECG, eye movements and blinks; Figure 3 – figure supplement 3A, D) and removed from the original input feature set during the pre-processing stage approached chance-level performance (Figure 4 – figure supplement 3), indicating that the 4-class hybrid decoder results were not driven by task-related artefacts.”

      Results (lines 261-268):

      “As expected, the 5-class hybrid-space decoder performance approached chance levels when tested with randomly shuffled keypress labels (18.41% ± SD 7.4% for Day 1 data; Figure 4 – figure supplement 3C). Task-related eye movements did not explain these results since an alternate 5-class hybrid decoder constructed from three eye movement features (gaze position at the KeyDown event, gaze position 200ms later, and peak eye movement velocity within this window; Figure 4 – figure supplement 3A) performed at chance levels (cross-validated test accuracy = 0.2181; Figure 4 – figure supplement 3B, C).”

      Discussion (Lines 362-368):

      “Task-related movements—which also express in lower frequency ranges—did not explain these results given the near chance-level performance of alternative decoders trained on (a) artefact-related ICA components removed during MEG preprocessing (Figure 3 – figure supplement 3A-C) and on (b) task-related eye movement features (Figure 4 – figure supplement 3B, C). This explanation is also inconsistent with the minimal average head motion of 1.159 mm (± 1.077 SD) across the MEG recording (Figure 3 – figure supplement 3D).”

      The rationale for the task design including the asterisks is presented below:

      Methods (Lines 500-514)

      “The five-item sequence was displayed on the computer screen for the duration of each practice round and participants were directed to fix their gaze on the sequence. Small asterisks were displayed above a sequence item after each successive keypress, signaling the participants' present position within the sequence. Inclusion of this feedback minimizes working memory loads during task performance [73]. Following the completion of a full sequence iteration, the asterisk returned to the first sequence item. The asterisk did not provide error feedback as it appeared for both correct and incorrect keypresses. At the end of each practice round, the displayed number sequence was replaced by a string of five "X" symbols displayed on the computer screen, which remained for the duration of the rest break. Participants were instructed to focus their gaze on the screen during this time. The behavior in this explicit, motor learning task consists of generative action sequences rather than sequences of stimulus-induced responses as in the serial reaction time task (SRTT). A similar real-world example would be manually inputting a long password into a secure online application in which one intrinsically generates the sequence from memory and receives similar feedback about the password sequence position (also provided as asterisks), which is typically ignored by the user.”

      The authors report a significant correlation between "offline differentiation" and cumulative micro-offline gains. However, this does not address the question whether there is a trial-by-trial relation between the degree of "contextualization" and the amount of micro-offline gains - i.e., the question whether performance changes (micro-offline gains) are less pronounced across rest periods for which the change in "contextualization" is relatively low. The single-subject correlation between contextualization changes "during" rest and micro-offline gains (Figure 5 - figure supplement 4) addresses this question, however, the critical statistical test (are correlation coefficients significantly different from zero) is not included. Given the displayed distribution, it seems unlikely that correlation coefficients are significantly above zero. 

      As recommended by the Reviewer, we now include one-way right-tailed t-test results which provide further support to the previously reported finding. The mean of within-subject correlations between offline contextualization and cumulative micro-offline gains was significantly greater than zero (t = 3.87, p = 0.00035, df = 25, Cohen's d = 0.76; see Figure 5 – figure supplement 4, left), while correlations for online contextualization versus cumulative micro-online (t = -1.14, p = 0.8669, df = 25, Cohen's d = -0.22) or micro-offline gains (t = -0.097, p = 0.5384, df = 25, Cohen's d = -0.019) were not. We have incorporated the significant one-way t-test for offline contextualization and cumulative micro-offline gains in the Results section of the revised manuscript (lines 313-318) and the Figure 5 – figure supplement 4 legend.
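
      For readers who wish to check these statistics, the test can be sketched as below. This is an illustrative sketch, not the authors' code: the function names are hypothetical, and the input is assumed to be one correlation coefficient per participant (df = 25 implies 26 participants). For a one-sample test, Cohen's d equals t divided by the square root of the sample size, which gives a quick consistency check on the reported values.

```python
import math


def one_sample_t(values, mu0=0.0):
    """One-sample t statistic, degrees of freedom and Cohen's d against mu0."""
    n = len(values)
    mean = sum(values) / n
    sd = math.sqrt(sum((v - mean) ** 2 for v in values) / (n - 1))  # sample SD
    t = (mean - mu0) / (sd / math.sqrt(n))
    d = (mean - mu0) / sd
    return t, n - 1, d


def cohens_d_from_t(t, n):
    """Cohen's d implied by a one-sample t statistic with n observations."""
    return t / math.sqrt(n)


# Consistency check on the reported values (df = 25, so n = 26).
for t_reported in (3.87, -1.14, -0.097):
    print(t_reported, round(cohens_d_from_t(t_reported, 26), 3))
```

      The implied effect sizes (about 0.76, -0.22 and -0.019) agree with those reported above. The right-tailed p-value itself comes from the Student's t distribution with n - 1 degrees of freedom; in SciPy this corresponds to `scipy.stats.ttest_1samp(values, 0.0, alternative="greater")`.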

      The authors follow the assumption that micro-offline gains reflect offline learning.

      However, there is no compelling evidence in the literature, and no evidence in the present manuscript, that micro-offline gains (during any training phase) reflect offline learning. Instead, emerging evidence in the literature indicates that they do not (Das et al., bioRxiv 2024), and instead reflect transient performance benefits when participants train with breaks, compared to participants who train without breaks, however, these benefits vanish within seconds after training if both groups of participants perform under comparable conditions (Das et al., bioRxiv 2024). During the review process, the authors argued that differences in the design between Das et al. (2024) on the one hand (Experiments 1 and 2), and the study by Bönstrup et al. (2019) on the other hand, may have prevented Das et al. (2024) from finding the assumed (lasting) learning benefit by micro-offline consolidation. However, the Supplementary Material of Das et al. (2024) includes an experiment (Experiment S1) whose design closely follows the early learning phase of Bönstrup et al. (2019), and which, nevertheless, demonstrates that there is no lasting benefit of taking breaks for the acquired skill level, despite the presence of micro-offline gains. 

      We thank the Reviewer for alerting us to this new data added to the revised supplementary materials of Das et al. (2024) posted to bioRxiv. However, despite the Reviewer’s claim to the contrary, a careful comparison between the Das et al. and Bönstrup et al. studies reveals more substantive differences than similarities and does not support the statement that Experiment S1 “closely follows a large proportion of the early learning phase of Bönstrup et al. (2019)”.

      In the Das et al. Experiment S1, sixty-two participants were randomly assigned to “with breaks” or “no breaks” skill training groups. The “with breaks” group alternated 10 seconds of skill sequence practice with 10 seconds of rest over seven trials (2 min 20 sec total training duration). This amounts to 66.7% of the early learning period defined by Bönstrup et al. (2019) (i.e. - eleven 10-second-long practice periods interleaved with ten 10-second-long rest breaks; 3 min 30 sec total training duration).
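
      The duration comparison can be worked through explicitly. The short sketch below uses only the trial counts and period lengths stated in this response; the variable and function names are illustrative. Seven 10-second practice periods, each followed by 10 seconds of rest, sum to 140 s (2 min 20 sec), which is 66.7% of the 210 s (3 min 30 sec) early learning period.

```python
PRACTICE_S = 10  # length of one practice period, in seconds
REST_S = 10      # length of one interleaved rest break, in seconds

# Das et al. Experiment S1, "with breaks" group: seven practice periods,
# each followed by a rest break.
das_s1_total = 7 * (PRACTICE_S + REST_S)

# Bönstrup et al. (2019) early learning: eleven practice periods
# interleaved with ten rest breaks.
bonstrup_early_total = 11 * PRACTICE_S + 10 * REST_S


def as_min_sec(seconds):
    """Format a duration in seconds as 'M min S sec'."""
    return f"{seconds // 60} min {seconds % 60} sec"


print(as_min_sec(das_s1_total))                     # 2 min 20 sec
print(as_min_sec(bonstrup_early_total))             # 3 min 30 sec
print(f"{das_s1_total / bonstrup_early_total:.1%}")  # 66.7%
```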

      Also, please note that while no performance feedback nor reward was given in the Bönstrup et al. (2019) study, participants in the Das et al. study received explicit performance-based monetary rewards, a potentially crucial driver of differentiated behavior between the two studies:

      “Participants were incentivized with bonus money based on the total number of correct sequences completed throughout the experiment.”

      The “no breaks” group in the Das et al. study practiced the skill sequence for 70 continuous seconds. Both groups (despite one being labeled “no breaks”) follow training with a long 3-minute break (also note that since the “with breaks” group ends with 10 seconds of rest their break is actually longer), before finishing with a skill “test” over a continuous 50-second-long block. During the 70 seconds of training, the “with breaks” group shows more learning than the “no breaks” group. Interestingly, following the long 3-minute break the “with breaks” group displays a performance drop (relative to their performance at the end of training) that is stable over the full 50-second test, while the “no breaks” group shows an immediate performance improvement following the long break that continues to increase over the 50-second test.

      Separately, there are important issues regarding the Das et al. study that should be considered through the lens of recent findings not referred to in the preprint. A major element of their experimental design is that both groups—“with breaks” and “no breaks”— actually receive quite a long 3-minute break just before the skill test. This long break is more than 2.5x the cumulative interleaved rest experienced by the “with breaks” group. Thus, although the design is intended to contrast the presence or absence of rest “breaks”, that difference between groups is no longer maintained at the point of the skill test. 

      The Das et al. results are most consistent with an alternative interpretation of the data— that the “no breaks” group experiences offline learning during their long 3-minute break. This is supported by the recent work of Griffin et al. (2025) where micro-array recordings from primary and premotor cortex were obtained from macaque monkeys while they performed blocks of ten continuous reaching sequences up to 81.4 seconds in duration (see source data for Extended Data Figure 1h) with 90 seconds of interleaved rest. Griffin et al. observed offline improvement in skill immediately following the rest break that was causally related to neural reactivations (i.e. – neural replay) that occurred during the rest break. Importantly, the highest density of reactivations was present in the very first 90-second break between Blocks 1 and 2 (see Fig. 2f in Griffin et al., 2025). This supports the interpretation that both the “with breaks” and “no breaks” groups express offline learning gains, with these gains being delayed in the “no breaks” group due to the practice schedule.

      On the other hand, if offline learning can occur during this longer break, then why would the “with breaks” group show no benefit? Again, it could be that most of the offline gains for this group were front-loaded during the seven shorter 10-second rest breaks. Another possible, though not mutually exclusive, explanation is that the observed drop in performance in the “with breaks” group is driven by contextual interference. Specifically, similar to Experiments 1 and 2 in Das et al. (2024), the skill test is conducted under very different conditions than those which the “with breaks” group practiced the skill under (short bursts of practice alternating with equally short breaks). On the other hand, the “no breaks” group is tested (50 seconds of continuous practice) under quite similar conditions to their training schedule (70 seconds of continuous practice). Thus, it is possible that this dissimilarity between training and test could lead to reduced performance in the “with breaks” group.

      We made the following manuscript revisions related to these important issues: 

      Introduction (Lines 26-56)

      “Practicing a new motor skill elicits rapid performance improvements (early learning) [1] that precede skill performance plateaus [5]. Skill gains during early learning accumulate over rest periods (micro-offline) interspersed with practice [1, 6-10], and are up to four times larger than offline performance improvements reported following overnight sleep [1]. During this initial interval of prominent learning, retroactive interference immediately following each practice interval reduces learning rates relative to interference after passage of time, consistent with stabilization of the motor memory [11]. Micro-offline gains observed during early learning are reproducible [7, 10-13] and are similar in magnitude even when practice periods are reduced by half to 5 seconds in length, thereby confirming that they are not merely a result of recovery from performance fatigue [11]. Additionally, they are unaffected by the random termination of practice periods, which eliminates the possibility of predictive motor slowing as a contributing factor [11]. Collectively, these behavioral findings point towards the interpretation that micro-offline gains during early learning represent a form of memory consolidation [1].

      This interpretation has been further supported by brain imaging and electrophysiological studies linking known memory-related networks and consolidation mechanisms to rapid offline performance improvements. In humans, the rate of hippocampo-neocortical neural replay predicts micro-offline gains [6]. Consistent with these findings, Chen et al. [12] and Sjøgård et al. [13] furnished direct evidence from intracranial human EEG studies, demonstrating a connection between the density of hippocampal sharp-wave ripples (80-120 Hz)—recognized markers of neural replay—and micro-offline gains during early learning. Further, Griffin et al. reported that neural replay of task-related ensembles in the motor cortex of macaques during brief rest periods— akin to those observed in humans [1, 6-8, 14]—are not merely correlated with, but are causal drivers of micro-offline learning [15]. Specifically, the same reach directions that were replayed the most during rest breaks showed the greatest reduction in path length (i.e. – more efficient movement path between two locations in the reach sequence) during subsequent trials, while stimulation applied during rest intervals preceding performance plateau reduced reactivation rates and virtually abolished micro-offline gains [15]. Thus, converging evidence in humans and non-human primates across indirect non-invasive and direct invasive recording techniques link hippocampal activity, neural replay dynamics and offline skill gains in early motor learning that precede performance plateau.”

      Next, in the Methods, we articulate important constraints formulated by Pan and Rickard and Bönstrup et al. for meaningful measurements:

      Methods (Lines 493-499)

      “The study design followed specific recommendations by Pan and Rickard (2015): 1) utilizing 10-second practice trials and 2) constraining analysis of micro-offline gains to early learning trials (where performance monotonically increases and 95% of overall performance gains occur) that precede the emergence of “scalloped” performance dynamics strongly linked to reactive inhibition effects ([29, 72]). This is precisely the portion of the learning curve Pan and Rickard referred to when they stated “…rapid learning during that period masks any reactive inhibition effect” [29].”

      We finally discuss the implications of neglecting some or all of these recommendations:

      Discussion (Lines 444-452):

      “Finally, caution should be exercised when extrapolating findings during early skill learning, a period of steep performance improvements, to findings reported after insufficient practice [67], post-plateau performance periods [68], or non-learning situations (e.g. performance of non-repeating keypress sequences in  [67]) when reactive inhibition or contextual interference effects are prominent. Ultimately, it will be important to develop new paradigms allowing one to independently estimate the different coincident or antagonistic features (e.g. - memory consolidation, planning, working memory and reactive inhibition) contributing to micro-online and micro-offline gains during and after early skill learning within a unifying framework.”

      Along these lines, the authors' claim, based on Bönstrup et al. 2020, that "retroactive interference immediately following practice periods reduces micro-offline learning", is not supported by that very reference. Citing Bönstrup et al. (2020), "Regarding early learning dynamics (trials 1-5), we found no differences in microscale learning parameters (micro online/offline) or total early learning between both interference groups." That is, contrary to Dash et al.'s current claim, Bönstrup et al. (2020) did not find any retroactive interference effect on the specific behavioral readout (micro-offline gains) that the authors assume to reflect consolidation. 

      Please, note that the Bönstrup et al. 2020 paper abstract states: 

      “Third, retroactive interference immediately after each practice period reduced the learning rate relative to interference after passage of time (N = 373), indicating stabilization of the motor memory at a microscale of several seconds.”

      which is further supported by this statement in the Results: 

      “The model comprised three parameters representing the initial performance, maximum performance and learning rate (see Eq. 1, “Methods”, “Data Analysis” section). We then statistically compared the model parameters between the interference groups (Fig. 2d). The late interference group showed a higher learning rate compared with the early interference group (late: 0.26 ± 0.23, early: 2.15 ± 0.20, P=0.04). The effect size of the group difference was small to medium (Cohen’s d 0.15)[29]. Similar differences with a stronger rise in the learning curve of a late interference groups vs. an early interference group were found in a smaller sample collected in the lab environment (Supplementary Fig. 3).”

      We have modified the statement in the revised manuscript to specify that the difference observed was between learning rates:

      Introduction (Lines 30-32)

      “During this initial interval of prominent learning, retroactive interference immediately following each practice interval reduces learning rates relative to interference after passage of time, consistent with stabilization of the motor memory [11].”

      The authors conclude that performance improves, and representation manifolds differentiate, "during" rest periods (see, e.g., abstract). However, micro-offline gains (as well as offline contextualization) are computed from data obtained during practice, not rest, and may, thus, just as well reflect a change that occurs "online", e.g., at the very onset of practice (like pre-planning) or throughout practice (like fatigue, or reactive inhibition).  

      The Reviewer raises again the issue of a potential confound of “pre-planning” on our contextualization measures as in the comment above: 

      “Furthermore, initiating a new sequence involves pre-planning, while ongoing practice relies on online planning (Ariani et al., eNeuro 2021), i.e., two mental operations that are dissociable at the level of neural representation (Ariani et al., bioRxiv 2023).”

      The cited studies by Ariani et al. indicate that effects of pre-planning are likely to impact the first 3 keypresses of the initial sequence iteration in each trial. As stated in the response to this comment above, we conducted a control analysis of contextualization that ignores the first sequence iteration in each trial to partial out any potential pre-planning effect. This control analysis yielded comparable results, indicating that pre-planning is not a major driver of our reported contextualization effects. We now report this in the revised manuscript (Results, lines 310-318, quoted above).

      We also state in the Figure 1 legend (Lines 99-103) in the revised manuscript that pre-planning has no effect on the behavioral measures of micro-offline and micro-online gains in our dataset.

      The Reviewer also raises the issue of possible effects stemming from “fatigue” and “reactive inhibition” which inhibit performance and are indeed relevant to skill learning studies. We designed our task to specifically mitigate these effects. We now more clearly articulate this rationale in the description of the task design as well as the measurement constraints essential for minimizing their impact.

      We also discuss the implications of fatigue and reactive inhibition effects in experimental designs that neglect to follow these recommendations formulated by Pan and Rickard in the Discussion section and propose how this issue can be better addressed in future investigations.

      To summarize, the results of our study indicate that: (a) offline contextualization effects are not explained by pre-planning of the first action sequence iteration in each practice trial; and (b) the task design implemented in this study purposefully minimizes any possible effects of reactive inhibition or fatigue. Circling back to the Reviewer’s proposal that “contextualization…may just as well reflect a change that occurs "online"”, we show in this paper direct empirical evidence that contextualization develops to a greater extent across rest periods rather than across practice trials, contrary to the Reviewer’s proposal.

      That is, the definition of micro-offline gains (as well as offline contextualization) conflates online and "offline" processes. This becomes strikingly clear in the recent Nature paper by Griffin et al. (2025), who computed micro-offline gains as the difference in average performance across the first five sequences in a practice period (a block, in their terminology) and the last five sequences in the previous practice period. Averaging across sequences in this way minimises the chance to detect online performance changes and inflates changes in performance "offline". The problem that "online" gains (or contextualization) is actually computed from data entirely generated online, and therefore subject to processes that occur online, is inherent in the very definition of micro-online gains, whether or not they are computed from averaged performance.

      We would like to make it clear that the issue raised by the Reviewer with respect to averaging across sequences done in the Griffin et al. (2025) study does not impact our study in any way. The primary skill measure used in all analyses reported in our paper is not temporally averaged. We estimated instantaneous correct sequence speed over the entire trial. Once the first sequence iteration within a trial is completed, the speed estimate is then updated at the resolution of individual keypresses. All micro-online and -offline behavioral changes are measured as the difference in instantaneous speed at the beginning and end of individual practice trials.

      Methods (lines 528-530):

      “The instantaneous correct sequence speed was calculated as the inverse of the average KTT across a single correct sequence iteration and was updated for each correct keypress.”
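
      The measure quoted above can be illustrated with a short sketch. This is a hedged illustration, not the authors' implementation: it assumes every keypress in the trial is correct, takes the averaging window to be the keypress transition times (KTTs) within the most recent five-keypress sequence iteration, and uses hypothetical function names. The micro-offline definition in the sketch (start of the next trial minus end of the current trial) is likewise an assumption based on the description of these measures as differences in instantaneous speed at the beginning and end of practice trials.

```python
def instantaneous_speed(press_times, seq_len=5):
    """Return (time, speed) pairs, one per keypress, starting at the keypress
    that completes the first full sequence iteration.

    Speed is the inverse of the mean KTT over the most recent `seq_len`
    keypresses (keypresses per second). Assumes all keypresses are correct.
    """
    samples = []
    for i in range(seq_len - 1, len(press_times)):
        window = press_times[i - seq_len + 1 : i + 1]
        ktts = [later - earlier for earlier, later in zip(window, window[1:])]
        mean_ktt = sum(ktts) / len(ktts)
        samples.append((press_times[i], 1.0 / mean_ktt))
    return samples


def micro_online(trial_speeds):
    """Change in speed from the beginning to the end of one practice trial."""
    return trial_speeds[-1][1] - trial_speeds[0][1]


def micro_offline(prev_trial_speeds, next_trial_speeds):
    """Change in speed across the rest break between two consecutive trials."""
    return next_trial_speeds[0][1] - prev_trial_speeds[-1][1]


# Toy trial: four transitions of 0.5 s, then a faster 0.25 s transition.
speeds = instantaneous_speed([0.0, 0.5, 1.0, 1.5, 2.0, 2.25])
print(speeds)
```

      Because the estimate is refreshed at every keypress rather than averaged over several sequences, a within-trial speed-up such as the final 0.25 s transition in the toy data registers immediately in the last sample.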

      The instantaneous speed measure used in our analyses, in fact, maximizes the likelihood of detecting changes in online performance, as the Reviewer indicates.  Despite this optimally sensitive measurement of online changes, our findings remained robust, consistently converging on the same outcome across our original analyses and the multiple controls recommended by the reviewers. Notably, online contextualization changes are significantly weaker than offline contextualization in all comparisons with different measurement approaches.

      Results (lines 302-309)

      “The Euclidean distance between neural representations of Index<sub>OP1</sub> (i.e. - index finger keypress at ordinal position 1 of the sequence) and Index<sub>OP5</sub> (i.e. - index finger keypress at ordinal position 5 of the sequence) increased progressively during early learning (Figure 5A)—predominantly during rest intervals (offline contextualization) rather than during practice (online) (t = 4.84, p < 0.001, df = 25, Cohen's d = 1.2; Figure 5B; Figure 5 – figure supplement 1A). An alternative online contextualization determination equalling the time interval between online and offline comparisons (Trial-based; 10 seconds between Index<sub>OP1</sub> and Index<sub>OP5</sub> observations in both cases) rendered a similar result (Figure 5 – figure supplement 2B).”

      Results (lines 316-318)

      “Conversely, online contextualization (using either measurement approach) did not explain early online learning gains (i.e. – Figure 5 – figure supplement 3).”

      Results (lines 318-328)

      “Within-subject correlations were consistent with these group-level findings. The average correlation between offline contextualization and micro-offline gains within individuals was significantly greater than zero (Figure 5 – figure supplement 4, left; t = 3.87, p = 0.00035, df = 25, Cohen's d = 0.76) and stronger than correlations between online contextualization and either micro-online (Figure 5 – figure supplement 4, middle; t = 3.28, p = 0.0015, df = 25, Cohen's d = 1.2) or micro-offline gains (Figure 5 – figure supplement 4, right; t = 3.7021, p = 5.3013e-04, df = 25, Cohen's d = 0.69). These findings were not explained by behavioral changes of typing rhythm (t = -0.03, p = 0.976; Figure 5 – figure supplement 5), adjacent keypress transition times (R<sup>2</sup> = 0.00507, F[1,3202] = 16.3; Figure 5 – figure supplement 6), or overall typing speed (between-subject; R<sup>2</sup> = 0.028, p = 0.41; Figure 5 – figure supplement 7).”

      We disagree with the Reviewer’s statement that “the definition of micro-offline gains (as well as offline contextualization) conflates online and "offline" processes”.  From a strictly behavioral point of view, it is obviously true that one can only measure skill (rather than the absence of it during rest) to determine how it changes over time.  While skill changes surrounding rest are used to infer offline learning processes, recovery of skill decay following intense practice is used to infer “unmeasurable” recovery from fatigue or reactive inhibition. In other words, the alternative processes proposed by the Reviewer also rely on the same inferential reasoning. 

      Importantly, inferences can be validated through the identification of mechanisms. Our experiment constrained the study to evaluation of changes in neural representations of the same action in different contexts, while minimizing the impact of mechanisms related to fatigue/reactive inhibition [13, 14]. In this way, we observed that behavioral gains and neural contextualization occur to a greater extent over rest breaks than during practice trials, and that offline contextualization changes strongly correlate with the offline behavioral gains, while online contextualization does not. This result was supported by the results of all control analyses recommended by the Reviewers. Specifically:

      Methods (Lines 493-499)

      “The study design followed specific recommendations by Pan and Rickard (2015): 1) utilizing 10-second practice trials and 2) constraining analysis of micro-offline gains to early learning trials (where performance monotonically increases and 95% of overall performance gains occur) that precede the emergence of “scalloped” performance dynamics strongly linked to reactive inhibition effects ([29, 72]). This is precisely the portion of the learning curve Pan and Rickard referred to when they stated “…rapid learning during that period masks any reactive inhibition effect” [29].”

      And Discussion (Lines 444-448):

      “Finally, caution should be exercised when extrapolating findings during early skill learning, a period of steep performance improvements, to findings reported after insufficient practice [67], post-plateau performance periods [68], or non-learning situations (e.g. performance of non-repeating keypress sequences in  [67]) when reactive inhibition or contextual interference effects are prominent.”

      Next, we show that offline contextualization is greater than online contextualization and predicts offline behavioral gains across all measurement approaches, including all controls suggested by the Reviewer’s comments and recommendations. 

      Results (lines 302-318):

      “The Euclidian distance between neural representations of Index<sub>OP1</sub> (i.e. - index finger keypress at ordinal position 1 of the sequence) and Index<sub>OP5</sub> (i.e. - index finger keypress at ordinal position 5 of the sequence) increased progressively during early learning (Figure 5A)—predominantly during rest intervals (offline contextualization) rather than during practice (online) (t = 4.84, p < 0.001, df = 25, Cohen's d = 1.2; Figure 5B; Figure 5 – figure supplement 1A). An alternative online contextualization determination equalling the time interval between online and offline comparisons (Trial-based; 10 seconds between Index<sub>OP1</sub> and Index<sub>OP5</sub> observations in both cases) rendered a similar result (Figure 5 – figure supplement 2B).

      Offline contextualization strongly correlated with cumulative micro-offline gains (r = 0.903, R<sup>2</sup> = 0.816, p < 0.001; Figure 5 – figure supplement 1A, inset) across decoder window durations ranging from 50 to 250ms (Figure 5 – figure supplement 1B, C). The offline contextualization between the final sequence of each trial and the second sequence of the subsequent trial (excluding the first sequence) yielded comparable results. This indicates that pre-planning at the start of each practice trial did not directly influence the offline contextualization measure [30] (Figure 5 – figure supplement 2A, 1st vs. 2nd Sequence approaches). Conversely, online contextualization (using either measurement approach) did not explain early online learning gains (i.e. – Figure 5 – figure supplement 3).”

      Results (lines 318-324)

      “Within-subject correlations were consistent with these group-level findings. The average correlation between offline contextualization and micro-offline gains within individuals was significantly greater than zero (Figure 5 – figure supplement 4, left; t = 3.87, p = 0.00035, df = 25, Cohen's d = 0.76) and stronger than correlations between online contextualization and either micro-online (Figure 5 – figure supplement 4, middle; t = 3.28, p = 0.0015, df = 25, Cohen's d = 1.2) or micro-offline gains (Figure 5 – figure supplement 4, right; t = 3.7021, p = 5.3013e-04, df = 25, Cohen's d = 0.69).”

      Discussion (lines 408-416):

      “Offline contextualization consistently correlated with early learning gains across a range of decoding windows (50–250ms; Figure 5 – figure supplement 1). This result remained unchanged when measuring offline contextualization between the last and second sequence of consecutive trials, inconsistent with a possible confounding effect of pre-planning [30] (Figure 5 – figure supplement 2A). On the other hand, online contextualization did not predict learning (Figure 5 – figure supplement 3). Consistent with these results the average within-subject correlation between offline contextualization and micro-offline gains was significantly stronger than within subject correlations between online contextualization and either micro-online or micro-offline gains (Figure 5 – figure supplement 4).”
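
      The offline and online distance measures quoted above can be illustrated with a minimal sketch. It assumes one plausible pairing: offline contextualization compares Index<sub>OP5</sub> in the last sequence of one trial with Index<sub>OP1</sub> in the first sequence of the next trial, and trial-based online contextualization compares Index<sub>OP1</sub> in the first sequence with Index<sub>OP5</sub> in the last sequence of the same trial. The data layout and the function names (`euclidean`, `offline_distances`, `online_distances`) are hypothetical and are not taken from the manuscript's analysis code.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def offline_distances(trials):
    """Distance across each rest break (assumed pairing): OP5 of the last
    sequence in trial n vs. OP1 of the first sequence in trial n + 1."""
    return [euclidean(prev[-1]["op5"], nxt[0]["op1"])
            for prev, nxt in zip(trials, trials[1:])]

def online_distances(trials):
    """Trial-based distance within practice (assumed pairing): OP1 of the
    first sequence vs. OP5 of the last sequence in the same trial."""
    return [euclidean(trial[0]["op1"], trial[-1]["op5"]) for trial in trials]

# Toy data: two trials, two sequences each, 2-D "neural" features.
trials = [
    [{"op1": [0.0, 0.0], "op5": [0.0, 0.0]},
     {"op1": [0.0, 0.0], "op5": [1.0, 0.0]}],
    [{"op1": [4.0, 4.0], "op5": [4.0, 4.0]},
     {"op1": [4.0, 4.0], "op5": [4.0, 5.0]}],
]

offline = offline_distances(trials)  # one value per rest break
online = online_distances(trials)    # one value per practice trial
```

      In this toy example the representation moves more across the rest break than within either practice trial, which is the pattern the quoted Results describe at the group level.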

      We then show that offline contextualization is not explained by pre-planning of the first action sequence:

      Results (lines 310-316):

      “Offline contextualization strongly correlated with cumulative micro-offline gains (r = 0.903, R<sup>2</sup> = 0.816, p < 0.001; Figure 5 – figure supplement 1A, inset) across decoder window durations ranging from 50 to 250ms (Figure 5 – figure supplement 1B, C). The offline contextualization between the final sequence of each trial and the second sequence of the subsequent trial (excluding the first sequence) yielded comparable results. This indicates that pre-planning at the start of each practice trial did not directly influence the offline contextualization measure [30] (Figure 5 – figure supplement 2A, 1st vs. 2nd Sequence approaches).”

      Discussion (lines 409-412):

      “This result remained unchanged when measuring offline contextualization between the last and second sequence of consecutive trials, inconsistent with a possible confounding effect of pre-planning [30] (Figure 5 – figure supplement 2A).”

      In summary, none of the presented evidence in this paper—including results of the multiple control analyses carried out in response to the Reviewers’ recommendations— supports the Reviewer’s position. 

      Please note that the micro-offline learning "inference" has extensive mechanistic support across species and neural recording techniques (see Introduction, lines 26-56). In contrast, the reactive inhibition "inference," which is the Reviewer's alternative interpretation, has no such support yet [15].

      Introduction (Lines 26-56)

      “Practicing a new motor skill elicits rapid performance improvements (early learning) [1] that precede skill performance plateaus [5]. Skill gains during early learning accumulate over rest periods (micro-offline) interspersed with practice [1, 6-10], and are up to four times larger than offline performance improvements reported following overnight sleep [1]. During this initial interval of prominent learning, retroactive interference immediately following each practice interval reduces learning rates relative to interference after passage of time, consistent with stabilization of the motor memory [11]. Micro-offline gains observed during early learning are reproducible [7, 10-13] and are similar in magnitude even when practice periods are reduced by half to 5 seconds in length, thereby confirming that they are not merely a result of recovery from performance fatigue [11]. Additionally, they are unaffected by the random termination of practice periods, which eliminates the possibility of predictive motor slowing as a contributing factor [11]. Collectively, these behavioral findings point towards the interpretation that micro-offline gains during early learning represent a form of memory consolidation [1].

      This interpretation has been further supported by brain imaging and electrophysiological studies linking known memory-related networks and consolidation mechanisms to rapid offline performance improvements. In humans, the rate of hippocampo-neocortical neural replay predicts micro-offline gains [6].

      Consistent with these findings, Chen et al. [12] and Sjøgård et al. [13] furnished direct evidence from intracranial human EEG studies, demonstrating a connection between the density of hippocampal sharp-wave ripples (80-120 Hz)—recognized markers of neural replay—and micro-offline gains during early learning. Further, Griffin et al. reported that neural replay of task-related ensembles in the motor cortex of macaques during brief rest periods— akin to those observed in humans [1, 6-8, 14]—are not merely correlated with, but are causal drivers of micro-offline learning [15]. Specifically, the same reach directions that were replayed the most during rest breaks showed the greatest reduction in path length (i.e. – more efficient movement path between two locations in the reach sequence) during subsequent trials, while stimulation applied during rest intervals preceding performance plateau reduced reactivation rates and virtually abolished micro-offline gains [15]. Thus, converging evidence in humans and non-human primates across indirect non-invasive and direct invasive recording techniques link hippocampal activity, neural replay dynamics and offline skill gains in early motor learning that precede performance plateau.”

      That said, absence of evidence is not evidence of absence, and for that reason we also state in the Discussion (lines 448-452):

      A simple control analysis based on shuffled class labels could lend further support to the authors' complex decoding approach. As a control analysis that completely rules out any source of overfitting, the authors could test the decoder after shuffling class labels. Following such shuffling, decoding accuracies should drop to chance-level for all decoding approaches, including the optimized decoder. This would also provide an estimate of actual chance-level performance (which is informative over and beyond the theoretical chance level). During the review process, the authors reported this analysis to the reviewers. Given that readers may consider following the presented decoding approach in their own work, it would have been important to include that control analysis in the manuscript to convince readers of its validity. 

      As requested, the label-shuffling analysis was carried out for both 4- and 5-class decoders and is now reported in the revised manuscript.

      Results (lines 204-207):

      “Testing the keypress state (4-class) hybrid decoder performance on Day 1 after randomly shuffling keypress labels for held-out test data resulted in a performance drop approaching expected chance levels (22.12%± SD 9.1%; Figure 3 – figure supplement 3C).”

      Results (lines 261-264):

      “As expected, the 5-class hybrid-space decoder performance approached chance levels when tested with randomly shuffled keypress labels (18.41%± SD 7.4% for Day 1 data; Figure 4 – figure supplement 3C).”
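
      For readers who wish to reproduce this kind of control, the logic can be sketched as follows. This is a minimal illustration on synthetic data, assuming a simple nearest-centroid classifier as a stand-in for the hybrid-space decoder; the helper names and the toy features are hypothetical. As in the quoted Results, only the labels of the held-out test data are shuffled.

```python
import random

def nearest_centroid_fit(X, y):
    """Return {label: centroid} computed from the training features."""
    sums, counts = {}, {}
    for row, label in zip(X, y):
        acc = sums.setdefault(label, [0.0] * len(row))
        for i, v in enumerate(row):
            acc[i] += v
        counts[label] = counts.get(label, 0) + 1
    return {lab: [s / counts[lab] for s in acc] for lab, acc in sums.items()}

def nearest_centroid_predict(centroids, X):
    """Assign each row to the label of the closest centroid."""
    def dist2(a, b):
        return sum((p - q) ** 2 for p, q in zip(a, b))
    return [min(centroids, key=lambda lab: dist2(centroids[lab], row))
            for row in X]

def accuracy(y_true, y_pred):
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def shuffled_label_accuracy(centroids, X_test, y_test, n_perm=200, seed=0):
    """Mean accuracy after randomly permuting the held-out test labels."""
    rng = random.Random(seed)
    y_pred = nearest_centroid_predict(centroids, X_test)
    scores = []
    for _ in range(n_perm):
        y_shuf = y_test[:]
        rng.shuffle(y_shuf)
        scores.append(accuracy(y_shuf, y_pred))
    return sum(scores) / len(scores)

def make_data(n_per_class, n_classes, rng):
    """Synthetic 2-D features with well-separated class means."""
    X, y = [], []
    for c in range(n_classes):
        for _ in range(n_per_class):
            X.append([rng.gauss(3.0 * c, 1.0), rng.gauss(-3.0 * c, 1.0)])
            y.append(c)
    return X, y

rng = random.Random(1)
X_train, y_train = make_data(50, 4, rng)
X_test, y_test = make_data(50, 4, rng)

centroids = nearest_centroid_fit(X_train, y_train)
true_acc = accuracy(y_test, nearest_centroid_predict(centroids, X_test))
chance_acc = shuffled_label_accuracy(centroids, X_test, y_test)
```

      With four balanced classes the shuffled-label accuracy should settle near the theoretical chance level of 25%, while the accuracy on correctly labeled test data stays high. The empirical spread across permutations also gives an estimate of actual chance-level variability.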

      Furthermore, the authors' approach to cortical parcellation raises questions regarding the information carried by varying dipole orientations within a parcel (which currently seems to be ignored?) and the implementation of the mean-flipping method (given that there are two dimensions - space and time - it is unclear what the authors refer to when they talk about the sign of the "average source", line 477). 

      The revised manuscript now provides a more detailed explanation of the parcellation, and sign-flipping procedures implemented:

      Methods (lines 604-611):

      “Source-space parcellation was carried out by averaging all voxel time-series located within distinct anatomical regions defined in the Desikan-Killiany Atlas [31]. Since source time-series estimated with beamforming approaches are inherently sign-ambiguous, a custom Matlab-based implementation of the mne.extract_label_time_course with “mean_flip” sign-flipping procedure in MNE-Python [78] was applied prior to averaging to prevent within-parcel signal cancellation. All voxel time-series within each parcel were extracted and the time-series sign was flipped at locations where the orientation difference was greater than 90° from the parcel mode. A mean time-series was then computed across all voxels within the parcel after sign-flipping.”
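
      The sign-flipping step described above can be sketched in a few lines. This is an illustrative simplification, not the authors' Matlab implementation and not MNE-Python's own code: it assumes the reference orientation of a parcel is the sum of its voxel orientation vectors, and flips any voxel whose orientation differs from that reference by more than 90° (negative dot product). The function name `mean_flip` and the toy inputs are hypothetical.

```python
def mean_flip(time_series, orientations):
    """Average voxel time-series within a parcel after sign-flipping.

    time_series:  list of per-voxel time-series (equal lengths)
    orientations: list of per-voxel orientation vectors
    """
    n_dim = len(orientations[0])
    # Assumed reference direction: summed orientation across the parcel.
    # (This degenerates if orientations cancel exactly.)
    ref = [sum(o[d] for o in orientations) for d in range(n_dim)]
    # A negative dot product means the angle to the reference exceeds 90 deg.
    signs = [1.0 if sum(a * b for a, b in zip(o, ref)) >= 0 else -1.0
             for o in orientations]
    n_vox = len(time_series)
    n_times = len(time_series[0])
    return [sum(signs[v] * time_series[v][t] for v in range(n_vox)) / n_vox
            for t in range(n_times)]

# Toy parcel: the third voxel has an opposite orientation, so its
# time-series is the sign-inverted copy of the shared signal.
signal = [1.0, -2.0, 3.0]
time_series = [signal, signal, [-x for x in signal]]
orientations = [[0.0, 0.0, 1.0], [0.0, 0.1, 1.0], [0.0, 0.0, -1.0]]

flipped_mean = mean_flip(time_series, orientations)
naive_mean = [sum(col) / len(col) for col in zip(*time_series)]
```

      Without the flip, the inverted voxel partially cancels the other two and the parcel average shrinks to one third of the shared signal; with the flip, the average recovers the signal.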

      Recommendations for the authors: 

      Reviewer #1 (Recommendations for the authors): 

      Comments on the revision: 

      The authors have made large efforts to address all concerns raised. A couple of suggestions remain: 

      - formally show if and how movement artefacts may contribute to the signal and analysis; it seems that the authors have data to allow for such an analysis  

      We have implemented the requested control analyses addressing this issue. They are reported in the Results (lines 207-211 and 261-268) and Discussion (lines 362-368).

      - formally show that the signals from the intra- and inter parcel spaces are orthogonal. 

      Please note that, despite the Reviewer’s statement above, we never claim in the manuscript that the parcel-space and regional voxel-space features show “complete independence”. 

      Furthermore, the machine learning-based decoding methods used in the present study do not require input feature orthogonality, but instead non-redundancy [7], which is a requirement satisfied by our data (see below and the new Figure 2 – figure supplement 2 in the revised manuscript). Finally, our results already show that the hybrid space decoder outperformed all other methods even after input features were fully orthogonalized with LDA or PCA dimensionality reduction procedures prior to the classification step (Figure 3 – figure supplement 2).

      We also highlight several additional results that are informative regarding this issue. For example, if spatially overlapping parcel- and voxel-space time-series only provided redundant information, inclusion of both as input features should increase model overfitting to the training dataset and decrease overall cross-validated test accuracy [8]. In the present study, however, we see the opposite effect on decoder performance. First, Figure 3 – figure supplements 1 & 2 clearly show that decoders constructed from hybrid-space features outperform the other input feature (sensor-, whole-brain parcel- and whole-brain voxel-) spaces in every case (e.g. – wideband, all narrowband frequency ranges, and even after the input space is fully orthogonalized through dimensionality reduction procedures prior to the decoding step). Furthermore, Figure 3 – figure supplement 6 shows that hybrid-space decoder performance suffers when parcel time-series that spatially overlap with the included regional voxel-spaces are removed from the input feature set. We state in the Discussion (lines 353-356):

      “The observation of increased cross-validated test accuracy (as shown in Figure 3 – Figure Supplement 6) indicates that the spatially overlapping information in parcel- and voxel-space time-series in the hybrid decoder was complementary, rather than redundant [41].”

      To gain insight into the complementary information contributed by the two spatial scales to the hybrid-space decoder, we first independently computed the matrix rank for whole-brain parcel- and voxel-space input features for each participant (shown in Author response image 1). The results indicate that whole-brain parcel-space input features are full rank (rank = 148) for all participants (i.e. - MEG activity is linearly independent across all parcels). The matrix rank of voxel-space input features (rank = 267 ± 17 SD) exceeded the parcel-space rank for all participants and approached the number of useable MEG sensor channels (n = 272). Thus, voxel-space features provide both additional and complementary information to representations at the parcel-space scale.
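
      These rank values can be read against a standard linear-algebra bound. The sketch below assumes, as is typical for beamforming, that voxel time-series are obtained by applying a linear spatial filter W to the sensor data, and that parcel time-series are (sign-flipped) averages of voxel time-series, written here as a matrix A. The symbols W, A and X are introduced for illustration only and do not appear in the manuscript.

```latex
\begin{aligned}
X_{\mathrm{vox}} &= W X_{\mathrm{sens}}, &
\operatorname{rank}(X_{\mathrm{vox}}) &\le \operatorname{rank}(X_{\mathrm{sens}}) \le n_{\mathrm{sens}} = 272, \\
X_{\mathrm{parc}} &= A X_{\mathrm{vox}}, &
\operatorname{rank}(X_{\mathrm{parc}}) &\le \min\bigl(n_{\mathrm{parc}},\ \operatorname{rank}(X_{\mathrm{vox}})\bigr)
  = \min\bigl(148,\ \operatorname{rank}(X_{\mathrm{vox}})\bigr).
\end{aligned}
```

      Under these assumptions, the reported voxel-space rank (267 ± 17) sits just below the ceiling set by the number of sensor channels, and the parcel-space rank of 148 is the maximum attainable for 148 parcels.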

      Figure 2—figure Supplement 2 in the revised manuscript now shows that the degree of dependence between the two spatial scales varies over the regional voxel-space. That is, some voxels within a given parcel correlate strongly with the time-series of the parcel they belong to, while others do not. This finding is consistent with a documented increase in correlational structure of neural activity across spatial scales that does not reflect perfect dependency or orthogonality [9]. Notably, the regional voxel-spaces included in the hybrid-space decoder are significantly less correlated with the averaged parcel-space time-series than excluded voxels. We now point readers to this new figure in the Results.

      Taken together, these results indicate that the multi-scale information in the hybrid feature set is complementary rather than orthogonal. This is consistent with the idea that hybrid-space features better represent multi-scale temporospatial dynamics reported to be a fundamental characteristic of how the brain stores and adapts memories, and generates behavior across species [9].

      Reviewer #2 (Recommendations for the authors):  

      I appreciate the authors' efforts in addressing the concerns I raised. The responses generally made sense to me. However, I had some trouble finding several corrections/additions that the authors claim they made in the revised manuscript: 

      "We addressed this question by conducting a new multivariate regression analysis to directly assess whether the neural representation distance score could be predicted by the 4-1, 2-4, and 4-4 keypress transition times observed for each complete correct sequence (both predictor and response variables were z-score normalized within-subject). The results of this analysis also affirmed that the possible alternative explanation that contextualization effects are simple reflections of increased mixing is not supported by the data (Adjusted R<sup>2</sup> = 0.00431; F = 5.62).  We now include this new negative control analysis in the revised manuscript."  

      This approach is now reported in the manuscript in the Results (lines 324-328) and the Figure 5 – figure supplement 6 legend.

      "We strongly agree with the Reviewer that the issue of generalizability is extremely important and have added a new paragraph to the Discussion in the revised manuscript highlighting the strengths and weaknesses of our study with respect to this issue." 

      Discussion (Lines 436-441)

      “One limitation of this study is that contextualization was investigated for only one finger movement (index finger or digit 4) embedded within a relatively short 5-item skill sequence. Determining if representational contextualization is exhibited across multiple finger movements embedded within for example longer sequences (e.g. – two index finger and two little finger keypresses performed within a short piece of piano music) will be an important extension to the present results.”

      "We strongly agree with the Reviewer that any intended clinical application must carefully consider the specific input feature constraints dictated by the clinical cohort, and in turn impose appropriate and complementary constraints on classifier parameters that may differ from the ones used in the present study. We now highlight this issue in the Discussion of the revised manuscript and relate our present findings to published clinical BCI work within this context."  

      Discussion (Lines 441-444)

      “While a supervised manifold learning approach (LDA) was used here because it optimized hybrid-space decoder performance, unsupervised strategies (e.g. - PCA and MDS, which also substantially improved decoding accuracy in the present study; Figure 3 – figure supplement 2) are likely more suitable for real-time BCI applications.”

      and 

      "The Reviewer makes a good point. We have now implemented the suggested normalization procedure in the analysis provided in the revised manuscript." 

      Results (lines 275-282)

      “We used a Euclidian distance measure to evaluate the differentiation of the neural representation manifold of the same action (i.e. - an index-finger keypress) executed within different local sequence contexts (i.e. - ordinal position 1 vs. ordinal position 5; Figure 5). To make these distance measures comparable across participants, a new set of classifiers was then trained with group-optimal parameters (i.e. – broadband hybrid-space MEG data with subsequent manifold extraction (Figure 3 – figure supplement 2) and LDA classifiers (Figure 3 – figure supplement 7) trained on 200ms duration windows aligned to the KeyDown event (see Methods, Figure 3 – figure supplement 5)).”

      Where are they in the manuscript? Did I read the wrong version? It would be more helpful to specify with page/line numbers. Please also add the detailed procedure of the control/additional analyses in the Method. 

      As requested, we now refer to all manuscript revisions with specific line numbers. We have also included all detailed procedures related to any additional analyses requested by reviewers.

      I also have a few other comments back to the authors' following responses: 

      "Thus, increased overlap between the "4" and "1" keypresses (at the start of the sequence) and "2" and "4" keypresses (at the end of the sequence) could artefactually increase contextualization distances even if the underlying neural representations for the individual keypresses remain unchanged. One must also keep in mind that since participants repeat the sequence multiple times within the same trial, a majority of the index finger keypresses are performed adjacent to one another (i.e. - the "4-4" transition marking the end of one sequence and the beginning of the next). Thus, increased overlap between consecutive index finger keypresses as typing speed increased should increase their similarity and mask contextualization- related changes to the underlying neural representations."  "We also re-examined our previously reported classification results with respect to this issue. 

      We reasoned that if mixing effects reflecting the ordinal sequence structure is an important driver of the contextualization finding, these effects should be observable in the distribution of decoder misclassifications. For example, "4" keypresses would be more likely to be misclassified as "1" or "2" keypresses (or vice versa) than as "3" keypresses. The confusion matrices presented in Figures 3C and 4B and Figure 3-figure supplement 3A display a distribution of misclassifications that is inconsistent with an alternative mixing effect explanation of contextualization." 

      "Based upon the increased overlap between adjacent index finger keypresses (i.e. - "4-4" transition), we also reasoned that the decoder tasked with separating individual index finger keypresses into two distinct classes based upon sequence position, should show decreased performance as typing speed increases. However, Figure 4C in our manuscript shows that this is not the case. The 2-class hybrid classifier actually displays improved classification performance over early practice trials despite greater temporal overlap. Again, this is inconsistent with the idea that the contextualization effect simply reflects increased mixing of individual keypress features."  

      As the time window for MEG feature is defined after the onset of each press, it is more likely that the feature overlap is the current and the future presses, rather than the current and the past presses (of course the three will overlap at very fast typing speed). Therefore, for sequence 41324, if we note the planning-related processes by a Roman numeral, the overlapping features would be '4i', '1iii', '3ii', '2iv', and '4iv'. Assuming execution-related process (e.g., 1) and planning-related process (e.g., i) are not necessarily similar, especially in finer temporal resolution, the patterns for '4i' and '4iv' are well separated in terms of process 'i' and 'iv,' and this advantage will be larger in faster typing speed. This also applies to the other presses. Thus, the author's arguments about the masking of contextualization and misclassification due to pattern overlap seem odd. The most direct and probably easiest way to resolve this would be to use a shorter time window for the MEG feature. Some decrease in decoding accuracy in this case is totally acceptable for the science purpose.  

      The revised manuscript now includes analyses carried out with decoding time windows ranging from 50 to 250ms in duration. These additional results are now reported in:

      Results (lines 258-268):

      “The improved decoding accuracy is supported by greater differentiation in neural representations of the index finger keypresses performed at positions 1 and 5 of the sequence (Figure 4A), and by the trial-by-trial increase in 2-class decoding accuracy over early learning (Figure 4C) across different decoder window durations (Figure 4 – figure supplement 2). As expected, the 5-class hybrid-space decoder performance approached chance levels when tested with randomly shuffled keypress labels (18.41%± SD 7.4% for Day 1 data; Figure 4 – figure supplement 3C). Task-related eye movements did not explain these results since an alternate 5-class hybrid decoder constructed from three eye movement features (gaze position at the KeyDown event, gaze position 200ms later, and peak eye movement velocity within this window; Figure 4 – figure supplement 3A) performed at chance levels (cross-validated test accuracy = 0.2181; Figure 4 – figure supplement 3B, C).”

      Results (lines 310-316):

      “Offline contextualization strongly correlated with cumulative micro-offline gains (r = 0.903, R² = 0.816, p < 0.001; Figure 5 – figure supplement 1A, inset) across decoder window durations ranging from 50 to 250ms (Figure 5 – figure supplement 1B, C). The offline contextualization between the final sequence of each trial and the second sequence of the subsequent trial (excluding the first sequence) yielded comparable results. This indicates that pre-planning at the start of each practice trial did not directly influence the offline contextualization measure [30] (Figure 5 – figure supplement 2A, 1st vs. 2nd Sequence approaches).”

      Discussion (lines 380-385):

      “The first hint of representational differentiation was the highest false-negative and lowest false-positive misclassification rates for index finger keypresses performed at different locations in the sequence compared with all other digits (Figure 3C). This was further supported by the progressive differentiation of neural representations of the index finger keypress (Figure 4A) and by the robust trial-by-trial increase in 2-class decoding accuracy across time windows ranging between 50 and 250ms (Figure 4C; Figure 4 – figure supplement 2).”

      Discussion (lines 408-9):

      “Offline contextualization consistently correlated with early learning gains across a range of decoding windows (50–250ms; Figure 5 – figure supplement 1).”

      "We addressed this question by conducting a new multivariate regression analysis to directly assess whether the neural representation distance score could be predicted by the 4-1, 2-4 and 4-4 keypress transition times observed for each complete correct sequence" 

      For regression analysis, I recommend to use total keypress time per a sequence (or sum of 4-1 and 4-4) instead of specific transition intervals, because there likely exist specific correlational structure across the transition intervals. Using correlated regressors may distort the result.  

      This approach is now reported in the manuscript:

      Results (lines 324-328) and the Figure 5 – figure supplement 6 legend.
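
      As a quick consistency check on the regression statistics quoted earlier in this response for Figure 5 – figure supplement 6 (R<sup>2</sup> = 0.00507, F[1,3202] = 16.3), the F statistic of a linear regression can be recovered from R<sup>2</sup> and the two degrees of freedom using the standard identity F = (R<sup>2</sup>/df<sub>model</sub>) / ((1 − R<sup>2</sup>)/df<sub>residual</sub>). The helper name `f_from_r2` below is hypothetical.

```python
def f_from_r2(r2, df_model, df_resid):
    """F statistic implied by R^2 for a linear regression."""
    return (r2 / df_model) / ((1.0 - r2) / df_resid)

# Values as reported for the keypress transition time regression.
r2 = 0.00507
f_stat = f_from_r2(r2, df_model=1, df_resid=3202)

# Share of variance in the distance score accounted for by the predictor.
variance_explained_pct = 100.0 * r2
```

      The implied F value agrees with the reported 16.3 to rounding, while the predictor accounts for only about 0.5% of the variance in the neural representation distance score.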

      "We do agree with the Reviewer that the naturalistic, generative, self-paced task employed in the present study results in overlapping brain processes related to planning, execution, evaluation and memory of the action sequence. We also agree that there are several tradeoffs to consider in the construction of the classifiers depending on the study aim. Given our aim of optimizing keypress decoder accuracy in the present study, the set of tradeoffs resulted in representations reflecting more the latter three processes, and less so the planning component. Whether separate decoders can be constructed to tease apart the representations or networks supporting these overlapping processes is an important future direction of research in this area. For example, work presently underway in our lab constrains the selection of windowing parameters in a manner that allows individual classifiers to be temporally linked to specific planning, execution, evaluation or memory-related processes to discern which brain networks are involved and how they adaptively reorganize with learning. Results from the present study (Figure 4-figure supplement 2) showing hybrid-space decoder prediction accuracies exceeding 74% for temporal windows spanning as little as 25ms and located up to 100ms prior to the KeyDown event strongly support the feasibility of such an approach." 

      I recommend that the authors add this paragraph or a paragraph like this to the Discussion. This perspective is very important and still missing in the revised manuscript. 

      We have now included the following sections in the manuscript to address this point:

      Discussion (lines 334-338)

      “The main findings of this study during which subjects engaged in a naturalistic, self-paced task were that individual sequence action representations differentiate during early skill learning in a manner reflecting the local sequence context in which they were performed, and that the degree of representational differentiation— particularly prominent over rest intervals—correlated with skill gains.”

      Discussion (lines 428-434)

      “In this study, classifiers were trained on MEG activity recorded during or immediately after each keypress, emphasizing neural representations related to action execution, memory consolidation and recall over those related to planning. An important direction for future research is determining whether separate decoders can be developed to distinguish the representations or networks separately supporting these processes. Ongoing work in our lab is addressing this question. The present accuracy results across varied decoding window durations and alignment with each keypress action support the feasibility of this approach (Figure 3—figure supplement 5).”

      "The rapid initial skill gains that characterize early learning are followed by micro-scale fluctuations around skill plateau levels (i.e. following trial 11 in Figure 1B)"  Is this a mention of Figure 1 Supplement 1 A?  

      The sentence was replaced with the following: Results (lines 108-110)

      “Participants reached 95% of maximal skill (i.e. - Early Learning) within the initial 11 practice trials (Figure 1B), with improvements developing over inter-practice rest periods (micro-offline gains) accounting for almost all total learning across participants (Figure 1B, inset) [1].”

      The citation below seems to have been selected by mistake; 

      "9. Chen, S. & Epps, J. Using task-induced pupil diameter and blink rate to infer cognitive load. Hum Comput Interact 29, 390-413 (2014)." 

      We thank the Reviewer for bringing this mistake to our attention. This citation has now been corrected.

      Reviewer #3 (Recommendations for the authors):  

      The authors write in their response that "We now provide additional details in the Methods of the revised manuscript pertaining to the parcellation procedure and how the sign ambiguity problem was addressed in our analysis." I could not find anything along these lines in the (redlined) version of the manuscript and therefore did not change the corresponding comment in the public review.  

      The revised manuscript now provides a more detailed explanation of the parcellation, and sign-flipping procedure implemented:

      Methods (lines 604-611):

      “Source-space parcellation was carried out by averaging all voxel time-series located within distinct anatomical regions defined in the Desikan-Killiany Atlas [31]. Since source time-series estimated with beamforming approaches are inherently sign-ambiguous, a custom Matlab-based implementation of the “mean_flip” sign-flipping procedure of mne.extract_label_time_course in MNE-Python [78] was applied prior to averaging to prevent within-parcel signal cancellation. All voxel time-series within each parcel were extracted and the time-series sign was flipped at locations where the orientation difference was greater than 90° from the parcel mode. A mean time-series was then computed across all voxels within the parcel after sign-flipping.”
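The sign-flipping step quoted above can be illustrated with a minimal NumPy sketch. This is not the authors' Matlab code. The function name `mean_flip`, the array layout, and the use of an SVD to estimate the dominant parcel orientation (standing in for the "parcel mode") are all assumptions made for illustration.

```python
import numpy as np

def mean_flip(timeseries, orientations):
    """Average voxel time-series within one parcel after sign alignment.

    timeseries:   (n_voxels, n_times) source estimates for the parcel
    orientations: (n_voxels, 3) unit orientation vector per voxel
    Both the layout and the SVD-based dominant direction are assumptions.
    """
    # Dominant orientation of the parcel: first right singular vector.
    _, _, vt = np.linalg.svd(orientations, full_matrices=False)
    dominant = vt[0]
    # A voxel more than 90 degrees away from the dominant direction
    # has a negative dot product with it and gets its sign flipped.
    flip = np.sign(orientations @ dominant)
    flip[flip == 0] = 1.0
    # The singular vector's sign is arbitrary; leave the majority unflipped.
    if flip.sum() < 0:
        flip = -flip
    return (flip[:, None] * timeseries).mean(axis=0)

# Toy parcel: two voxels share an orientation, the third points the other way.
signal = np.sin(np.linspace(0, 2 * np.pi, 50))
ori = np.array([[0.0, 0.0, 1.0], [0.0, 0.0, 1.0], [0.0, 0.0, -1.0]])
ts = np.vstack([signal, signal, -signal])

naive = ts.mean(axis=0)          # partial cancellation across voxels
aligned = mean_flip(ts, ori)     # cancellation avoided by sign alignment
```

In the toy parcel the plain average is attenuated because the oppositely oriented voxel cancels part of the signal, whereas the sign-aligned average recovers it.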

      The control analysis based on a multivariate regression that assessed whether the neural representation distance score could be predicted by the 4-1, 2-4 and 4-4 keypress transition times, as briefly mentioned in the authors' responses to Reviewer 2 and myself, was not included in the manuscript and could not be sufficiently evaluated. 

      This approach is now reported in the manuscript: Results (Lines 324-328) and Figure 5-Figure Supplement 6 legend.

      The authors argue that differences in the design between Das et al. (2024) on the one hand (Experiments 1 and 2), and the study by Bönstrup et al. (2019) on the other hand, may have prevented Das et al. (2024) from finding the assumed learning benefit by micro-offline consolidation. However, the Supplementary Material of Das et al. (2024) includes an experiment (Experiment S1) whose design closely follows a large proportion of the early learning phase of Bönstrup et al. (2019), and which, nevertheless, demonstrates that there is no lasting benefit of taking breaks with respect to the acquired skill level, despite the presence of micro-offline gains.  

      We thank the Reviewer for alerting us to these new data added to the revised supplementary materials of Das et al. (2024) posted to bioRxiv. However, a careful comparison of the Das et al. and Bönstrup et al. studies reveals more substantive differences than similarities, and we do not agree that Experiment S1 “closely follows a large proportion of the early learning phase of Bönstrup et al. (2019)”.

      In the Das et al. Experiment S1, sixty-two participants were randomly assigned to “with breaks” or “no breaks” skill training groups. The “with breaks” group alternated 10 seconds of skill sequence practice with 10 seconds of rest over seven trials (2 min and 20 sec total training duration). This amounts to 66.7% of the early learning period defined by Bönstrup et al. (2019) (i.e. - eleven 10-second long practice periods interleaved with ten 10-second long rest breaks; 3 min 30 sec total training duration). Also, please note that while neither performance feedback nor reward was given in the Bönstrup et al. (2019) study, participants in the Das et al. study received explicit performance-based monetary rewards, a potentially crucial driver of differentiated behavior between the two studies:

      “Participants were incentivized with bonus money based on the total number of correct sequences completed throughout the experiment.”

      The “no breaks” group in the Das et al. study practiced the skill sequence for 70 continuous seconds. Both groups (despite one being labeled “no breaks”) follow training with a long 3-minute break (also note that since the “with breaks” group ends with 10 seconds of rest their break is actually longer), before finishing with a skill “test” over a continuous 50-second-long block. During the 70 seconds of training, the “with breaks” group shows more learning than the “no breaks” group. Interestingly, following the long 3-minute break the “with breaks” group displays a performance drop (relative to their performance at the end of training) that is stable over the full 50-second test, while the “no breaks” group shows an immediate performance improvement following the long break that continues to increase over the 50-second test.

      Separately, there are important issues regarding the Das et al study that should be considered through the lens of recent findings not referred to in the preprint. A major element of their experimental design is that both groups—“with breaks” and “no breaks”— actually receive quite a long 3-minute break just before the skill test. This long break is more than 2.5x the cumulative interleaved rest experienced by the “with breaks” group. Thus, although the design is intended to contrast the presence or absence of rest “breaks”, that difference between groups is no longer maintained at the point of the skill test. 
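The durations compared above can be checked with a few lines of arithmetic. The sketch below assumes, as described in this response, seven 10-second practice periods each followed by 10 seconds of rest for the “with breaks” group, and eleven 10-second practice periods interleaved with ten 10-second rests for the early learning period of Bönstrup et al. (2019). The variable names are illustrative only.

```python
PRACTICE_S = 10  # seconds per practice period
REST_S = 10      # seconds per interleaved rest period

# Das et al. Experiment S1, "with breaks" group: seven practice/rest pairs.
das_total_s = 7 * (PRACTICE_S + REST_S)
das_rest_s = 7 * REST_S

# Bönstrup et al. (2019) early learning: eleven practice periods, ten rests.
bonstrup_total_s = 11 * PRACTICE_S + 10 * REST_S

# Long break given to both groups before the skill test.
long_break_s = 3 * 60

fraction_of_early_learning = das_total_s / bonstrup_total_s
break_to_rest_ratio = long_break_s / das_rest_s

print(f"Das 'with breaks' training: {das_total_s} s")
print(f"Bönstrup early learning:    {bonstrup_total_s} s")
print(f"Fraction covered:           {fraction_of_early_learning:.1%}")
print(f"Long break / cumulative interleaved rest: {break_to_rest_ratio:.2f}x")
```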

      The Das et al. results are most consistent with an alternative interpretation of the data: that the “no breaks” group experiences offline learning during their long 3-minute break. This is supported by the recent work of Griffin et al. (2025) where micro-array recordings from primary and premotor cortex were obtained from macaque monkeys while they performed blocks of ten continuous reaching sequences up to 81.4 seconds in duration (see source data for Extended Data Figure 1h) with 90 seconds of interleaved rest. Griffin et al. observed offline improvement in skill immediately following the rest break that was causally related to neural reactivations (i.e. – neural replay) that occurred during the rest break. Importantly, the highest density of reactivations was present in the very first 90-second break between Blocks 1 and 2 (see Fig. 2f in Griffin et al., 2025). This supports the interpretation that both the “with breaks” and “no breaks” groups express offline learning gains, with these gains being delayed in the “no breaks” group due to the practice schedule.

      On the other hand, if offline learning can occur during this longer break, then why would the “with breaks” group show no benefit? Again, it could be that most of the offline gains for this group were front-loaded during the seven shorter 10-second rest breaks. Another possible, though not mutually exclusive, explanation is that the observed drop in performance in the “with breaks” group is driven by contextual interference. Specifically, similar to Experiments 1 and 2 in Das et al. (2024), the skill test is conducted under very different conditions than those under which the “with breaks” group practiced the skill (short bursts of practice alternating with equally short breaks). In contrast, the “no breaks” group is tested (50 seconds of continuous practice) under quite similar conditions to their training schedule (70 seconds of continuous practice). Thus, it is possible that this dissimilarity between training and test could lead to reduced performance in the “with breaks” group.

      We made the following manuscript revisions related to these important issues: 

      Introduction (Lines 26-56)

      “Practicing a new motor skill elicits rapid performance improvements (early learning) [1] that precede skill performance plateaus [5]. Skill gains during early learning accumulate over rest periods (micro-offline) interspersed with practice [1, 6-10], and are up to four times larger than offline performance improvements reported following overnight sleep [1]. During this initial interval of prominent learning, retroactive interference immediately following each practice interval reduces learning rates relative to interference after passage of time, consistent with stabilization of the motor memory [11]. Micro-offline gains observed during early learning are reproducible [7, 10-13] and are similar in magnitude even when practice periods are reduced by half to 5 seconds in length, thereby confirming that they are not merely a result of recovery from performance fatigue [11]. Additionally, they are unaffected by the random termination of practice periods, which eliminates the possibility of predictive motor slowing as a contributing factor [11]. Collectively, these behavioral findings point towards the interpretation that micro-offline gains during early learning represent a form of memory consolidation [1].

      This interpretation has been further supported by brain imaging and electrophysiological studies linking known memory-related networks and consolidation mechanisms to rapid offline performance improvements. In humans, the rate of hippocampo-neocortical neural replay predicts micro-offline gains [6]. Consistent with these findings, Chen et al. [12] and Sjøgård et al. [13] furnished direct evidence from intracranial human EEG studies, demonstrating a connection between the density of hippocampal sharp-wave ripples (80-120 Hz)—recognized markers of neural replay—and micro-offline gains during early learning. Further, Griffin et al. reported that neural replay of task-related ensembles in the motor cortex of macaques during brief rest periods— akin to those observed in humans [1, 6-8, 14]—are not merely correlated with, but are causal drivers of micro-offline learning [15]. Specifically, the same reach directions that were replayed the most during rest breaks showed the greatest reduction in path length (i.e. – more efficient movement path between two locations in the reach sequence) during subsequent trials, while stimulation applied during rest intervals preceding performance plateau reduced reactivation rates and virtually abolished micro-offline gains [15]. Thus, converging evidence in humans and non-human primates across indirect non-invasive and direct invasive recording techniques link hippocampal activity, neural replay dynamics and offline skill gains in early motor learning that precede performance plateau.”

      Next, in the Methods, we articulate important constraints formulated by Pan and Rickard (2015) and Bönstrup et al. (2019) for meaningful measurements:

      Methods (Lines 493-499)

      “The study design followed specific recommendations by Pan and Rickard (2015): 1) utilizing 10-second practice trials and 2) constraining analysis of micro-offline gains to early learning trials (where performance monotonically increases and 95% of overall performance gains occur) that precede the emergence of “scalloped” performance dynamics strongly linked to reactive inhibition effects ([29, 72]). This is precisely the portion of the learning curve Pan and Rickard referred to when they stated “…rapid learning during that period masks any reactive inhibition effect” [29].”

      We finally discuss the implications of neglecting some or all of these recommendations:

      Discussion (Lines 444-452):

      “Finally, caution should be exercised when extrapolating findings during early skill learning, a period of steep performance improvements, to findings reported after insufficient practice [67], post-plateau performance periods [68], or non-learning situations (e.g. performance of non-repeating keypress sequences in  [67]) when reactive inhibition or contextual interference effects are prominent. Ultimately, it will be important to develop new paradigms allowing one to independently estimate the different coincident or antagonistic features (e.g. - memory consolidation, planning, working memory and reactive inhibition) contributing to micro-online and micro-offline gains during and after early skill learning within a unifying framework.”

      Personally, given that the idea of (micro-offline) consolidation seems to attract a lot of interest (and therefore cause a lot of future effort/cost public money) in the scientific community, I would find it extremely important to be cautious in interpreting results in this field. For me, this would include abstaining from the claim that processes occur "during" a rest period (see abstract, for example), given that micro-offline gains (as well as offline contextualization) are computed from data obtained during practice, not rest, and may, thus, just as well reflect a change that occurs "online", e.g., at the very onset of practice (like pre-planning) or throughout practice (like fatigue, or reactive inhibition). In addition, I would suggest to discuss in more depth the actual evidence not only in favour, but also against, the assumption of micro-offline gains as a phenomenon of learning.  

      We agree with the reviewer that caution is warranted. Based upon these suggestions, we have now expanded the manuscript to very clearly define the experimental constraints under which different groups have successfully studied micro-offline learning and its mechanisms, the impact of fatigue/reactive inhibition on micro-offline performance changes unrelated to learning, as well as the interpretation problems that emerge when those recommendations are not followed. 

      We clearly articulate the crucial constraints recommended by Pan and Rickard (2015) and Bönstrup et al. (2019) for meaningful measurements and interpretation of offline gains in the revised manuscript.

      Methods (Lines 493-499)

      “The study design followed specific recommendations by Pan and Rickard (2015): 1) utilizing 10-second practice trials and 2) constraining analysis of micro-offline gains to early learning trials (where performance monotonically increases and 95% of overall performance gains occur) that precede the emergence of “scalloped” performance dynamics strongly linked to reactive inhibition effects ( [29, 72]). This is precisely the portion of the learning curve Pan and Rickard referred to when they stated “…rapid learning during that period masks any reactive inhibition effect” [29].”

      In the Introduction, we review the extensive evidence emerging from LFP and microelectrode recordings in humans and monkeys (including causality of neural replay with respect to micro-offline gains and early learning in the Griffin et al. Nature 2025 publication):

      Introduction (Lines 26-56)

      “Practicing a new motor skill elicits rapid performance improvements (early learning) [1] that precede skill performance plateaus [5]. Skill gains during early learning accumulate over rest periods (micro-offline) interspersed with practice [1, 6-10], and are up to four times larger than offline performance improvements reported following overnight sleep [1]. During this initial interval of prominent learning, retroactive interference immediately following each practice interval reduces learning rates relative to interference after passage of time, consistent with stabilization of the motor memory [11]. Micro-offline gains observed during early learning are reproducible [7, 10-13] and are similar in magnitude even when practice periods are reduced by half to 5 seconds in length, thereby confirming that they are not merely a result of recovery from performance fatigue [11]. Additionally, they are unaffected by the random termination of practice periods, which eliminates the possibility of predictive motor slowing as a contributing factor [11]. Collectively, these behavioral findings point towards the interpretation that micro-offline gains during early learning represent a form of memory consolidation [1].

      This interpretation has been further supported by brain imaging and electrophysiological studies linking known memory-related networks and consolidation mechanisms to rapid offline performance improvements. In humans, the rate of hippocampo-neocortical neural replay predicts micro-offline gains [6]. Consistent with these findings, Chen et al. [12] and Sjøgård et al. [13] furnished direct evidence from intracranial human EEG studies, demonstrating a connection between the density of hippocampal sharp-wave ripples (80-120 Hz)—recognized markers of neural replay—and micro-offline gains during early learning. Further, Griffin et al. reported that neural replay of task-related ensembles in the motor cortex of macaques during brief rest periods— akin to those observed in humans [1, 6-8, 14]—are not merely correlated with, but are causal drivers of micro-offline learning [15]. Specifically, the same reach directions that were replayed the most during rest breaks showed the greatest reduction in path length (i.e. – more efficient movement path between two locations in the reach sequence) during subsequent trials, while stimulation applied during rest intervals preceding performance plateau reduced reactivation rates and virtually abolished micro-offline gains [15]. Thus, converging evidence in humans and non-human primates across indirect non-invasive and direct invasive recording techniques link hippocampal activity, neural replay dynamics and offline skill gains in early motor learning that precede performance plateau.”

      Following the reviewer’s advice, we have expanded our discussion in the revised manuscript of alternative hypotheses put forward in the literature and call for caution when extrapolating results across studies with fundamental differences in design (e.g. – different practice and rest durations, or presence/absence of extrinsic reward, etc). 

      Discussion (Lines 444-452):

      “Finally, caution should be exercised when extrapolating findings during early skill learning, a period of steep performance improvements, to findings reported after insufficient practice [67], post-plateau performance periods [68], or non-learning situations (e.g. performance of non-repeating keypress sequences in  [67]) when reactive inhibition or contextual interference effects are prominent. Ultimately, it will be important to develop new paradigms allowing one to independently estimate the different coincident or antagonistic features (e.g. - memory consolidation, planning, working memory and reactive inhibition) contributing to micro-online and micro-offline gains during and after early skill learning within a unifying framework.”

      References

      (1) Zimerman, M., et al., Disrupting the Ipsilateral Motor Cortex Interferes with Training of a Complex Motor Task in Older Adults. Cereb Cortex, 2012.

      (2) Waters, S., T. Wiestler, and J. Diedrichsen, Cooperation Not Competition: Bihemispheric tDCS and fMRI Show Role for Ipsilateral Hemisphere in Motor Learning. J Neurosci, 2017. 37(31): p. 7500-7512.

      (3) Sawamura, D., et al., Acquisition of chopstick-operation skills with the nondominant hand and concomitant changes in brain activity. Sci Rep, 2019. 9(1): p. 20397.

      (4) Lee, S.H., S.H. Jin, and J. An, The difference in cortical activation pattern for complex motor skills: A functional near-infrared spectroscopy study. Sci Rep, 2019. 9(1): p. 14066.

      (5) Grafton, S.T., E. Hazeltine, and R.B. Ivry, Motor sequence learning with the nondominant left hand. A PET functional imaging study. Exp Brain Res, 2002. 146(3): p. 369-78.

      (6) Buch, E.R., et al., Consolidation of human skill linked to waking hippocampo-neocortical replay. Cell Rep, 2021. 35(10): p. 109193.

      (7) Wang, L. and S. Jiang, A feature selection method via analysis of relevance, redundancy, and interaction, in Expert Systems with Applications, Elsevier, Editor. 2021.

      (8) Yu, L. and H. Liu, Efficient feature selection via analysis of relevance and redundancy. Journal of Machine Learning Research, 2004. 5: p. 1205-1224.

      (9) Munn, B.R., et al., Multiscale organization of neuronal activity unifies scale-dependent theories of brain function. Cell, 2024.

      (10) Borragan, G., et al., Sleep and memory consolidation: motor performance and proactive interference effects in sequence learning. Brain Cogn, 2015. 95: p. 54-61.

      (11) Landry, S., C. Anderson, and R. Conduit, The effects of sleep, wake activity and time-on-task on offline motor sequence learning. Neurobiol Learn Mem, 2016. 127: p. 56-63.

      (12) Gabitov, E., et al., Susceptibility of consolidated procedural memory to interference is independent of its active task-based retrieval. PLoS One, 2019. 14(1): p. e0210876.

      (13) Pan, S.C. and T.C. Rickard, Sleep and motor learning: Is there room for consolidation? Psychol Bull, 2015. 141(4): p. 812-34.

      (14) Bönstrup, M., et al., A Rapid Form of Offline Consolidation in Skill Learning. Curr Biol, 2019. 29(8): p. 1346-1351 e4.

      (15) Gupta, M.W. and T.C. Rickard, Comparison of online, offline, and hybrid hypotheses of motor sequence learning using a quantitative model that incorporate reactive inhibition. Sci Rep, 2024. 14(1): p. 4661.

    1. Author response:

      The following is the authors’ response to the previous reviews

      Reviewer #1 (Public Review):

      Summary:

      This paper explores how diverse forms of inhibition impact firing rates in models for cortical circuits. In particular, the paper studies how the network operating point affects the balance of direct inhibition from SOM inhibitory neurons to pyramidal cells, and disinhibition from SOM inhibitory input to PV inhibitory neurons. This is an important issue as these two inhibitory pathways have largely been studied in isolation. Support for the main conclusions is generally solid, but could be strengthened by additional analyses.

      Strengths

      The paper has improved in revision, and the new intuitive summary statements added to the end of each results section are quite helpful.

      Weaknesses

      The concern about whether the results hold outside of the range in which neural responses are linear remains. This is particularly true given the discontinuity observed in the stability measure. I appreciate the concern (provided in the response to the first round of reviews) that studying nonlinear networks requires a lot of work. A more limited undertaking would be to test the behavior of a spiking network at a few key points identified by your linearization approach. Such tests could use relatively simple (and perhaps imperfect) measures of gain and stability. This could substantially enhance the paper, regardless of the outcome.

      We appreciate the reviewer’s concern. In our resubmission we explore whether networks operating outside the regime where linearization is possible continue to show our main result on the (dis)entanglement of stability and gain; the short answer is yes. To this end we have added a new section and figure to our main text.

      “Gain and stability in stochastically forced E – PV – SOM circuits

      To confirm that our results do not depend on our approach of a linearization around a fixed point, we numerically simulate networks similar to those shown above (Figure 2) in which the E and PV populations receive slowly varying, large amplitude noise (Figure 6A). This leads to noisy rate dynamics sampling a large subspace of the full firing rate grid (r<sub>E</sub>,r<sub>P</sub>) and thus any linearization would fail to describe the network response. In this stochastically forced network we explore how adding an SOM modulation or a stimulus affects this subspace (Figure 6B). To quantify stability without linearization, we assume that a network is more stable the lower the mean and variance of E rates. This is because very stable networks can better quench input fluctuations [Kanashiro et al., 2017; Hennequin et al., 2018]. To quantify gain, we calculate the change in E rates when adding the stimulus, while using identical noise realizations for stimulated and non-stimulated networks (Methods).

      For the disinhibitory network without feedback a positive SOM modulation decreases stability due to increases of the mean and variance of E rates (Figure 6Ci) while the network gain increases (Figure 6Cii). As seen before (Figure 2A,B), stability and gain change in opposite directions in a disinhibitory circuit without feedback. Adding feedback PV → SOM and applying a negative SOM modulation increases both stability and gain and therefore disentangles the inverse relation also in a noisy circuit (Figure 6D-F). This gives numerical support that our results do not depend on the assumption of linearization.”

      “Methods: Noisy input and numerical measurement of stability and gain

      We consider a temporally smoothed input process ξ<sub>X</sub> with white noise ζ (zero mean, standard deviation one): for populations X ∈ {E,P} with timescale τ<sub>ξ</sub> = 50 ms, σ<sub>X</sub> = 6 and fixed mean input I<sub>X</sub>. To quantify the stability of the network without linearization, we assume that a network is more stable if the mean and variance of excitatory rates are low. To quantify network gain, we freeze the white noise process ζ for the cases with and without stimulus presentation and calculate the difference of E rates at each time point, leading to a distribution of network gains (Figure 6Cii,Fii). Total simulation time is 1000 seconds.”
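The frozen-noise gain measurement described in the Methods excerpt above can be sketched as follows. This is a minimal illustration and not the authors' model: the two-population threshold-linear network, the weights `W`, the mean inputs, the rate time constant `tau`, and the Ornstein-Uhlenbeck form of the smoothed noise are all assumptions. Only τ<sub>ξ</sub> = 50 ms and σ = 6 are taken from the text, and the run is much shorter than the 1000 seconds reported.

```python
import numpy as np

def simulate(stim, noise, dt=1e-3, tau=0.01, tau_xi=0.05, sigma=6.0):
    """Euler simulation of a hypothetical E-P rate model with smoothed noise.

    stim:  extra input to the E population
    noise: (n_steps, 2) white-noise samples, one column per population
    Returns the E rate at every time step.
    """
    W = np.array([[1.0, -1.5],     # E <- E, E <- P (illustrative weights)
                  [1.2, -0.5]])    # P <- E, P <- P
    I = np.array([5.0 + stim, 3.0])  # fixed mean inputs; stimulus targets E
    r = np.zeros(2)
    xi = np.zeros(2)
    r_E = np.empty(len(noise))
    for t, zeta in enumerate(noise):
        # Temporally smoothed noise (assumed Ornstein-Uhlenbeck update).
        xi += -xi * dt / tau_xi + sigma * np.sqrt(2 * dt / tau_xi) * zeta
        # Threshold-linear rate dynamics.
        drive = W @ r + I + xi
        r += dt / tau * (-r + np.maximum(drive, 0.0))
        r_E[t] = r[0]
    return r_E

rng = np.random.default_rng(0)
zeta = rng.standard_normal((20000, 2))  # frozen noise, reused for both runs

r_stim = simulate(stim=1.0, noise=zeta)
r_base = simulate(stim=0.0, noise=zeta)

gain = r_stim - r_base                   # one gain value per time point
stability_proxy = (r_base.mean(), r_base.var())  # lower = more stable
print(gain.mean(), gain.std(), stability_proxy)
```

Because both runs reuse the same noise samples, the difference in E rates at each time point reflects only the stimulus, which gives a distribution of gains rather than a single linearized value.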

      We decided against using a spiking network because sufficiently asynchronous spiking network dynamics can still obey a linearized mean field theory (if the fluctuations in population firing rates are small). In our new analysis the firing rate deviations from the time averaged firing rate are sizable, making a linearization ineffective.

      In summary, based on our additional analysis of recurrent circuits with noisy inputs we conclude that our results also hold in fluctuating networks, without the need to assume linearization around a stable fixed point.

      Reviewer #2 (Public Review):

      Summary:

      Bos and colleagues address the important question of how two major inhibitory interneuron classes in the neocortex differentially affect cortical dynamics. They address this question by studying Wilson-Cowan-type mathematical models. Using a linearized fixed point approach, they provide convincing evidence that the existence of multiple interneuron classes can explain the counterintuitive finding that inhibitory modulation can increase the gain of the excitatory cell population while also increasing the stability of the circuit’s state to minor perturbations. This effect depends on the connection strengths within their circuit model, providing valuable guidance as to when and why it arises.

      Overall, I find this study to have substantial merit. I have some suggestions on how to improve the clarity and completeness of the paper.

      Strengths:

      (1) The thorough investigation of how changes in the connectivity structure affect the gain-stability relationship is a major strength of this work. It provides an opportunity to understand when and why gain and stability will or will not both increase together. It also provides a nice bridge to the experimental literature, where different gain-stability relationships are reported from different studies.

      (2) The simplified and abstracted mathematical model has the benefit of facilitating our understanding of this puzzling phenomenon. (I have some suggestions for how the authors could push this understanding further.) It is not easy to find the right balance between biologically-detailed models vs simple but mathematically tractable ones, and I think the authors struck an excellent balance in this study.

      We thank the reviewer for their support of our work.

      Weaknesses:

      (1) The fixed-point analysis has potentially substantial limitations for understanding cortical computations away from the steady-state. I think the authors should have emphasized this limitation more strongly and possibly included some additional analyses to show that their conclusions extend to the chaotic dynamical regimes in which cortical circuits often live.

      In the response to Reviewer 1 we have included model analyses that address the limitations of linearization. Rather than use a chaotic model, which would require significant effort, we opted for a stochastically forced network, where the sizable fluctuations in rate dynamics preclude linearization.

      (2) The authors could have discussed – even somewhat speculatively – how VIP interneurons fit into this picture. Their absence from this modelling framework stands out as a missed opportunity.

      We agree that including VIP neurons into the framework would be an obvious and potentially interesting next step. At this point we only include them as potential modulators of SOM neurons. Modeling their dynamics without them receiving inputs from E, PV, or SOM neurons would be uninteresting. However, including them properly into the circuit would be outside the scope of the paper.

      (3) The analysis is limited to paths within this simple E, PV, SOM circuit. This misses more extended paths (like thalamocortical loops) that involve interactions between multiple brain areas. Including those paths in the expansion in Eqs. 11-14 (Fig. 1C) may be an important consideration.

      We agree that our pathway expansion can be used to study more than just the E – PV – SOM circuit. However, properly investigating full thalamocortical loops should be done in a subsequent study.

      Comments on revisions:

      I think the authors have done a reasonable job of responding to my critiques, and the paper is in pretty good shape. (Also, thanks for correctly inferring that I meant VIP interneurons when I had written SST in my review! I have updated the public review accordingly.)

      I still think this line of research would benefit substantially from considering dynamic regimes including chaotic ones. I strongly encourage the authors to consider such an extension in future work.

      Please see our response above to Reviewer 1.

      Reviewer #3 (Public Review):

      Summary:

      Bos et al. study a computational model of cortical circuits with excitatory (E) neurons and two subtypes of inhibitory interneurons, parvalbumin (PV)- and somatostatin (SOM)-expressing. They perform stability and gain analysis of simplified models with nonlinear transfer functions when SOM neurons are perturbed. Their analysis suggests that in a specific setup of connectivity, instability and gain can be untangled, such that SOM modulation leads to both increases in stability and gain, in contrast to the typical direction in neuronal networks where increased gain results in decreased stability.

      Strengths:

      - Analysis of the canonical circuit in response to SOM perturbations. Through numerical simulations and mathematical analysis, the authors have provided a rather comprehensive picture of how SOM modulation may affect response changes.

      - Shedding light on two opposing circuit motifs involved in the canonical E-PV-SOM circuitry - namely, direct inhibition (SOM → E) vs disinhibition (SOM → PV → E). These two pathways can lead to opposing effects, and it is often difficult to predict which one results from modulating SOM neurons. In simplified circuits, the authors show how these two motifs can emerge and depend on parameters like connection weights.

      - Suggesting potentially interesting consequences for cortical computation. The authors suggest that certain regimes of connectivity may lead to untangling of stability and gain, such that increases in network gain are not compromised by decreasing stability. They also link SOM modulation in different connectivity regimes to versatile computations in visual processing in simple models.

      We thank the reviewer for their support of our work.

      Weaknesses

      Computationally, the analysis is solid, but it’s very similar to previous studies (del Molino et al, 2017). Many studies in the past few years have done the perturbation analysis of a similar circuitry with or without nonlinear transfer functions (some of them listed in the references). This study applies the same framework to SOM perturbations, which is a useful computational analysis, in view of the complexity of the high-dimensional parameter space.

      Link to biology: the most interesting result of the paper with regard to biology is the suggestion of a regime in which gain and stability can be modulated in an unconventional way - however, it is difficult to link the results to biological networks:

      - A general weakness of the paper is a lack of direct comparison to biological parameters or experiments. How can different experiments be reconciled by the results obtained here, and what new circuit mechanisms can be revealed? In its current form, the paper reads as a general suggestion that different combinations of gain modulation and stability can be achieved in a circuit model equipped with many parameters (12 parameters). This is potentially interesting but not surprising, given the high-dimensional space of possible dynamical properties. A more interesting result would have been to relate this to biology, by providing reasoning why it might be relevant to certain circuits (and not others), or to provide some predictions or postdictions, which are currently missing in the manuscript.

      - For instance, a nice motivation for the paper at the beginning of the Results section is the different results of SOM modulation in different experiments - especially between L23 (inhibition) and L4 (disinhibition). But no further explanation is provided for why such a difference should exist, in view of their results and the insights obtained from their suggested circuit mechanisms. How do the parameters identified for the two regimes correspond to different properties of different layers?

      Please see our answer to the previous round of revision.

      - One of the key assumptions of the model is nonlinear transfer functions for all neuron types. In terms of modelling and computational analysis, a thorough analysis of how and when this is necessary is missing (an analysis similar to what has been attempted in Figure 6 for synaptic weights, but for cellular gains). A discussion of this, along with the former analysis to know which nonlinearities would be necessary for the results, is needed, but currently missing from the study. The nonlinearity is assumed for all subtypes because it seems to be needed to obtain the results, but it’s not clear how the model would behave in the presence or absence of them, and whether they are relevant to biological networks with inhibitory transfer functions.

      Please see our answer to the previous round of revision.

      - Tuning curves are simulated for an individual orientation (same for all), not considering the heterogeneity of neuronal networks with multiple orientation selectivity (and other visual features) - making the model too simplistic.

      Please see our answer to the previous round of revision.

      Reviewer #1 (Recommendations For The Authors):

      Introduction, first paragraph, last sentence: suggest ”sense,” → ”sense” (no comma)

      Introduction, second paragraph, first sentence: suggest ”is been” → ”has been”

      Introduction, very end of next to last paragraph: clarify ”modulate the circuit”

      Figure 1 legend: can you make the ”Change ...” in the legend for 1D clearer - e.g. ”strengthen SOM → E connections and eliminate SOM → P connections”.

      Paragraph immediately below Figure 1: In sentence starting ”Specifically ...” can you relate the cases described here back to the equation in Figure 1C?

      Sentence right below equation 2: This sentence does not separate the network gain from the cellular gain as clearly as it could.

      Page 7, second full paragraph: sentence starting ”Therefore, with ...” could be split into two or otherwise made clearer.

      Sentence starting ”Furthermore” right below Figure 5 has an extra comma

      We thank the reviewer for their additional comments and have made the respective changes in the manuscript.

      Reviewer #3 (Recommendations For The Authors):

      There is a long part in the reply letter discussing the link to biology - but the revised manuscript doesn’t seem to reflect that.

      The information in the reply letter discussing the link to biology has been added at multiple points in the discussion. In the section ‘division of labor between PV and SOM neurons’ we mention Ferguson and Cardin 2020, in the section ‘impact of SOM neuron modulation on tuning curves’ we discuss Phillips and Hasenstaub 2016, and in the section ‘limitations and future directions’ we mention Tobin et al., 2023.

      The writing can be improved - for example, see below instances:

      P. 7: Intuitively, the inverse relationship follows for inhibitory and disinhibitory pathways (and their mixture) because the firing rate grid (heatmap) does not depend on how the SOM neurons inhibit the E - PV circuit.

      P.8: We first remark that by adding feedback E connections onto SOM neurons, changes in SOM rates can now affect the underlying heatmaps in the (rE, rP) grid.

      Not clear how ”rates can affect the heatmaps”. It’s too colloquial and not scientifically rigorous or sound.

      We added further explanations at the respective places in the manuscript to improve the writing.

    1. Author Response

      The following is the authors’ response to the previous reviews.

      We appreciate the constructive comments made by the editor and the reviewers. We have corrected errors and provided additional experimental data and analysis to address the latest criticisms raised by the reviewers and provided point-by-point response to the reviewers as below.

      Reviewer #1 (Recommendations For The Authors):

      I do acknowledge the work the authors put into this manuscript and I can accept the fact that the authors decided on a minimum of additional experiments. However, I would recommend the authors to be more concise by adding more information in the method and result sections about how they performed their experiments such as which Nav and AMPAR DNA constructs they used, the age of the mice, how long time they exposed the patches to quinidine, information on how many times they repeated their pull downs etc.

      Answer: We thank the reviewer for the comments. We have incorporated the suggested modifications into our revised manuscript. Specifically, we have included detailed information on the NaV and AMPAR constructs in the Methods section. The homozygous NaV1.6 knockout mice and the wild-type littermate controls were aged P0-P1 (see the Results and Methods sections). Prior to the application of step pulses, cells were exposed to the bath solution containing quinidine for approximately one minute (see the Methods section). Additionally, the co-immunoprecipitation assays for Slack and NaV1.6 were repeated three times (see the Methods section).

      Minor detail in line 263: "...KCNT1 (Slack) have been identified to related to seizure..." I guess this should have been "...KCNT1 (Slack) have been identified and related to seizure..."?

      Answer: We thank the reviewer for raising this point. We have corrected it in the revised manuscript.

      Also, and again minor detail, I had a comment about the color coding in Fig 4 and by mistake, I added 4B, but I meant the use of colors in the entire figure, and mainly the use of colors in 4C, G and I.

      Answer: We apologize for the confusion. We have changed the color coding of Figure 4 in the revised manuscript.

      Reviewer #2 (Recommendations For The Authors):

      While the paper is improved, several concerns do not seem to have been addressed. Some may have been missed because there is no response at all, but others may have been unclear because the response does not address the concern, but a related issue. Details are below.

      Answer: We thank the reviewer for the criticisms. We have made changes to our manuscript to address the concerns.

      Original issue:

      3) Remove the term in vivo.

      Answer: We thank the reviewer for raising this point. In our experiments, although we did not conduct experiments directly in living organisms, our results demonstrated the coimmunoprecipitation of NaV1.6 with Slack in homogenates from mouse cortical and hippocampal tissues (Fig. 3C). This result may support that the interaction between Slack and NaV1.6 occurs in vivo.

      New comment from reviewer:

      The argument to use the term in vivo is not well supported by what the authors have said. Just because tissues are used from an animal does not mean experiments were conducted in vivo. As the authors say, they did not conduct experiments in living organisms. Therefore the term in vivo should be avoided. This is a minor point.

      Answer: We thank the reviewer for pointing this out. We have removed the term “in vivo” in the revised manuscript.

      Original:

      4) Figure 1C Why does Nav1.2 have a small inward current before the large inward current in the inset?

      Answer: We apologize for the confusion. We would like to clarify that the small inward current can be attributed to the current of membrane capacitance (slow capacitance or C-slow). The larger inward current is mediated by NaV1.2.

      New comment:

      This is not well argued. Please note why the authors know the current is due to capacitance. Also, how do they know the larger current is due to NaV1.2? Please add that to the paper so readers know too.

      Answer: We thank the reviewer for the comment. To provide a clearer representation of NaV1.2-mediated currents in Fig. 1C, we have replaced the original example trace with a new one in which only one inward current is observed.

      Original:

      The slope of the rising phase of the larger sodium current seems greater than Nav1.6 or Nav1.5. Was this examined?

      Answer: Additionally, we did not compare the slope of the rising phase of NaV subtypes sodium currents but primarily focused on the current amplitudes.

      New comment:

      This is not a strong answer. There seems to be an effect that the authors do not mention and evidently did not quantify that argues against their conclusion, which weakens the presentation.

      Answer: We thank the reviewer for the comment. To assess the slope of the rising phase of NaV subtype currents, we compared the activation time constants of NaV1.2, NaV1.5, and NaV1.6 peak currents in HEK293 cells co-expressing NaV channel subtypes with Slack. The results showed no significant differences (Author response image 1). We have included this analysis (see Fig. S9A) and the corresponding fitting equation (see the Methods section) in the revised manuscript.

      Author response image 1.

      The activation time constants of peak sodium currents in HEK293 cells co-expressing NaV1.2 (n=6), NaV1.5 (n=5), and NaV1.6 (n=5) with Slack, respectively. ns, p > 0.05, one-way ANOVA followed by Bonferroni’s post hoc test.

      Original:

      2D-E For Nav1.5 the sodium current is very large compared to Nav1.6. Is it possible the greater effect of quinidine for Nav1.6 is due to the lesser sodium current of Nav1.6?

      Answer: We thank the reviewer for raising this point. We would like to clarify that our results indicate that transient sodium currents contribute to the sensitization of Slack to quinidine blockade (Fig. 2C,E). Therefore, it is unlikely that the greater effect observed for NaV1.6 in sensitizing Slack is due to its lower sodium currents.

      New comment:

      I am not sure the question I was asking was clear. How can the authors discount the possibility that quinidine is more effective on NaV1.6 because the NaV1.6 current is relatively weak?

      Answer: We thank the reviewer for raising this point. We have examined the sodium current amplitudes of NaV1.5, NaV1.5/1.6 chimeras, and NaV1.6 upon co-expression of NaV with Slack. Our analysis revealed no significant difference between NaV1.5 and NaV1.5/6N, both of which exhibited much larger current amplitudes than NaV1.6 (Author response image 2). However, only NaV1.5/6N replicates the effect of NaV1.6 in sensitizing Slack to quinidine blockade (Fig. 4H-I). This suggests that the observed differences between NaV1.5 and NaV1.6 in sensitizing Slack are unlikely to be attributable to NaV1.6's lower sodium currents but may instead involve a physical interaction mediated by NaV1.6's N-terminus. We have included this analysis in the revised manuscript (see Fig. S9B).

      Author response image 2.

      Comparison of peak sodium current amplitudes of NaV1.5 (n=9), NaV1.5/6NC (n=13), NaV1.5/6N (n=10), and NaV1.6 (n=8) upon co-expression with Slack in HEK293 cells. ns, p > 0.05, * p < 0.05, ** p < 0.01, *** p < 0.001, **** p < 0.0001; one-way ANOVA followed by Bonferroni’s post hoc test.
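
      As a rough illustration of the Bonferroni step named in the caption above, the sketch below multiplies each raw pairwise p-value by the number of comparisons and caps the result at 1. It assumes that all pairwise comparisons among the groups are tested; the raw p-values are hypothetical placeholders, not values from the study, and the function names are ours.

```python
import math


def n_pairwise_comparisons(n_groups):
    """Number of distinct pairs among n_groups groups."""
    return math.comb(n_groups, 2)


def bonferroni_adjust(p_values):
    """Multiply each raw p-value by the number of tests, capped at 1.0."""
    m = len(p_values)
    return [min(1.0, p * m) for p in p_values]


# Hypothetical raw p-values from three pairwise comparisons (illustration only).
raw_p = [0.004, 0.03, 0.2]
adjusted_p = bonferroni_adjust(raw_p)
print(n_pairwise_comparisons(4), adjusted_p)
```

      With four groups there are six possible pairwise comparisons, so a raw p-value would have to fall below 0.05/6 to remain significant at the 0.05 level under this correction.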

      Original:

      The differences between WT and KO in G -H are hard to appreciate. Could quantification be shown? The text uses words like "block" but this is not clear from the figure. It seems that the replacement of Na+ with Li+ did not block the outward current or effect of quinidine.

      Answer: We apologize for the confusion. We would like to clarify the methods used in this experiment. The lithium ion (Li+) is a much weaker activator of the sodium-activated potassium channel Slack than the sodium ion (Na+) [1,2]. Therefore, we replaced Na+ with Li+ in the bath solution to measure the current amplitudes of sodium-activated potassium currents (IKNa) [3].

      The following equation was used for quantification:

      Furthermore, the remaining IKNa after application of 3 μM quinidine in the bath solution was measured as the following:

      The quantification results were presented in Fig. 1K. The term "block" used in the text referred to the inhibitory effect of quinidine on IKNa.

      New comment:

      The fact remains that the term "block" is too strong for an effect that is incomplete. Also, the authors should add to the paper that Li+ is a weaker activator, so the reader knows some of the caveats to the approach.

      Answer: We thank the reviewer for raising this point. We have added related citations and replaced the term “block” with “inhibit” in the revised manuscript.

      Original:

      1. In K, for the WT, why is the effect of quinidine only striking for the largest currents?

      Answer: We thank the reviewer for raising this point. After conducting an analysis, we found no correlation between the inhibitory effect of quinidine and the amplitudes of baseline IKNa in WT neurons (p = 0.6294) (Author response image 3). Therefore, the effect of quinidine is not solely limited to targeting the larger currents.

      Author response image 3.

      The correlation between the inhibitory effect of quinidine and the amplitudes of baseline IKNa in WT neurons (data from manuscript Fig. 1K). r = 0.1555, p=0.6294, Pearson correlation analysis.
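
      For readers unfamiliar with the analysis named in the caption above, the sketch below computes a Pearson correlation coefficient and the t statistic whose two-tailed p-value is obtained from a t distribution with n - 2 degrees of freedom. The data points are hypothetical placeholders rather than the recordings from Fig. 1K, and the function names are ours.

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    if n != len(y) or n < 3:
        raise ValueError("need two sequences of equal length >= 3")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def t_statistic(r, n):
    """t statistic for testing r against zero, with n - 2 degrees of freedom."""
    return r * math.sqrt((n - 2) / (1.0 - r * r))


# Hypothetical data: baseline current amplitude vs. fractional inhibition.
baseline = [1.0, 2.0, 3.0, 4.0, 5.0]
inhibition = [0.30, 0.25, 0.35, 0.28, 0.32]
r = pearson_r(baseline, inhibition)
print(r, t_statistic(r, len(baseline)))
```

      A small r with a large p-value, as reported in the caption, indicates no detectable linear relationship between baseline current amplitude and the size of the quinidine effect.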

      New comment:

      Please add this to the paper and the figure as Supplemental.

      Answer: We thank the reviewer for raising this point. We have added this figure as Fig.S3B in the revised manuscript.

      Original:

      5) Figure 2 A. The argument could be better made if the same concentration of quinidine were used for Slack and Slack + Nav1.6. It is recognized a greater sensitivity to quinidine is to be shown but as presented the figure is a bit confusing.

      Answer: We apologize for the confusion. We would like to clarify that the presented concentrations of quinidine were chosen to be near the IC50 values for Slack and Slack+NaV1.6.

      New comment:

      Please add this to the paper.

      Answer: We thank the reviewer for raising this point. We have added the clarification about the presented concentrations in the revised manuscript.

      Original:

      2C. Can the authors add the effect of quinidine to the condition where the prepulse potential was 90?

      Answer: We apologize for the confusion. We would like to clarify that the condition of prepulse potential at -90 mV is the same as the condition in Fig. 1. We only changed one experiment condition where the prepulse potential was changed to -40 mV from -90 mV.

      New comment:

      There was no confusion. The authors should consider adding the condition where the prepulse potential was -90.

      Answer: We thank the reviewer for raising this point. We have added the clarification about the voltage condition in the revised manuscript (see in Fig. 2A caption).

      Original:

      2A. Clarify these 6 panels.

      Answer: We thank the reviewer for raising this point. We have clarified the captions of Fig. 3A in the revised manuscript.

      New comment: Clarification is needed. What is the blue? DAPI? What area of hippocampus? Please label cell layers. What area of cortex? Please label layers.

      Answer: We thank the reviewer for raising this point. We have included the clarification in the Figure caption.

      Original:

      Figure 7. The images need more clarity. They are very hard to see. Text is also hard to see.

      Answer: We apologize for the lack of clarity in the images and text. We would like to provide a concise summary of the key findings shown in this figure.

      Figure 7 illustrates an innovative intervention for treating SlackG269S-induced seizures in mice by disrupting the Slack-NaV1.6 interaction. Our results showed that blocking NaV1.6-mediated sodium influx significantly reduced Slack current amplitudes (Fig. 2D,G), suggesting that the Slack-NaV1.6 interaction contributes to the current amplitudes of epilepsy-related Slack mutant variants, aggravating the gain-of-function phenotype. Additionally, Slack’s C-terminus is involved in the Slack-NaV1.6 interaction (Fig. 5D). We assumed that overexpressing Slack’s C-terminus can disrupt the Slack-NaV1.6 interaction (by competing with Slack) and thereby counteract the increased current amplitudes of epilepsy-related Slack mutant variants.

      In HEK293 cells, overexpression of Slack’s C-terminus indeed significantly reduced the current amplitudes of epilepsy-related SlackG288S and SlackR398Q upon co-expression with NaV1.5/6NC (Fig. 7A,B). Subsequently, we evaluated this intervention in an in vivo epilepsy model by introducing the Slack G269S variant into C57BL/6N mice using AAV injection, mimicking the human Slack mutation G288S that we previously identified (Fig. 7C-G).

      New comment:

      The images do not appear to have changed. Consider moving labels above the images so they can be distinguished better. Please label cell layers. Consider adding arrows to the point in the figure the authors want the reader to notice. The study design and timeline are unclear. What is (1) + (3), (2), etc.?

      Answer: We thank the reviewer for pointing this out. We have modified Figure 7 in the revised manuscript and included the cell layer information in the Figure caption.

      Original:

      It is not clear how data were obtained because injection of kainic acid does not lead to a convulsive seizure every 10 min for several hours, which is what appears to be shown. Individual seizures are just at the beginning and then they merge at the start of status epilepticus. After the onset of status epilepticus the animals twitch, have varied movements, sometime rear and fall, but there is not a return to normal behavior. Therefore one can not call them individual seizures. In some strains of mice, however, individual convulsive seizures do occur (even if the EEG shows status epilepticus is occurring) but there are rarely more than 5 over several hours and the graph has many more. Please explain.

      Answer: We apologize for the confusion. Regarding the data acquisition in relation to kainic acid injection, we initiated the timing following intraperitoneal injection of kainic acid and recorded the seizure score of each mouse at ten-minute intervals, following the methodology described in previous studies [4].

      The seizure scores were determined using a modified Racine, Pinel, and Rovner scale [5,6]: (1) facial movements; (2) head nodding; (3) forelimb clonus; (4) dorsal extension (rearing); (5) loss of balance and falling; (6) repeated rearing and falling; (7) violent jumping and running; (8) stage 7 with periods of tonus; (9) death.

      New comment:

      This was clear. Perhaps my question was not clear. The question is how one can count individual seizures if animals have continuous seizures. It seems like the authors did not consider or observe status epilepticus but individual seizures. If that is true the data are hard to believe because too many seizures were counted. Animals do not have nearly this many seizures after kainic acid.

      Answer: We appreciate the reviewer’s clarification. Rather than counting individual seizures, we assessed the maximum seizure score of each mouse during each 10-minute interval, as previously described [7]. For instance, if a mouse lost balance and fell multiple times within the 30-40 minute interval, we recorded a seizure score of 5 for that interval.
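
      The scoring rule described in the answer above (one maximum score per 10-minute interval, not a count of seizures) can be sketched as follows. This is a minimal illustration under our own assumptions: the observation format, the function name, and the example values are hypothetical and are not taken from the study.

```python
from collections import defaultdict


def max_score_per_interval(observations, interval_min=10):
    """Map each interval start (in minutes) to the highest score seen in it.

    observations: iterable of (minutes_after_injection, seizure_score) pairs.
    Intervals without any observation are simply absent from the result.
    """
    best = defaultdict(int)
    for time_min, score in observations:
        start = int(time_min // interval_min) * interval_min
        best[start] = max(best[start], score)
    return dict(sorted(best.items()))


# Hypothetical observations for one mouse (illustration only).
observations = [(3, 1), (8, 2), (14, 3), (31, 5), (36, 5), (38, 4)]
scores = max_score_per_interval(observations)
print(scores)
```

      In this example the mouse reaches score 5 twice between 30 and 40 minutes, yet the interval contributes a single value of 5, which is why the plotted series has one point per interval regardless of how many events occurred.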

      Reviewer #3 (Recommendations For The Authors):

      While the authors have improved the manuscript, several outstanding issues still need to be addressed. Some may have been missed because there is no response at all, but others may have been unclear.

      Answer: We thank the reviewer for the criticisms. We have added additional experimental data and analysis to address the concerns.

      Original issue from Public Review:

      1. Immunolabeling of the hippocampus CA1 suggests sodium channels as well as Slack colocalization with AnkG (Fig 3A). Proximity ligation assay for NaV1.6 and Slack or a super-resolution microscopy approach would be needed to increase confidence in the presented colocalization results. Furthermore, coimmunoprecipitation studies on the membrane fraction would bolster the functional relevance of NaV1.6-Slack interaction on the cell surface.

      Answer: We thank the reviewer for the good suggestions. We acknowledge that employing proximity ligation assay and high-resolution techniques would significantly enhance our understanding of the localization of the Slack-NaV1.6 coupling.

      At present, the technical capabilities available in our laboratory and institution do not support high-resolution testing. However, we are enthusiastic about exploring potential collaborations to address these questions in the future. Furthermore, we fully recognize the importance of conducting coimmunoprecipitation (Co-IP) assays from membrane fractions. While we have already completed Co-IP assays for total protein and quantified the FRET efficiency values between Slack and NaV1.6 in the membrane region, the Co-IP assays on membrane fractions will be conducted in our future investigations.

      New comment from reviewer: so far, the authors have not demonstrated that Nav1.6 and Slack interact on the cell surface.

      Answer: We thank the reviewer for pointing this out. We acknowledge that our data did not directly demonstrate interaction between NaV1.6 and Slack on the cell surface and we have removed related terminology in the revised manuscript. Notably, our patch-clamp experiments in Fig. 2D,G and Fig. S10B showed a Na+-mediated membrane current coupling of Slack and NaV1.6. Additionally, the FRET efficiency values between Slack and NaV1.6 were quantified in the membrane region. These findings suggest that membrane-near Slack interacts with NaV1.6.

      1. Although hippocampal slices from Scn8a+/- were used for studies in Fig. S8, it is not clear whether Scn8a-/- or Scn8a+/- tissue was used in other studies (Fig 1J & 1K). It will be important to clarify whether genetic manipulation of NaV1.6 expression (Fig. 1K) has an impact on sodium-activated potassium current, level of surface Slack expression, or that of NaV1.6 near Slack.

      Answer: We thank the reviewer for pointing this out. In Fig. 1G,J,K, primary cortical neurons from homozygous NaV1.6 knockout (Scn8a-/-) mice were used. We will clarify this information in the revised manuscript. In terms of the effects of genetic manipulation of NaV1.6 expression on IKNa and surface Slack expression, we compared the amplitudes of IKNa measured from homozygous NaV1.6 knockout (NaV1.6-KO) neurons and wild-type (WT) neurons. The results showed that homozygous knockout of NaV1.6 does not alter the amplitudes of IKNa (Author response image 4). The level of surface Slack expression will be tested further.

      Author response image 4.

      The amplitudes of IKNa in WT and NaV1.6-KO neurons (data from manuscript Fig. 1K). ns, p > 0.05, unpaired two-tailed Student’s t test.

      New comment from reviewer: The current version of the manuscript does not contain these pertinent details and needs to be updated to include the information pertaining to homozygous NaV1.6 knockouts. What age were these homozygous NaV1.6 knockout mice? These details need to be clearly stated in the manuscript.

      Answer: We thank the reviewer for pointing this out. We have included this analysis in the revised manuscript (see Fig. S3A). The homozygous NaV1.6 knockout mice were aged P0-P1, and we have added this detail to the revised manuscript.

      1. Did the epilepsy-related Slack mutations have an impact on NaV1.6-mediated sodium current?

      Answer: We thank the reviewer for the question. We examined the amplitudes of NaV1.6 sodium current upon expression alone or co-expression of NaV1.6 with epilepsy-related Slack mutations (K629N, R950Q, K985N). The results showed that the tested epilepsy-related Slack mutations do not alter the amplitudes of NaV1.6 sodium current (Author response image 5).

      Author response image 5.

      The amplitudes of NaV1.6 sodium currents upon co-expression of NaV1.6 with epilepsy-related Slack mutant variants (SlackK629N, SlackR950Q, and SlackK985N). ns, p > 0.05, one-way ANOVA followed by Bonferroni’s post hoc test.

      New comment from reviewer: Figure with the functional effect of co-expression of NaV1.6 with epilepsy-related Slack mutations should be included in the revised manuscript

      Answer: We thank the reviewer for pointing this out. We have included this analysis in the revised manuscript (see Fig. S10A).

      Original issue from Recommendations For The Authors:

      1. A reference to homozygous knockout is made in the abstract; however, only heterozygous mice are mentioned in the methods section. The genotype of the mice needs to be made clear in the manuscript. Furthermore, at what age were these mice used in the study. Since homozygous knockout of NaV1.6 is lethal at a very young age (<4 wks), it would be important to clarify that point as well.

      Answer: We thank the reviewer for pointing this out. In the revised manuscript, we have included information about the source of the primary cortical neurons used in our study. These neurons were obtained from postnatal homozygous NaV1.6 knockout C3HeB/FeJ mice and their wild-type littermate controls.

      New comment from reviewer: The answer that postnatal homozygous NaV1.6 knockout C3HeB/FeJ mice were used is insufficient. What age were these mice? This needs to be clearly stated in the manuscript.

      Answer: We thank the reviewer for pointing this out. The postnatal homozygous NaV1.6 knockout C3HeB/FeJ mice and their wild-type littermate controls were aged P0-P1. We have included this information in the revised manuscript.

      1. How long were the cells exposed to quinidine before the functional measurements were performed?

      Answer: We thank the reviewer for pointing this out. The cells were exposed to the bath solution with quinidine for about one minute before applying step pulses.

      New comment from reviewer: This needs to be clearly stated in the manuscript.

      Answer: We thank the reviewer for pointing this out. We have included this information in the revised manuscript (see in Methods section).

      1. In Fig. 6B-D, it is not clear to what extent co-expression of Slack mutants and NaV1.6 increases sodium-activated potassium current.

      Answer: We thank the reviewer for pointing this out. We notice that the current amplitudes of Slack mutants exhibit a considerable degree of variation, ranging from less than 1 nA to over 20 nA (n = 58). To accurately measure the effects of NaV1.6 on increasing current amplitudes of Slack mutants, we plan to apply tetrodotoxin in the bath solution to block NaV1.6 sodium currents upon co-expression of Slack mutants with NaV1.6.

      New comment from reviewer: Were these experiments with TTX completed? If so, they should be added to the revised manuscript.

      Answer: We thank the reviewer for pointing this out. We compared the current amplitudes of epilepsy-related Slack mutant (SlackR950Q) before and after bath-application of 100 nM TTX upon co-expression with NaV1.6 in HEK293 cells. The results showed that bath-application of TTX significantly reduced the current amplitudes of SlackR950Q at +100 mV by nearly 40% (Author response image 6), suggesting NaV1.6 contributes to the current amplitudes of SlackR950Q. We have included this data in the revised manuscript (see Fig. S10B).

      Author response image 6.

      The current amplitudes of SlackR950Q before and after bath-application of 100 nM TTX upon co-expression with NaV1.6 in HEK293 cells (n=5). ***p < 0.001, Two-way repeated measures ANOVA followed by Bonferroni’s post hoc test.

      Additionally, we have corrected some errors in the methods and figure captions section:

      1. Line 513, bath solution “5 glucose” should be “10 glucose.”

      2. Figure 3A caption, the description “hippocampus CA1 (left) and neocortex (right)” was flipped and we have corrected it.

      References

      1. Zhang Z, Rosenhouse-Dantsker A, Tang QY, Noskov S, Logothetis DE. The RCK2 domain uses a coordination site present in Kir channels to confer sodium sensitivity to Slo2.2 channels. J Neurosci. Jun 2 2010;30(22):7554-62. doi:10.1523/JNEUROSCI.0525-10.2010

      2. Kaczmarek LK. Slack, Slick and Sodium-Activated Potassium Channels. ISRN Neurosci. Apr 18 2013;2013(2013)doi:10.1155/2013/354262

      3. Budelli G, Hage TA, Wei A, et al. Na+-activated K+ channels express a large delayed outward current in neurons during normal physiology. Nat Neurosci. Jun 2009;12(6):745-50. doi:10.1038/nn.2313

      4. Huang Z, Walker MC, Shah MM. Loss of dendritic HCN1 subunits enhances cortical excitability and epileptogenesis. J Neurosci. Sep 2 2009;29(35):10979-88. doi:10.1523/JNEUROSCI.1531-09.2009

      5. Pinel JP, Rovner LI. Electrode placement and kindling-induced experimental epilepsy. Exp Neurol. Jan 15 1978;58(2):335-46. doi:10.1016/0014-4886(78)90145-0

      6. Racine RJ. Modification of seizure activity by electrical stimulation. II. Motor seizure. Electroencephalogr Clin Neurophysiol. Mar 1972;32(3):281-94. doi:10.1016/0013-4694(72)90177-0

      7. Kim EC, Zhang J, Tang AY, et al. Spontaneous seizure and memory loss in mice expressing an epileptic encephalopathy variant in the calmodulin-binding domain of Kv7.2. Proc Natl Acad Sci U S A. Dec 21 2021;118(51)doi:10.1073/pnas.2021265118

    1. Author response:

      The following is the authors’ response to the previous reviews.

      eLife assessment

      In this important study, the authors report a novel measurement of the Escherichia coli chemotactic response and demonstrate that these bacteria display an attractant response to potassium, which is connected to intracellular pH level. Whilst the experiments are mostly convincing, there are some confounders regarding pH changes and fluorescent proteins that remain to be addressed.

      Public Reviews:

      Reviewer #1 (Public Review):

      Summary:

      This paper shows that E. coli exhibits a chemotactic response to potassium by measuring both the motor response (using a bead assay) and the intracellular signaling response (CheY phosphorylation level via FRET) to step changes in potassium concentration. They find that an increase in potassium concentration induces a considerable attractant response, with amplitude comparable to aspartate, and that cells can quickly adapt (and generally over-adapt). The authors propose that the mechanism for the potassium response is through modifying intracellular pH; they find both that potassium modifies pH and that other pH modifiers induce similar attractant responses. It is also shown, using Tar- and Tsr-only mutants, that these two chemoreceptors respond to potassium differently. Tsr has a standard attractant response, while Tar has a biphasic response (repellent-like then attractant-like). Finally, the authors use computer simulations to study the swimming response of cells to a periodic potassium signal secreted from a biofilm and find a phase delay that depends on the period of oscillation.

      Strengths:

      The finding that E. coli can sense and adapt to potassium signals and the connection to intracellular pH is quite interesting and this work should stimulate future experimental and theoretical studies regarding the microscopic mechanisms governing this response. The evidence (from both the bead assay and FRET) that potassium induces an attractant response is convincing, as is the proposed mechanism involving modification of intracellular pH. The updated manuscript controls for the impact of pH on the fluorescent protein brightness that can bias the measured FRET signal. After correction, the response amplitude and sharpness (Hill coefficient) are comparable to conventional chemoattractants (e.g. aspartate), indicating the general mechanisms underlying the response may be similar. The authors suggest that the biphasic response of Tar mutants may be due to pH influencing the activity of other enzymes (CheA, CheR or CheB), which will be an interesting direction for future study.

      Weaknesses:

      The measured response may be biased by adaptation, especially for weak potassium signals. For other attractant stimuli, the response typically shows a low plateau before it recovers (adapts). In the case of potassium, the FRET signal does not have an obvious plateau following the stimuli of small potassium concentrations, perhaps due to the faster adaptation compared to other chemoattractants. It is possible cells have already partially adapted when the response reaches its minimum, so the measured response may be a slight underestimate of the true response. Mutants without adaptation enzymes appear to be sensitive to potassium only at much larger concentrations, where the pH significantly disrupts the FRET signal; more accurate measurements would require development of new mutants and/or measurement techniques.

      We acknowledge and appreciate the reviewer's concerns regarding the potential impact of adaptation on the measured response magnitude, and we have estimated this effect. The half-time of adaptation at 30 mM KCl was measured to be approximately 80 s, corresponding to a time constant of t = 80/ln(2) = 115.4 s, which is significantly longer than the time required for medium exchange in the flow chamber (less than 10 s). Consequently, the relative effect of adaptation on the measured response magnitude should be less than 1-exp(-10/t) = 8.3%. Even for the fastest adaptation we measured (at the lowest KCl concentration), the effect should be less than 20%, which is within experimental uncertainties. Nevertheless, we agree that developing new techniques to measure the dose-response curve more precisely would be beneficial.
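
      The estimate above can be reproduced with a few lines of Python. This is only a sketch: it assumes that adaptation follows a single-exponential recovery, and the function and parameter names are illustrative rather than taken from the manuscript.

```python
import math


def adaptation_bound(half_time_s: float, exchange_time_s: float) -> tuple[float, float]:
    """Return the adaptation time constant and the upper bound on the
    fraction of the response lost to adaptation during medium exchange.

    Assumes single-exponential adaptation (an assumption of this sketch).
    """
    # Convert the measured half-time into an exponential time constant.
    tau = half_time_s / math.log(2)
    # Fraction of the response that can recover within the exchange time.
    lost = 1.0 - math.exp(-exchange_time_s / tau)
    return tau, lost


# Values quoted in the response: 80 s half-time at 30 mM KCl, <10 s exchange.
tau, lost = adaptation_bound(half_time_s=80.0, exchange_time_s=10.0)
print(f"tau = {tau:.1f} s, adaptation effect < {lost:.1%}")
```

      With the quoted values this gives a time constant of about 115.4 s and a bound of about 8.3%, matching the figures in the response.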

      Reviewer #2 (Public Review):

      Zhang et al investigated the biophysical mechanism of potassium-mediated chemotactic behavior in E coli. Previously, it was reported by Humphries et al that the potassium waves from oscillating B subtilis biofilm attract P aeruginosa through chemotactic behavior of motile P aeruginosa cells. It was proposed that K+ waves alter PMF of P aeruginosa. However, the mechanism of this behaviour remained elusive. In this study, Zhang et al demonstrated that motile E coli cells accumulate in regions of high potassium levels. They found that this behavior likely results from the chemotaxis signalling pathway, mediated by an elevation of intracellular pH. Overall, a solid body of evidence is provided to support the claims. However, the impacts of pH on the fluorescent proteins need to be better evaluated. In its current form, the evidence is insufficient to say that the fluorescence intensity ratio results from FRET. It may well be an artefact of pH change.

      The authors now carefully evaluated the impact of pH on their FRET sensor by examining the YFP and CFP fluorescence with no-receptor mutant. The authors used this data to correct the impact of pH on their FRET sensor. This is an improvement, but the mathematical operation of this correction needs clarification. This is particularly important because, looking at the data, it is not fully convincing if the correction was done properly. For instance, 3mM KCl gives 0.98 FRET signal both in Fig3 and FigS4, but there is almost no difference between blue and red lines in Fig 3. FigS4 is very informative, but it does not address the concern raised by both reviewers that FRET reporter may not be a reliable tool here due to pH change.

      We apologize for not making the correction process clear. We corrected the impact of pH on the original signals for both CFP and YFP channels by

      SpH(L) = S0(L) / CpH(L),

      where SpH(L) and S0(L) represent the pH-corrected and original PMT signal (CFP or YFP channel) from the moment of addition of L mM KCl to the moment of its removal, respectively, and CpH(L) is the correction factor, which is the ratio of the PMT signal post- to pre-KCl addition for the no-receptor mutant at L mM KCl, for the CFP or YFP channel, as shown in Fig. S5. The pH-corrected FRET response is then calculated as the ratio of the pH-corrected YFP to the pH-corrected CFP signals, normalized by the pre-stimulus ratio.

      As shown in Author response image 1, which represents the same data as Fig. 3A and Fig. S5A, the original normalized FRET responses to 3 mM KCl are 0.967 for the wild-type strain (Fig. 3) and 0.981 for the no-receptor strain (Fig. S5). The standard deviation of the FRET values under steady-state conditions is 0.003. Thus, the difference in responses between the wild-type and no-receptor strains is significant and clearly exceeds the standard deviation. The pH correction factors CpH at 3 mM KCl are 1.004 for the YFP signal and 1.016 for the CFP signal. Consequently, the pH-corrected FRET responses are 0.967×1.016/1.004 = 0.979 for the wild-type and 0.981×1.016/1.004 = 0.993 for the no-receptor strain. The reason the pH-corrected FRET response for the no-receptor strain is 0.993 instead of the expected 1.000 is that this value represents the lowest observed response rather than the average value for the FRET response.
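
      The arithmetic in this paragraph implies that each channel is divided by its own correction factor before the YFP/CFP ratio is taken, so the normalized FRET value is scaled by the CFP factor over the YFP factor. The short Python sketch below reproduces the two quoted numbers under that reading; the function and argument names are illustrative and not taken from the manuscript.

```python
def ph_corrected_fret(fret_raw: float, c_yfp: float, c_cfp: float) -> float:
    """Apply the per-channel pH correction to a normalized FRET (YFP/CFP) value.

    Corrected YFP = YFP / c_yfp and corrected CFP = CFP / c_cfp, so the
    corrected ratio equals the raw ratio multiplied by c_cfp / c_yfp.
    """
    return fret_raw * c_cfp / c_yfp


# Correction factors quoted for 3 mM KCl.
C_YFP, C_CFP = 1.004, 1.016

wild_type = ph_corrected_fret(0.967, C_YFP, C_CFP)
no_receptor = ph_corrected_fret(0.981, C_YFP, C_CFP)

print(f"wild-type:   {wild_type:.3f}")
print(f"no-receptor: {no_receptor:.3f}")
```

      Rounded to three decimals, the results are 0.979 and 0.993, in agreement with the values stated above.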

      The detailed mathematical operation for correcting the pH impact has now been included in the “FRET assay” section of Materials and Methods.

      Author response image 1.

      Chemotactic response of the wild-type strain (A, HCB1288-pVS88) and the no-receptor strain (B, HCB1414-pVS88) to stepwise addition and removal of KCl. The blue solid line denotes the original normalized signal. Downward and upward arrows indicate the time points of addition and removal of 3 mM KCl, respectively. The horizontal red dashed line denotes the original normalized FRET response value to 3 mM KCl.

      The authors show the FRET data with both KCl and K2SO4, concluding that the chemotactic response mainly resulted from potassium ions. However, this was only measured by FRET. It would be more convincing if the motility assay in Fig1 is also performed with K2SO4. The authors did not address this point. In light of complications associated with the use of the FRET sensor, this experiment is more important.

      We thank the reviewer for the suggestion. We agree that additional confirmation with a motility assay is important. To address this, we have now measured the response of the motor rotational signal to 15 mM K2SO4 using the bead assay and compared it with the response to 30 mM KCl. The results are shown in Fig. S2. The response of motor CW bias to 15 mM K2SO4 exhibited an attractant response, characterized by a decreased CW bias upon the addition of K2SO4, followed by an over-adaptation that is qualitatively similar to the response to 30 mM KCl. However, there were notable differences in the adaptation time and the presence of an overshoot. Specifically, the adaptation time to K2SO4 was shorter compared to that for KCl, and there was a notable overshoot in the CW bias during the adaptation phase. These differences may have resulted from the weaker response to K2SO4 (Fig. S1B) and additional modifications due to CysZ-mediated cellular uptake of sulfate (Zhang et al., Biochimica et Biophysica Acta 1838,1809–1816 (2014)). The faster adaptation and overshoot complicated the chemotactic drift in the microfluidic assay as in Fig. 1, such that we were unable to observe a noticeable drift in a K2SO4 gradient under the same experimental conditions used for the KCl gradient.

      The response of motor rotational signal to 15 mM K2SO4 has been added to Fig. S2.

      Recommendations for the authors:

      Reviewer #1 (Recommendations For The Authors):

      (1) The response curve and adaptation level/time in the main text (Fig. 4) should be replaced by the corrected counterparts (currently in Fig. S5). The current version is especially confusing because Fig. 6 shows the corrected response, but the difference from Fig. 4 is not mentioned.

      We thank the reviewer for the suggestion. We have now merged the results of the original Fig. S5 into Fig. 4.

      a. The discussion of the uncorrected response with a small Hill coefficient and potentially negative cooperativity was left in the text (lines 223-234), but the new measurements show this is not true for the actual response. This should be removed or significantly rephrased.

      We thank the reviewer for the suggestion. We have now removed the statement about potentially negative cooperativity and added the corrected results for the actual response.

      (2) It may be helpful to restate the definition of f_m in the methods (near Eq. 3-4).

      Thank you for the suggestion. We have now restated the definition of fm and fL below Eq. 3-4: “In the denominator on the right-hand side of Eq. 3, the two terms within the parentheses of exponential expression represent the methylation-dependent (fm) and ligand-dependent (fL) free energy, respectively.”

    1. Author response:

      The following is the authors’ response to the previous reviews.

      Reviewer #2:

      (1) The use of two m<sup>5</sup>C reader proteins is likely a reason for the high number of edits introduced by the DRAM-Seq method. Both ALYREF and YBX1 are ubiquitous proteins with multiple roles in RNA metabolism including splicing and mRNA export. It is reasonable to assume that both ALYREF and YBX1 bind to many mRNAs that do not contain m<sup>5</sup>C.

      To substantiate the author's claim that ALYREF or YBX1 binds m<sup>5</sup>C-modified RNAs to an extent that would allow distinguishing its binding to non-modified RNAs from binding to m<sup>5</sup>C-modified RNAs, it would be recommended to provide data on the affinity of these, supposedly proven, m<sup>5</sup>C readers to non-modified versus m<sup>5</sup>C-modified RNAs. To do so, this reviewer suggests performing experiments as described in Slama et al., 2020 (doi: 10.1016/j.ymeth.2018.10.020). However, using dot blots like in so many published studies to show modification of a specific antibody or protein binding, is insufficient as an argument because no antibody, nor protein, encounters nanograms to micrograms of a specific RNA identity in a cell. This issue remains a major caveat in all studies using so-called RNA modification reader proteins as bait for detecting RNA modifications in epitranscriptomics research. It becomes a pertinent problem if used as a platform for base editing similar to the work presented in this manuscript.

      The authors have tried to address the point made by this reviewer. However, rather than performing an experiment with recombinant ALYREF-fusions and m<sup>5</sup>C-modified versus unmodified RNA oligos for testing the enrichment factor of ALYREF in vitro, the authors resorted to citing two manuscripts. One manuscript is cited by everybody when it comes to ALYREF as an m<sup>5</sup>C reader; however, none of the experiments have been repeated by another laboratory. The other manuscript reports on YBX1 binding to m<sup>5</sup>C-containing RNA and mentions PAR-CLIP experiments with ALYREF, the details of which are nowhere to be found in doi: 10.1038/s41556-019-0361-y.

      Furthermore, the authors have added RNA pull-down assays that should substitute for the requested experiments. Interestingly, Figure S1E shows that ALYREF binds equally well to unmodified and m<sup>5</sup>C-modified RNA oligos, which contradicts doi: 10.1038/cr.2017.55 and supports the conclusion that wild-type ALYREF is not a specific m<sup>5</sup>C binder. The necessity of always including an overexpression of ALYREF-mut in parallel DRAM experiments makes the developed method better controlled but not easy to handle (expression differences of the plasmid-driven proteins, etc.).

      Thank you for pointing this out. First, we would like to correct our previous response: the binding ability of ALYREF to m<sup>5</sup>C-modified RNA was initially reported in doi: 10.1038/cr.2017.55 (and not in doi: 10.1038/s41556-019-0361-y), where it was observed through PAR-CLIP analysis that the K171 mutation weakens its binding affinity to m<sup>5</sup>C-modified RNA.

      Our previous experimental approach was not optimal: the protein concentration in the INPUT group was too high, leading to overexposure in the experimental group. Additionally, we did not conduct a quantitative analysis of the results at that time. In response to your suggestion, we performed RNA pull-down experiments with YBX1 and ALYREF, rather than with the pan-DRAM protein, to better validate and reproduce the previously reported findings. Our quantitative analysis revealed that both ALYREF and YBX1 exhibit a stronger affinity for m<sup>5</sup>C-modified RNAs. Furthermore, mutating the key amino acids involved in m<sup>5</sup>C recognition significantly reduced the binding affinity of both readers. These results align with previous studies (doi: 10.1038/cr.2017.55 and doi: 10.1038/s41556-019-0361-y), confirming that ALYREF and YBX1 are specific readers of m<sup>5</sup>C-modified RNAs. However, our detection system has certain limitations. Despite mutation of the critical amino acids, both readers retained a weak binding affinity for m<sup>5</sup>C, suggesting that while the mutation helps reduce false positives, it is still challenging to precisely map the distribution of m<sup>5</sup>C modifications. To address this, we plan to further investigate the protein structure and function to obtain more accurate m<sup>5</sup>C sequencing of the transcriptome in future studies. Accordingly, we have updated our results and conclusions in lines 294-299 and discussed these limitations in lines 109-114.

      In addition, while the m<sup>5</sup>C assay can be performed using the DRAM system alone, comparing it with the DRAM<sup>mut</sup>C control enhances the accuracy of m<sup>5</sup>C region detection. To minimize variations in transfection efficiency across experimental groups, it is recommended to use the same batch of transfections. This approach not only ensures more consistent results but also improves the standardization of the DRAM assay, as discussed in the section added in lines 308-312.

      (2) Using sodium arsenite treatment of cells as a means to change the m<sup>5</sup>C status of transcripts through the downregulation of the two major m<sup>5</sup>C writer proteins NSUN2 and NSUN6 is problematic and the conclusions from these experiments are not warranted. Sodium arsenite is a chemical that poisons every protein containing thiol groups. Not only do NSUN proteins contain cysteines but also the base editor fusion proteins. Arsenite will inactivate these proteins, hence the editing frequency will drop, as observed in the experiments shown in Figure 5, which the authors explain with fewer m<sup>5</sup>C sites to be detected by the fusion proteins.

      The authors have not addressed the point made by this reviewer. Instead the authors state that they have not addressed that possibility. They claim that they have revised the results section, but this reviewer can only see the point raised in the conclusions. An experiment would have been to purify base editors via the HA tag and then perform some kind of binding/editing assay in vitro before and after arsenite treatment of cells.

      We appreciate the reviewer’s insightful comment. We fully agree with the concern raised. In the original manuscript, our intention was to use sodium arsenite treatment to downregulate NSUN-mediated m<sup>5</sup>C levels and subsequently decrease DRAM editing efficiency, with the aim of monitoring m<sup>5</sup>C dynamics through the DRAM system. However, as the reviewer pointed out, sodium arsenite may inactivate both NSUN proteins and the base editor fusion proteins, and any such inactivation would likely result in reduced DRAM editing. This confounds the interpretation of our experimental data.

      As demonstrated in Appendix A, western blot analysis confirmed that sodium arsenite indeed decreased the expression of fusion proteins. In addition, we attempted in vitro fusion protein purification using multiple fusion tags (HIS, GST, HA, MBP) for DRAM fusion protein expression, but unfortunately, we were unable to obtain purified proteins. However, using the Promega TNT T7 Rapid Coupled In Vitro Transcription/Translation Kit, we successfully purified the DRAM protein (Appendix B). Despite this success, subsequent in vitro deamination experiments did not yield the expected mutation results (Appendix C), indicating that further optimization is required. This issue is further discussed in lines 314-315.

      Taken together, the above evidence indicates that the results of the sodium arsenite treatment experiment are confounded, and we have therefore removed the corresponding results from the main text of the revised manuscript.

      Author response image 1.

      (3) The authors should move high-confidence editing site data contained in Supplementary Tables 2 and 3 into one of the main Figures to substantiate what is discussed in Figure 4A. However, the data needs to be visualized in another way than Excel format. Furthermore, Supplementary Table 2 does not contain a description of the columns, while Supplementary Table 3 contains a single row with letters and numbers.

      The authors have not addressed the point made by this reviewer. Figure 3F shows the screening process for DRAM-seq assays and principles for screening high-confidence genes rather than the data contained in Supplementary Tables 2 and 3 of the former version of this manuscript.

      Thank you for your valuable suggestion. We have visualized the data from Supplementary Tables 2 and 3 in Figure 4A as a circlize diagram (described in lines 213-216), illustrating the distribution of mutation sites detected by the DRAM system across each chromosome. Additionally, to improve the presentation and clarity of the data, we have revised Supplementary Tables 2 and 3 by adding column descriptions, merging the DRAM-ABE and DRAM-CBE sites, and including overlapping m<sup>5</sup>C genes from previous datasets.

    1. Author response:

      The following is the authors’ response to the previous reviews.

      Reviewer #1 (Public review):

      Summary:

      Liver cancer shows a higher incidence in males than in females, with incompletely understood causes. This study utilized a mouse model that lacks the bile acid feedback mechanisms (FXR/SHP DKO mice) to study how dysregulation of bile acid homeostasis and high circulating bile acid levels may underlie the gender-dependent prevalence and prognosis of HCC. By transcriptomics analysis comparing male and female mice, unique sets of gene signatures were identified and correlated with HCC outcomes in human patients. The study showed that the ovariectomy procedure increased HCC incidence in female FXR/SHP DKO mice that were otherwise resistant to age-dependent HCC development, and that removing bile acids by blocking intestinal bile acid absorption reduced HCC progression in FXR/SHP DKO mice. Based on these findings, the authors suggest that gender-dependent bile acid metabolism may play a role in the male-dominant HCC incidence, and that reducing bile acid level and signaling may be beneficial in HCC treatment.

      Strengths:

      (1) Chronic liver diseases often precede the development of liver and bile duct cancer. Advanced chronic liver diseases are often associated with dysregulation of bile acid homeostasis and cholestasis. This study takes advantage of a unique FXR/SHP DKO model that develops high organ bile acid exposure and spontaneous age-dependent HCC development in males but not females to identify unique HCC-associated gene signatures. The study showed that the unique gene signature in female DKO mice that had lower HCC incidence also correlated with lower grade HCC and better survival in human HCC patients.

      (2) The study also suggests that differentially regulated bile acid signaling or gender-dependent response to altered bile acids may contribute to gender-dependent susceptibility to HCC development and/or progression.

      (3) The sex-dependent differences in bile acid-mediated pathology clearly exist but are still not fully understood at the mechanistic level. Female mice have been shown to be more sensitive to bile acid toxicity in a few cholestasis models, while this study showed a male dominance of bile acid promotion of HCC. This study used ovariectomy to demonstrate that female hormones are possible underlying factors. Future studies are needed to understand the interaction of sex hormones, bile acids, and chronic liver diseases and cancer.

      We thank Reviewer 1 for their positive and thorough assessment of our manuscript.

      Weaknesses:

      (1) HCC shows heterogeneity, and it is unclear what tissues (tumor or normal) were used from the DKO mice and human HCC gene expression dataset to obtain the gene signature, and how the authors reconcile these gene signatures with HCC prognosis.

      Mice studies: Aged DKO mice develop aggressive tumors (major and minor nodules, see Figure 1), and the entire liver is burdened with multiple tumor nodules. It is technically challenging to demarcate the tumor boundaries as most of the surrounding tissues do not display normal tissue architecture. Therefore, livers from age- and sex-matched wild-type C57/BL6 mice were used as control tissue. All the mice were inbred in our facility. Spatial transcriptomics and longitudinal studies are ongoing to collect tumors at earlier time points wherein we can differentiate tumor and non-tumor tissue.

      Human Studies: We mined five separate clinical data sets. The human HCC gene expression data comprised samples from (i) the National Cancer Institute (NCI) cohort (GEO accession numbers GSE1898 and GSE4024) and the (ii) Korea, (iii) Samsung, (iv) Modena, and (v) Fudan cohorts, as previously described (GEO accession numbers GSE14520, GSE16757, GSE43619, GSE36376, and GSE54236). We have added a new Supplemental Table 4 giving details of these datasets. Depending on the cohort, these are primarily HCC samples (surgical resections of HCC) and control samples, with some tumors and paired non-tumor tissues.

      (2) The authors identified a unique set of gene expression signatures that are linked to HCC patient outcomes, but analysis of these gene sets to understand the causes of cancer promotion is still lacking. The studies of urea cycle metabolism and estrogen signaling were preliminary and inconclusive. These mechanistic aspects may be followed up in revision or future studies.

      We agree. Experiments to establish HCC causality and promotion are complex, given the heterogeneous nature of liver cancer. Moreover, the length of time (12 months) needed to spontaneously develop cancer in this DKO mouse model makes it challenging. As mentioned by the reviewer, mechanistic studies are ongoing, and longitudinal time course experiments are actively being pursued to delineate causality. Having said that, we mined the TCGA LIHC (The Cancer Genome Atlas Liver Hepatocellular Carcinoma) database to examine the expression of the individual urea cycle genes and found them suppressed in liver tumorigenesis (new Supplementary Figure 4). We also evaluated whether estrogen receptor (Er) targets altered in DKO females (DKO_Estrogen) correlate with overall survival in HCC (new Supplementary Figure 6). We note that Er expression per se is reduced in males and females upon liver tumorigenesis. Also, the DKO_Estrogen signature correlated positively with better overall survival (new Supplementary Figure 6). These findings further bolster the relevance of urea cycle metabolism and estrogen signaling during HCC.

      (3) While high levels of bile acids are convincingly shown to promote HCC progression, their role in HCC initiation is not established. The DKO model may be limited to conditions of extremely high levels of organ bile acid exposure. The DKO mice do not model the human population of HCC patients with various etiology and shared liver pathology (i.e. cirrhosis). Therefore, high circulating bile acids may not fully explain the male prevalence of HCC incidence.

      We agree with this comment that our studies do not show bile acids can initiate HCC and may act as one of the many factors that contribute to the high male prevalence of HCC. This is exactly the reason why throughout the manuscript we do not write about HCC initiation. To clarify further, in the revised discussion of the manuscript, we have added a sentence to highlight this aspect, “while this study demonstrates bile acids promote HCC progression it does not investigate or provide evidence if excess bile acids are sufficient for HCC initiation.”

      (4) The authors showed lower circulating bile acids and increased fecal bile acid excretion in female mice and hypothesized that this may be a mechanism underlying the lower bile acid exposure that contributed to lower HCC incidence in female DKO mice. Additional analysis of organ bile acids within the enterohepatic circulation may be performed because a more accurate interpretation of the circulating bile acids and fecal bile acids can be made in reference to organ bile acids and total bile acid pool changes in these mice.

      As shown in this manuscript, we provide BA compositional analyses from the liver, serum, urine, and feces (Figures 5 and 6, new Supplementary Figure 8, Supplementary Tables 4 and 5). Unfortunately, we did not collect the intestinal tissue or gallbladders for BA analysis in this study. Separate cohorts of mice are being aged for future BA analyses from different organs within the enterohepatic loop. We thank you for this suggestion. Nevertheless, we have previously measured and reported BA values to be elevated in the intestines and the gallbladder of young DKO mice (PMC3007143).

      Reviewer #2 (Public review):

      Weaknesses:

      (1) The translational value to human HCC is not so strong yet. Authors show that there is a correlation between the female-selective gene signature and low-grade tumors and better survival in HCC patients overall. However, these data do not show whether this signature is more highly correlated with female tumor burden and survival. In other words, whether the mechanisms of female protection may be similar between humans and mice. In that respect, it would also be good to elaborate on whether women have higher fecal BA excretion and lower serum BA concentration.

      The reviewer poses an interesting question to test if the DKO female-specific signatures are altered differently in male vs. female HCC samples. As we found the urea cycle and estrogen signaling to be protective and enriched in our mouse model, we tested their expression pattern using the TCGA-LIHC RNA-seq data. We found urea cycle genes and Er transcripts broadly reduced in tumor samples irrespective of the sex (new Supplementary Figure 4 and Supplementary Figure 6), indicating that these pathways are compromised upon tumorigenesis even in the female livers. 

      While prior studies have shown (i) a smaller BA pool w synthesis in men than women (PMID: 22003820), we did not find a study that systematically investigated BA excretion between the sexes in HCC context. The reviewer is spot on in suggesting BA analysis from HCC and unaffected human fecal samples from both sexes. Designing and performing such studies in the future will provide concrete proof of whether BA excretion protects female livers from developing liver cancer. We thank you for these suggestions.

      (2) The authors should perform a thorough spelling and grammar check.

      We apologize for the typos, which have been fixed, and as suggested by the reviewer, we have performed a grammar check.

      (3) There are quite some errors and inaccuracies in the result section, figures, and legends. The authors should correct this.

      We apologize for the inadvertent errors in the manuscript, and we have clarified these inaccuracies in the revised version. Thank you.

      Reviewer #1 (Recommendations for the authors):

      (1) Figures 1A-F, This statement of altered liver steatosis needs to be further supported by measurement of liver triglycerides. Lower magnification images of Sirius red stain should be shown for better evaluation of liver fibrosis.

      Unfortunately, we did not measure liver triglycerides, and the Sirius red-stained samples have faded, so lower-magnification images are unavailable at this juncture. We have modified our results accordingly.

      We did not take the gross picture of WT female and DKO female livers in the same frame as shown below. Since the manuscript is focused on male and female differences in liver cancer incidence, we provided DKO male and female liver images as Figure 1D in the paper.

      Author response image 1.

      Gross liver images of year-old WT and DKO mice, showing prominent hepatocarcinogenesis in DKO male mice.

      (2) Can the authors clarify if the gene transcriptomics was performed with normal or tumor tissues of DKO mice?

      Gene transcriptomics was performed with the tumor tissue of DKO mice. We have previously published data from younger, non-tumor-bearing DKO male mice (PMCID: PMC3007143).

      (3) Supplementary Figure 3C. Could the authors confirm if this is F vs M or just DKO female since it does not seem to match the result description in the main text? It is better practice to indicate the sub-panels of the Supplementary Figures in the main text while describing the results.

      As the reviewer correctly points out, Supplementary Figure 3C shows the DKO F vs M signature, not the DKO_female signature, and this has been clarified in the text. We have also now included the DKO_F data to reduce confusion.

      (4) Figure 3. Legend, the data presented are not well explained in the Legend, especially the labeling and what is being presented and compared.

      As suggested by the reviewer, we have modified the legend accordingly.

      (5) Supplementary Table 4 does not contain total serum bile acid as described in the main text.

      We agree with the reviewer. We provided primary and secondary BA concentrations in Supplementary Table 4 (currently Supplementary Table 5 in the revised version; rows 20 and 21), but not their added total. We have modified the text accordingly.

      (6) Method section: many experiments lack descriptions of details.

      We have added details to the animal experimental design and ER ChIP-PCR, included schematics of the experiments within the main and supplemental figures, and expanded the metabolomics and BA analysis sections.

      Reviewer #2 (Recommendations For The Authors):

      General:

      (1) The authors are advised to do a thorough grammar and spelling check.

      We have performed a spelling and grammar check, as suggested, using the online platform Grammarly. Thank you.

      Results:

      (1) Figure 1: The authors should show in Figure 1D female WT and female DKO liver.

      See Author response image 1, added in our response to point 1 of Reviewer 1's comments.

      In the Figure legend, (A-E) should be replaced by (A+D). 

      Thank you. We have modified it accordingly.

      The authors do not refer to 1J in the text, please add this reference.

      Thank you for pointing this out. We have referenced 1J in the text.

      The description of 1H does not elaborate on the sex differences in ALT/AST levels, as this is the focus of the manuscript.

      We have added a sentence to show that the injury markers are higher in DKO males, which is consistent with an advanced disease. Thanks.

      The authors should use the correct nomenclature in Figure 1I/1J (gene vs protein and capitals vs non-capitals).

      Figures 1I and 1J show gene expression of Fxr and Shp, and hence we used the non-capital italicized nomenclature. Thanks.

      (2) Figure 2:

      The x-axis length is different in Figures 2A and 2B. Please correct to visualize the differences between males and females better.

      The x-axis length has been fixed as suggested. Thanks.

      (3) Figure 3:

      The authors should elaborate on how the patients were assigned to each gene signature. This is not fully clear.

      The gene sets obtained from the WT and DKO mice were used. The process is shown as a schematic in Supplemental Fig 2C, and the gene list is included in an Excel sheet as Supplemental Table 1.

      We are curious how these data (F3A-C) would look when separating male and female human patients.

      We performed an overall survival analysis on subgroups of patients and provide it below. We segregated the HCC cohort data by sex and age (>55 yr, since we assumed 55 as the age of menopause) and evaluated the DKO gene signature. Similar to the original Figure 3, we find that, irrespective of sex and age, the DKO FvsM gene signature corresponds with better overall survival in both men and women. These findings align with the combined overall survival analysis shown in the original Figure 3 of the manuscript, and therefore we did not modify it. If deemed necessary, we are happy to include the figure below in the main manuscript.

      Author response image 2.

      Correlation of gene signatures obtained from WT and DKO mouse model with the survival data of HCC patients segregated by age and sex. The Kaplan Meier Survival graphs were generated based on WT and DKO transcriptome changes using five HCC clinical cohorts. Analysis of OS (Overall Survival) in patients ((A) Men and (B) Women) using the gene signatures representative of either male WT or male DKO, female WT or female DKO, and unique changes observed in female DKO mice but not in male DKO mice.
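
      The stratified survival comparison described above can be illustrated with a minimal Kaplan-Meier sketch. This is not the pipeline used for the figure: the patient records, field names, and the `kaplan_meier` helper are hypothetical, and no statistical test between groups is included.

```python
from typing import List, Tuple


def kaplan_meier(times: List[float], events: List[int]) -> List[Tuple[float, float]]:
    """Return (time, survival probability) at each time with an observed death.

    events[i] is 1 if a death was observed at times[i] and 0 if the
    patient was censored at that time.
    """
    if len(times) != len(events):
        raise ValueError("times and events must have the same length")
    survival = 1.0
    at_risk = len(times)
    curve = []
    for t in sorted(set(times)):
        deaths = sum(1 for ti, e in zip(times, events) if ti == t and e == 1)
        leaving = sum(1 for ti in times if ti == t)  # deaths plus censored at t
        if deaths > 0:
            survival *= 1 - deaths / at_risk
            curve.append((t, survival))
        at_risk -= leaving
    return curve


# Hypothetical records (illustrative values only, not cohort data).
patients = [
    {"sex": "F", "age": 62, "months": 14.0, "event": 1},
    {"sex": "F", "age": 70, "months": 30.0, "event": 0},
    {"sex": "F", "age": 48, "months": 22.0, "event": 1},
    {"sex": "M", "age": 66, "months": 9.0, "event": 1},
    {"sex": "F", "age": 58, "months": 41.0, "event": 1},
]

# Subgroup selection by sex and age, mirroring the >55 yr split described above.
women_over_55 = [p for p in patients if p["sex"] == "F" and p["age"] > 55]
curve = kaplan_meier(
    [p["months"] for p in women_over_55],
    [p["event"] for p in women_over_55],
)
for month, prob in curve:
    print(f"{month:5.1f} months: S(t) = {prob:.3f}")
```

      In practice, one curve would be computed per signature group within each subgroup and the curves compared with an appropriate test.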

      What was used as the control signature in Figure 3C? Please specify this.

      For Figure 3C, we compared the DKO_M signature to the DKO F vs M signature. These genes are listed in an Excel sheet (Supplementary Table 1).

      The authors claim that DKO female mice display chronic cholestasis, similar to their male counterparts. Please refer to previous work or show the data.

      Elevated serum BA levels in DKO females are reported in Supplementary Table 5, and we find comparable hepatic BA composition in Figure 5F.

      (4) Figure 4: Labels for the x-axis are missing in Figure 4C. Please add legends or labels to the bars.

      The x axis label is included in the top Serum BAs in (M)

      In Figure 4I, the percentage of input is quite low. An IgG control would show whether recruitment of ERalpha to the shown loci is significant above background levels. Also, ChIP on the OVX liver could serve as a negative control.

      We did use IgG as a control pull-down, and only signals above this background were considered. We have not performed this in OVX liver, which would be an excellent negative control for future studies. Thank you.

      The results and legends refer to ChIP-qPCR, while methods only mention ChIP-seq. Please adapt.

      We sincerely apologize for the mistake. We used published ChIP-seq data to identify putative binding sites and then performed ChIP-PCR to validate them. We have clarified and rectified this error. Thank you.

      Significance indications in the figure legend do not correspond with significance indications in the figure. Please explain the used significance symbols in the figure in the legend.

      Thank you. The significance symbols in the figure and the legend have been matched.

      (5) Figure 5:

      Authors claim lowered total serum BA in females compared to males, and reference to Supplementary Table 4. However, these data are not provided, only percentages and ratios are displayed.

      In the revised version, this has become Table 5. See response to the same concern noted by Reviewer 1, Point 5 above.

      Figure 5D: Are sulphated BA also elevated in WT females? Please provide these data.

      There is no significant urinary excretion of BAs in WT control animals. We have previously measured and found none. But under cholestatic conditions BAs are observed in urine. Therefore, sulphated BA levels were found only in the DKO mice. 

      Figure 5H: Is the fecal BA excretion in WT females also proportionally higher than in males? Please provide these data.

      We were unable to perform untargeted metabolomics profiling of WT fecal samples. When we measured BAs in the feces, as expected, very low concentrations were present irrespective of sex (~0.01 M), and we did not find any sex difference. Also, prior studies in the 129SVJ strain showed comparable fecal excretion (PMC150802). We did not find any clinical studies that measured fecal BA between the sexes.

      (6) Figure 6:

      References in the text of the result section to Figure 6 are wrong. The authors should change this.

      Thank you. This has been rectified.

      Significance indications in the legend do not correspond with significance indications in the figure. Please explain the used significance symbols in the figure in the legend.

      Thank you. The significance symbols in the figure and the legend have been matched.

      (7) Supplemental Figure 3:

      Please adapt the title of this figure; the sentence is incorrect. The description of this figure is very poor.

      We have modified the legend and the title of Supplemental Figure 3 to make them more appropriate. Thanks.

      Please explain what the blue and red dots represent.

      Each blue or yellow dot indicates the Bayesian probability generated from our BCCP model.

      What are the bold horizontal lines representing? Why are there no dots in some box plots? Please elaborate.

      The box represents the interquartile range (IQR), encompassing the middle 50% of the data. The bottom and top edges correspond to the 25th and 75th percentiles, respectively, while the bold horizontal line indicates the median value.

      The absence of visible dots in certain categories, particularly in higher CLIP and TNM stages, is due to the small number of patients, all of whom had similar Bayesian prediction probabilities. As these values cluster tightly around the median, the individual dots may overlap and be hidden behind the median line.
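
      The box plot elements described above can be computed as in the following sketch. The probability values are hypothetical, and plotting libraries may use slightly different quantile conventions than the `inclusive` method assumed here.

```python
from statistics import quantiles


def box_summary(values):
    """Return the quartiles and interquartile range drawn by a box plot."""
    q1, q2, q3 = quantiles(values, n=4, method="inclusive")
    return {"q1": q1, "median": q2, "q3": q3, "iqr": q3 - q1}


# Hypothetical prediction probabilities for a well-populated stage.
spread_group = [0.12, 0.35, 0.48, 0.61, 0.77, 0.90]
# Hypothetical small group whose values cluster tightly: the box collapses
# toward the median line, so overlaid dots are easily hidden behind it.
tight_group = [0.91, 0.92, 0.92, 0.93]

print(box_summary(spread_group))
print(box_summary(tight_group))
```

      A narrow IQR in a small group is what makes the individual points hard to see in the figure.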

      The figure is not visually easy to understand, please reconsider the representation.  

      We hope the modified figure legends, with the explanation of the lines and the points in the graphs, increase the clarity and make the figures acceptable.

      Please add the DKO_female signature plot.

      We have added this graph to Supplemental Figure 3.

      (8) Supplemental 4A:

      Fold change at Z-score is missing. This should be added.

      Thank you. We have added this information.

      (9) Supplemental 5:

      The scale bar is missing. This should be included.

      The figure is now Supplemental Figure 8, and the scale bar has been added.

      Methods:

      (1) Did the authors use ChIP-sequencing or ChIP-qPCR? Please describe the correct method.

      We apologize for the error. We used ChIP-PCR and have rectified this in our methods, as noted in our response to the Figure 4 query.

      (2) It is unclear how the mouse model was generated. Please refer to earlier publications.

      The mice were generated in house at UIUC, and we have added this sentence to the Methods section. The original reference has been cited in the text (PMCID: PMC3007143).

      Discussion:

      (1) The authors claim in the discussion: 'consistently higher recruitment of ER to the classical BA synthetic genes ...' This is not shown in Figure 4I, only ER recruitment to Cyp7a1 is significantly higher in females. Please rephrase.

      We agree and have modified the sentence. Cyp7a1 accounts for ~75% of BA synthesis and is a rate-limiting gene in the classical BA synthesis pathway.

      (2) The authors could make their statements stronger if they could elaborate on whether women have more fecal BA excretion, and if there are differences in serum BA concentration in HCC between male and female patients. 

      Unfortunately, we were unable to find clinical studies with appropriate controls which examined and reported serum BA in HCC in a sex specific manner.

      In addition, to understand whether the female-specific protections in humans are similar to mice, it would be nice to show correlations of the female-specific mouse signature with male and female liver signatures.

      At this time, we do not have sufficiently large control or precancerous early-stage patient datasets from both sexes to make such comparisons. Nevertheless, these sex-specific signatures have translational relevance. Author response image 2 shows that the DKO male signature correlates with poor overall survival in males, whereas neither the DKO male nor the DKO female signature predicts outcome in females. In contrast, the DKO female-specific gene signature (DKOFvsM) correlates with better overall survival in both men and women.

      (3) The authors state in the discussion: 'Currently we do not know how to reconcile this data other than indicating a potential ER independent mechanism.' We do not understand the reasoning behind this statement. Please clarify.

      We find that increased Erα expression in DKO coincides with CA-mediated suppression of BA synthesis genes in the absence of Fxr and Shp. But we also noticed that in OVX DKO mice, Erα expression is blunted, and so is basal BA synthesis gene expression. Putting together these data, it is intriguing that Erα expression correlates both positively and negatively with BA synthesis genes. To reconcile these contrasting results, we have written the following sentence in the discussion.

      “These findings suggest Erα expression is linked to both positive and negative regulation of BA synthesis genes. But we do not know how ER elicits these differential effects on BA synthesis.”

    1. Author response:

      The following is the authors’ response to the previous reviews.

      Response to Reviewer 1

      Thank you for your recognition of our revised work.

      Response to Reviewer 2

      It would be useful to have a demonstration of where this model outperforms SaProt systematically, and a discussion about what the success of this model teaches us given there is a similar, previously successful model, SaProt.

      As two concurrent works, ProtSSN and SaProt employ different methods to incorporate the structure information of proteins. Generally speaking, for two deep learning models developed within a similar time frame, it is challenging to conclude that one model is systematically superior to the other. Nonetheless, on DTm and DDG (the two low-throughput datasets that we constructed), ProtSSN demonstrates better empirical performance than SaProt.

      Moreover, ProtSSN is more efficient in both training and inference compared to SaProt. In terms of training cost, SaProt uses 40 million protein structures for pretraining (requiring 64 A100 GPUs for three months), whereas ProtSSN requires only about 30,000 crystal structures from the CATH database (trained on a single 3090 GPU for two days). Despite SaProt’s significantly higher training cost, its pretrained version does not exhibit superior performance on low-throughput datasets such as DTm, DDG, and Clinvar. Furthermore, the high training cost limits many users from retraining or fine-tuning the model for specific needs or datasets.

      Regarding the inference cost, ProtSSN requires only one embedding computation for a wild-type protein, regardless of the number of mutants (n). In contrast, SaProt computes a separate embedding and score for each mutant. For instance, when evaluating the scoring performance on ProteinGym, ProtSSN only needs 217 inferences, while SaProt needs more than 2M inferences. This inference speed is important in practice, such as high-throughput design and screening.
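
      The difference in inference cost can be made concrete with a small counting sketch. The two functions below are hypothetical and only count forward passes; they are not the ProtSSN or SaProt APIs, and the assay sizes are made-up placeholders.

```python
from typing import Dict


def passes_wildtype_once(assays: Dict[str, int]) -> int:
    """One forward pass per assay: the wild-type embedding is computed once
    and reused to score every mutant of that protein."""
    return len(assays)


def passes_per_mutant(assays: Dict[str, int]) -> int:
    """One forward pass per mutant sequence across all assays."""
    return sum(assays.values())


# Hypothetical assay name -> number of mutants (illustrative values only).
toy_assays = {"assay_a": 5000, "assay_b": 12000, "assay_c": 800}

print(passes_wildtype_once(toy_assays))  # 3
print(passes_per_mutant(toy_assays))     # 17800
```

      Under this counting, the first strategy scales with the number of assays and the second with the total number of mutants, which is the gap between 217 and more than 2M inferences quoted above.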

      Please remove the reference to previous methods as "few shot". This typically refers to their being trained on experimental data, not their using MSAs. A "few shot" model would be ProteinNPT.

      The definition of "few-shot" we use here follows ESM-1v [1]. This concept originates from providing a certain number of examples as input to GPT-3 [2]. In the context of protein deep learning models, the MSA serves as the wild-type protein examples.

      Also, Reviewer 1 uses the concept in the same way. 

      “Readers should note that methods labelled as "few-shot" in comparisons do not make use of experimental labels, but rather use sequences inferred as homologous; these sequences are also often available even if the protein has never been experimentally tested.”

      In the main text, we also included this definition and the reference to ESM-1v in lines 457-458.

      “We extend the evaluation on ProteinGym v0 to include a comparison of our zero-shot ProtSSN with few-shot learning methods that leverage MSA information of proteins (Meier et al., 2021).”

      [1] Meier J, Rao R, Verkuil R, et al. Language models enable zero-shot prediction of the effects of mutations on protein function. Advances in Neural Information Processing Systems, 2021.

      [2] Brown T, Mann B, Ryder N, et al. Language models are few-shot learners. Advances in Neural Information Processing Systems, 2020.

      Furthermore, I don't think it is fair to state that your method is not comparable to these models -- one can run an MSA just as one can predict a structure. A fairer comparison would be to highlight particular assays for which getting an MSA could be challenging -- Tranception did this by showing that they outperform EVE when MSAs are shallow.

      We recognize that there are often differences in the definitions and classifications of various methodologies. Here, we follow the definitions provided by ProteinGym. As ProteinGym is the most comprehensive large-scale open benchmark in the community, we believe its classification scheme should be widely accepted. All classifications are available on the official website of ProteinGym (https://proteingym.org/benchmarks), which categorizes methods into PLMs, structure-based models, and alignment-based models. For example, GEMME is classified as an alignment-based model, and MSA Transformer is considered a hybrid model combining alignment and PLM features.

      We believe that methodologies with different inputs and architectures can lead to inherent unfairness. Also, it is generally believed that models including evolutionary relationships tend to outperform end-to-end models due to the extra information and efforts involved during the training phase. Some empirical evidence and discussions are in the ablation studies of retrieval factors in Tranception [3]. Moreover, the choice of MSA search parameters can introduce uncertainty, which could have positive or negative impacts. 

      We showcase the impact of MSA depth on model performance with an additional analysis below. Author response image 1 plots each model's performance (Spearman's correlation) against the number of MSA sequences on the 217 ProteinGym assays, where each point represents one assay. The summary correlation of each model across all assays is reported in Author response table 1. These results demonstrate no clear correlation between MSA depth and model performance, even for MSA-based models.

      Author response image 1.

      Scatter plots of the number of MSA sequences and Spearman's correlation.

      Author response table 1.

      Spearman's correlation between the number of MSA sequences and each model's performance.
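
      For readers who wish to reproduce this kind of check, a rank correlation between MSA depth and per-assay performance can be computed as in the sketch below. It is a plain-Python illustration with hypothetical numbers, not the script used to produce the table.

```python
from math import sqrt
from typing import List, Sequence


def average_ranks(values: Sequence[float]) -> List[float]:
    """Return 1-based ranks, giving tied values the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1  # mean of 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman's rho, computed as the Pearson correlation of the ranks."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("inputs must have equal length of at least 2")
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx)
    vy = sum((b - my) ** 2 for b in ry)
    if vx == 0 or vy == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / sqrt(vx * vy)


# Hypothetical per-assay values (illustrative only).
msa_depth = [10, 200, 3500, 40000, 120]
performance = [0.41, 0.38, 0.52, 0.47, 0.44]
print(spearman(msa_depth, performance))
```

      A value near zero across assays would be consistent with the absence of a clear relationship reported above.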

      [3] Notin P, Dias M, Frazer J, et al. Tranception: protein fitness prediction with autoregressive transformers and inference-time retrieval. International Conference on Machine Learning, 2022.

      The authors state that DTm and DDG are conceptually appealing because they come from low-throughput assays with lower experimental noise and are also mutations that are particularly chosen to represent the most interesting regions of the protein. I agree with the conceptual appeal but I don't think these claims have been demonstrated in practice. The cited comparison with Frazer as a particularly noisy source of data I think is particularly unconvincing: ClinVar labels are not only rigorously determined from multiple sources of evidence, Frazer et al demonstrates that these labels are actually more reliable than experiment in some cases. They also state that ProteinGym data doesn't come with environmental conditions, but these can be retrieved from the papers the assays came from. The paper would be strengthened by a demonstration of the conceptual benefit of these new datasets, say a comparison of mutations and signal for a protein that may be in one of these datasets vs ProteinGym.

      In the work by Frazer et al. [4], they mentioned that

      "However, these technologies do not easily scale to thousands of proteins, especially not to combinations of variants, and depend critically on the availability of assays that are relevant to or at least associated with human disease phenotypes." 

      It points out that the results of high-throughput experiments are usually based on the design of specific genes (such as BRCA1 and TP53) and cannot be easily extended to thousands of other genes. At the same time, due to the complexity of the experiments, there may be problems with reproducibility or deviations from clinical relevance.

      This statement aligns with our perspective that high-throughput experiments inherently involve a significant amount of noise and error. It is important to clarify that the noise we discuss here arises from the limitations of high-throughput experiments themselves, instead of from the reliability of the data sources, such as systematic errors in experimental measurements. This latter issue is a complex problem common to all wetlab experiments and falls outside the scope of our study.

      Under this premise, low-throughput datasets like DTm and DDG can be considered to have less noise than high-throughput datasets, as they have undergone manual curation. As for your suggestion, while valuable, we were unfortunately unable to identify proteins in DTm and DDG that align with those in ProteinGym after a careful search. Thus, we are unable to conduct this comparative experiment at this stage.

      [4] Frazer J, Notin P, Dias M, et al. Disease variant prediction with deep generative models of evolutionary data. Nature, 2021.

    1. Author Response

      The following is the authors’ response to the previous reviews.

      Public Reviews:

      Reviewer #2 (Public Review):

      I would like to express my appreciation for the authors' dedication to revising the manuscript. It is evident that they have thoughtfully addressed numerous concerns I previously raised, significantly contributing to the overall improvement of the manuscript.

      Response: We appreciate the reviewers’ recognition of our efforts in revising the manuscript.

      My primary concern regarding the authors' framing of their findings within the realm of habitual and goal-directed action control persists. I will try to explain my point of view and perhaps clarify my concerns. While acknowledging the historical tendency to equate procedural learning with habits, I believe a consensus has gradually emerged among scientists, recognizing a meaningful distinction between habits and skills or procedural learning. I think this distinction is crucial for a comprehensive understanding of human action control. While these constructs share similarities, they should not be used interchangeably. Procedural learning and motor skills can manifest either through intentional and planned actions (i.e., goal-directed) or autonomously and involuntarily (habitual responses).

      Response: We would like to clarify that, contrary to the reviewer’s assertion of a scientific consensus on this matter, the discussion surrounding the similarities and differences between habits and skills remains an ongoing and unresolved topic of interest among scientists (Balleine and Dezfouli, 2019; Du and Haith, 2023; Graybiel and Grafton, 2015; Haith and Krakauer, 2018; Hardwick et al., 2019; Kruglanski and Szumowska, 2020; Robbins and Costa, 2017). We absolutely agree with the reviewer that “Procedural learning and motor skills can manifest either through intentional and planned actions (i.e., goal-directed) or autonomously and involuntarily (habitual responses)”. But so can habits. Some researchers also highlight the intentional/goal-directed nature of habits (e.g., Du and Haith, 2023, “Habits are not automatic” (preprint), or Kruglanski and Szumowska, 2020, “Habitual behavior is goal-driven”: “definitions of habits that include goal independence as a foundational attribute of habits are begging the question; they effectively define away, and hence dispose of, the issue of whether habits are goal-driven” (p. 1258)). Therefore, there is no clear consensus concerning the concept of habit.

      While we acknowledge the meaningful distinctions between habits and skills, we also recognize a substantial body of literature supporting the overlap between these concepts (cited in our manuscript), particularly at the neural level. The literature clearly indicates that both habits and skills are mediated by subcortical circuits, with a progressive disengagement of cognitive control hubs in frontal and cingulate cortices as repetition evolves. We do not use these concepts interchangeably. Instead, we simply present evidence supporting the assertion that our trained app sequences meet several criteria for their habitual nature.

      Our choice of Balleine and Dezfouli (2018)'s criteria stemmed from the comprehensive nature of their definitions, which effectively synthesized insights from various researchers (Mazar and Wood, 2018; Verplanken et al., 1998; Wood, 2017, etc.). Importantly, their list highlights the positive features of habits that were previously overlooked. However, these authors still included a controversial criterion ("habits as insensitive to changes in their relationship to their individual consequences and the value of those consequences"), even though they acknowledged the problems of using outcome devaluation methods and of relying on a null effect. According to Kruglanski and Szumowska (2020), this criterion is highly problematic, as “If, by definition, habits are goal-independent, then any behavior found to be goal-dependent could not be a habit on sheer logical grounds” (p. 1257). In their definition, “habitual behavior is sensitive to the value of the reward (i.e., the goal) it is expected to mediate and is sensitive to the expectancy of goal attainment (i.e., obtainment of the reward via the behavior)” (p. 1265). In fact, some recent analyses of habitual behavior do not use devaluation or revaluation as a criterion (Du and Haith, 2023). This article, for example, ascertains habits using different criteria and provides supporting evidence for trained action sequences being understood as skills, with both goal-directed and habitual components.

      In the discussion of our manuscript, we explicitly acknowledge that the app sequences can be considered habitual or goal-directed in nature and that this terminology does not alter the fact that our overtrained sequences exhibit clear habitual features.

      Watson et al. (2022) aptly detailed my concerns in the following statements: "Defining habits as fluid and quickly deployed movement sequences overlaps with definitions of skills and procedural learning, which are seen by associative learning theorists as different behaviors and fields of research, distinct from habits."

      "...the risk of calling any fluid behavioral repertoire 'habit' is that clarity on what exactly is under investigation and what associative structure underpins the behavior may be lost." I strongly encourage the authors, at the very least, to consider Watson et al.'s (2022) suggestion: "Clearer terminology as to the type of habit under investigation may be required by researchers to ensure that others can assess at a glance what exactly is under investigation (e.g., devaluation-insensitive habits vs. procedural habits)", and to refine their terminology accordingly (to make this distinction clear). I believe adopting clearer terminology in these respects would enhance the positioning of this work within the relevant knowledge landscape and facilitate future investigations in the field.

      Response: We would like to highlight that we have indeed followed Watson et al.'s (2022) recommendations on focusing on other features/criteria of habits at the expense of the outcome devaluation/contingency degradation paradigm, which has been more controversial in the human literature. Our manuscript clearly aligns with Watson et al.'s (2022) recommendations: “there are many other features of habits that are not captured by the key metrics from outcome devaluation/contingency degradation paradigms such as the speed at which actions are performed and the refined and invariant characteristics of movement sequences (Balleine and Dezfouli, 2019). Attempts are being made to develop novel behavioral tasks that tap into these positive features of habits, and this should be encouraged as should be tasks that are not designed to assess whether that behavior is sensitive to outcome devaluation, but capture the definition of habits through other measures”.

      Regarding the authors' use of Balleine and Dezfouli's (2018) criteria to frame recorded behavior as habitual, as well as to acknowledge the study's limitations, it's important to highlight that while the authors labelled the fourth criterion (which they were not fulfilling) as "resistance to devaluation," Balleine and Dezfouli (2018) define it as "insensitive to changes in their relationship to their individual consequences and the value of those consequences." In my understanding, this definition is potentially aligned with the authors' re-evaluation test, namely, it is conceptually adequate for evaluating the fourth criterion (which is the most accepted in the field and probably the one that differentiates habits from skills). Notably, during this test, participants exhibited goal-directed behavior.

      The authors characterized this test as possibly assessing arbitration between goal-directed and habitual behavior, stating that participants in both groups "demonstrated the ability to arbitrate between prior automatic actions and new goal-directed ones." In my perspective, there is no justification for calling it a test of arbitration. Notably, the authors inferred that participants were habitual before the test based on some criteria, but then transitioned to goal-directed behavior based on a different criterion. While I agree with the authors' comment that: "Whether the initiation of the trained motor sequences in experiment 3 (arbitration) is underpinned by an action-outcome association (or not) has no bearing on whether those sequences were under stimulus-response control after training (experiment 1)." they implicitly assert a shift from habit to goal-directed behavior without providing evidence that relies on the same probed mechanism. Therefore, I think it would be more cautious to refer to this test as solely an outcome revaluation test. Again, the results of this test, if anything, provide evidence that the fourth criterion was tested but not met, suggesting participants have not become habitual (or at least undermines this option).

      Response: In our previously revised manuscript, we duly acknowledged that the conventional (perhaps nowadays considered outdated) goal devaluation criterion was not met, primarily due to constraints in designing the second part of the study. We did cite evidence from another similar study that had used devaluation app-trained action sequences to demonstrate habitual qualities (but the reviewer ignored this).

      The reviewer points out that we did use a manipulation of goal revaluation in one of the follow-up tests conducted (although this was not a conventional goal revaluation test inasmuch that it was conducted in a novel context). In this test, please note that we used 2 manipulations: monetary and physical effort. Although we did show that subjects, including OCD patients, were apparently goaldirected in the monetary reward manipulation, this was not so clear when goal re-evaluation involved the physical effort expended. In this effort manipulation, participants were less goaloriented and OCD patients preferred to perform the longer, familiar, to the shorter, novel sequence, thus exhibiting significantly greater habitual tendencies, as compared to controls. Hence, we cannot decisively conclude that the action sequence is goal-directed as the reviewer is arguing. In fact, the evidence is equivocal and may reflect both habitual and goal-directed qualities in the performance of this sequence, consistent with recent interpretations of skilled/habitual sequences (Du and Haith, 2023). Relying solely on this partially met criterion to conclude that the app-trained sequences are goal-directed, and therefore not habitual, would be an inaccurate assessment for several reasons: 1) the action sequences did satisfy all other criteria for being habitual; 2) this approach would rest on a problematic foundation for defining habits, as emphasized by Kruglanski & Szumowska (2020); and 3) it would succumb to the pitfall of subscribing to a zero-sum game perspective, as cautioned by various researchers, including the review by Watson et al. (2022) cited by the referee, thus oversimplifying the nuanced nature of human behavior.

      While we have previously complied with the reviewer’s suggestion on relabelling our follow-up test as a “revaluation test” instead of an “arbitration test”, we have now explicitly removed all mentions of the term “arbitration” (which seems to raise concerns) throughout the manuscript. As the reviewer has suggested, we now use a more refined terminology by explicitly referring to the measured behavior as "procedural habits". We have also extensively revised the discussion section of our manuscript to incorporate the reviewer’s viewpoint. We hope that these adjustments enhance the clarity and accuracy of our manuscript, addressing the concerns raised during this review process.

      In essence, this is an ontological and semantic matter that does not alter our findings in any way. Whether the sequences are considered habitual or goal-directed does not change our findings that 1) both groups displayed equivalent procedural learning and automaticity attainment; 2) OCD patients exhibit greater subjective habitual tendencies via self-reported questionnaires; 3) patients who had elevated compulsivity and habitual self-reported tendencies engaged significantly more with the motor habit-training app, practiced more and reported symptom relief at the end of the study; 4) these particular patients also show an augmented inclination to attribute higher intrinsic value to familiar actions, a possible mechanism underlying compulsions.

      Reviewer #2 (Recommendations For The Authors):

      A few more small comments (with reference to the point numbers indicated in the rebuttal):

      (14) I am not entirely sure why the suggested analysis is deemed impractical (i.e., why it cannot be performed by "pretending" participants received the points they should have received according to their performance). This can further support (or undermine) the idea of effect of reward on performance rather than just performance on performance.

      Response: We have now conducted this analysis, generating scores for each practice trial after day 20, when participants no longer gained points for their performance. This analysis assesses whether participants’ trial-wise behavioral changes exhibit a similar pattern following simulated relative increases or decreases in scores, as if they had been receiving points at this stage. Note that this analysis has fewer trials available, around 50% fewer on average.

      Before presenting our results, we wish to emphasize the importance of distinguishing between the effects of performance on performance and the effects of reward on performance. In response to a reviewer's suggestion, we assessed the former in the first revision of our manuscript. We normalized the movement time variable and evaluated how normalized behavioral changes responded to score increments and decrements. The results from the original analyses were consistent with those from the normalized data.

      Regarding the phase where participants no longer received scores, we believe this phase primarily helps us understand the impact of 'predicted' or 'learned' rewards on performance. Once participants have learned the simple association between faster performance and larger scores, they can be expected to continue exhibiting the reward sensitivity effects described in our main analysis. We do not consider it feasible to assess the effects of performance on performance during the reward removal phase, which occurs after 20 days. Therefore, the following results pertain to how the learned associations between faster movement times and scores persist in influencing behavior, even when explicit scores are no longer displayed on the screen.

      Results: The main results of the effect of reward on behavioral changes persist, supporting the view that relative increases or decreases in scores (real or imagined/inferred) modulate behavioral adaptations trial-by-trial in a consistent manner across both cohorts. The direction of the effects of reward is the same as in the main analyses presented in the manuscript: larger mean behavioral changes (smaller std) following ∆R-. First, concerning changes in “normalized” movement time (MT) trial-by-trial, we conducted a 2 x 2 factorial analysis of the centroid of the Gaussian distributions with the same factors Reward, Group and Bin. This analysis demonstrated a significant main effect of Reward (P = 2e-16), but not of Group (P = 0.974) or Bin (P = 0.281). There were no significant interactions between factors. The main Reward effect can be observed in the top panel of the figure below. The same analysis applied to the spread (std) of the Gaussian distributions revealed a significant main effect of Reward (P = 0.000213), with no additional main effects or interactions.

      Author response image 1.

      Next, conducting the same 2 x 2 factorial analyses on the centroid and spread of the Gaussian distributions fitted to the Consistency data, we also obtained a robust significant main effect of Reward. For the centroid variable, we obtained a significant main effect of Reward (P = 0.0109) and Group (P = 0.0294), while Bin and the factor interactions were non-significant. See the top panel of the figure below.

      On the other hand, Reward also significantly modulated the spread of the Gaussian distributions fitted to the Consistency data, P = 0.00498. There were no additional significant main effects or interactions. See the bottom panel in the figure below.

      Note that here the factorial analysis was performed on the logarithmic transformation of the std.

      Author response image 2.

      (16) I find this result interesting and I think it might be worthwhile to include it in the paper.

      Response: We have now included this result in our revised manuscript (page 28).

      (18) I referred to this sentence: "The app preferred sequence was their preferred putative habitual sequence while the 'any 6' or 'any 3'-move sequences were the goal-seeking sequences." In my understanding, this implies one choice is habitual and another indicates goal-directedness.

      One last small comment:
      In the Discussion it is stated: "Moreover, when faced with a choice between the familiar and a new, less effort-demanding sequence, the OCD group leaned toward the former, likely due to its inherent value. These insights align with the theory of goal-direction/habit imbalance in OCD (Gillan et al., 2016), underscoring the dominance of habits in particular settings where they might hold intrinsic value."

      This could equally be interpreted as goal-directed behavior, so I do not think there is conclusive support for this claim.

      Response: The choice of the familiar/trained sequence, as opposed to the 'any 6' or 'any 3'-move sequences, cannot be explicitly considered goal-directed: firstly, because the familiar app sequences were associated with less monetary reward (in the any-6 condition), and secondly, because participants would clearly need more effort and time to perform them. Even though these were automatic, it would still be much easier and faster to simply tap one finger sequentially 6 times (any-6) or 3 times (any-3). Therefore, the choice of the app sequence would not be optimal/goal-directed. In this sense, that choice aligns with the current theory of goal-direction/habit imbalance in OCD. We found that OCD patients prefer to perform the trained app sequences in the physical effort manipulation (any-3 condition). While this, on the one hand, cannot be explicitly considered a goal-directed choice, we agree that there is another possible goal involved here, which links to the intrinsic value associated with the familiar sequence. In this sense the action could potentially be considered goal-directed. This highlights the difficulty of this concept of value and agrees with: 1) Hommel and Wiers (2017): “Human behavior is commonly not driven by one but by many overlapping motives . . . and actions are commonly embedded into larger-scale activities with multiple goals defined at different levels. As a consequence, even successful satiation of one goal or motive is unlikely to also eliminate all the others” (p. 942); and 2) Kruglanski & Szumowska’s (2020) account that “habits that may be unwanted from the perspective of an outsider and hence “irrational” or purposeless, may be highly wanted from the perspective of the individual for whom a habit is functional in achieving some goal” (p. 1262) and therefore habits are goal-driven.

      References:

      Balleine BW, Dezfouli A. 2019. Hierarchical Action Control: Adaptive Collaboration Between Actions and Habits. Front Psychol 10:2735. doi:10.3389/fpsyg.2019.02735

      Du Y, Haith A. 2023. Habits are not automatic. doi:10.31234/osf.io/gncsf

      Graybiel AM, Grafton ST. 2015. The Striatum: Where Skills and Habits Meet. Cold Spring Harb Perspect Biol 7:a021691. doi:10.1101/cshperspect.a021691

      Haith AM, Krakauer JW. 2018. The multiple effects of practice: skill, habit and reduced cognitive load. Current Opinion in Behavioral Sciences 20:196–201. doi:10.1016/j.cobeha.2018.01.015

      Hardwick RM, Forrence AD, Krakauer JW, Haith AM. 2019. Time-dependent competition between goal-directed and habitual response preparation. Nat Hum Behav 1–11. doi:10.1038/s41562-019-0725-0

      Hommel B, Wiers RW. 2017. Towards a Unitary Approach to Human Action Control. Trends Cogn Sci 21:940–949. doi:10.1016/j.tics.2017.09.009

      Kruglanski AW, Szumowska E. 2020. Habitual Behavior Is Goal-Driven. Perspect Psychol Sci 15:1256–1271. doi:10.1177/1745691620917676

      Mazar A, Wood W. 2018. Defining Habit in Psychology In: Verplanken B, editor. The Psychology of Habit: Theory, Mechanisms, Change, and Contexts. Cham: Springer International Publishing. pp. 13–29. doi:10.1007/978-3-319-97529-0_2

      Robbins TW, Costa RM. 2017. Habits. Current Biology 27:R1200–R1206. doi:10.1016/j.cub.2017.09.060

      Verplanken B, Aarts H, van Knippenberg A, Moonen A. 1998. Habit versus planned behaviour: a field experiment. Br J Soc Psychol 37 ( Pt 1):111–128. doi:10.1111/j.2044-8309.1998.tb01160.x

      Watson P, O’Callaghan C, Perkes I, Bradfield L, Turner K. 2022. Making habits measurable beyond what they are not: A focus on associative dual-process models. Neurosci Biobehav Rev 142:104869. doi:10.1016/j.neubiorev.2022.104869

      Wood W. 2017. Habit in Personality and Social Psychology. Pers Soc Psychol Rev 21:389–403. doi:10.1177/1088868317720362

    1. Author response:

      The following is the authors’ response to the previous reviews.

      Public Reviews:

      Reviewer #2 (Public Review):

      Summary:

      The manuscript by Kelbert et al. presents results on the involvement of the yeast transcription factor Sfp1 in the stabilisation of transcripts whose synthesis it stimulates. Sfp1 is known to affect the synthesis of a number of important cellular transcripts, such as many of those that code for ribosomal proteins. The hypothesis that a transcription factor can remain bound to the nascent transcript and affect its cytoplasmic half-life is attractive. However, the association of Sfp1 with cytoplasmic transcripts remains to be validated, as explained in the following comments:

      A two-hybrid based assay for protein-protein interactions identified Sfp1, a transcription factor known for its effects on ribosomal protein gene expression, as interacting with Rpb4, a subunit of RNA polymerase II. Classical two-hybrid experiments depend on the presence of the tested proteins in the nucleus of yeast cells, suggesting that the observed interaction occurs in the nucleus. Unfortunately, the two-hybrid method cannot determine whether the interaction is direct or mediated by nucleic acids. The revised version of the manuscript now states that the observed interaction could be indirect.

      To understand to which RNA Sfp1 might bind, the authors used an N-terminally tagged fusion protein in a cross-linking and purification experiment. This method identified 264 transcripts for which the CRAC signal was considered positive and which mostly correspond to abundant mRNAs, including 74 ribosomal protein mRNAs or metabolic enzyme-abundant mRNAs such as PGK1. The authors did not provide evidence for the specificity of the observed CRAC signal, in particular what would be the background of a similar experiment performed without UV cross-linking. This is crucial, as Figure S2G shows very localized and sharp peaks for the CRAC signal, often associated with over-amplification of weak signal during sequencing library preparation.

      (1) To rule out possible PCR artifacts, we used a UMI (Unique Molecular Identifier) scan. UMIs are short, random sequences added to each molecule by the 5’ adapter to uniquely tag them. After PCR amplification and alignment to the reference genome, groups of reads with identical UMIs represent only one unique original molecule. Thus, UMIs allow distinguishing between original molecules and PCR duplicates, effectively eliminating the duplicates.

      (2) Looking closely at the peaks using the IGV browser, we noticed that the reads are by no means identical: each carries a mutation [probably due to the cross-linking] at a different position and has a different length. Note that the reads are highly reproducible between the two replicates.

      (3) CRAC+ genes do not all fall into the category of highly transcribed genes.  On the contrary, as depicted in Figure 6A (green dots), it is evident that CRAC+ genes exhibit a diverse range of Rpb3 ChIP and GRO signals. Furthermore, as illustrated in Figure 7A, when comparing CRAC+ to Q1 (the most highly transcribed genes), it becomes evident that the Rpb4/Rpb3 profile of CRAC+ genes is not a result of high transcription levels.

      (4) Only a portion of the RiBi mRNAs binds Sfp1, despite similar expression of all RiBi.

      (5) The CRAC+ genes represent a distinct group with many unique features. Moreover, many CRAC+ genes do not fall into the category of highly transcribed genes.

      (6) The biological significance of the 262 CRAC+ mRNAs was demonstrated by various experiments; all are inconsistent with technical flaws. Some examples are:

      a) Fig. 2A and B show that most reads of CRAC+ mRNAs were mapped to a specific location, close to the pA sites.

      b) Fig. 2C shows that most reads of CRAC+ mRNAs were mapped to a specific RNA motif.

      c) Most RiBi CRAC+ promoters contain Rap1 binding sites (p = 1.9 × 10⁻²²), whereas the vast majority of RiBi CRAC- promoters do not contain a Rap1 binding site (Fig. 3C).

      d) Fig. 4A shows that RiBi CRAC+ mRNAs become destabilized due to Sfp1 deletion, whereas RiBi CRAC- mRNAs do not. Fig. 4B shows similar results due to

      e) Fig. 6B shows that the impact of Sfp1 on backtracking is substantially higher for CRAC+ than for CRAC- genes. This is most clearly visible in RiBi genes.

      f) Fig. 7A shows that the Sfp1-dependent changes along the transcription units are substantially more pronounced for CRAC+ than for CRAC- genes.

      g) Fig. S4B shows that the chromatin-binding profile of Sfp1 is different for CRAC+ and CRAC- genes.
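
      To illustrate point (1) above, UMI-based removal of PCR duplicates can be sketched as follows. This is a minimal Python illustration, not the pipeline used for the CRAC data: the read fields (chrom, strand, start, umi, seq), the function name and the example reads are all hypothetical, and grouping by alignment start in addition to the UMI is an assumption about how duplicates are defined.

```python
from collections import defaultdict


def collapse_umi_duplicates(reads):
    """Keep one read per (chrom, strand, start, umi) group.

    `reads` is an iterable of dicts with the hypothetical keys
    'chrom', 'strand', 'start', 'umi' and 'seq'.
    """
    groups = defaultdict(list)
    for read in reads:
        key = (read["chrom"], read["strand"], read["start"], read["umi"])
        groups[key].append(read)
    # One representative per group stands for one original molecule.
    return [members[0] for members in groups.values()]


# Hypothetical aligned reads, for illustration only.
reads = [
    {"chrom": "chrIV", "strand": "+", "start": 100, "umi": "ACGTAC", "seq": "TTGA"},
    {"chrom": "chrIV", "strand": "+", "start": 100, "umi": "ACGTAC", "seq": "TTGA"},  # PCR duplicate
    {"chrom": "chrIV", "strand": "+", "start": 100, "umi": "GGCATT", "seq": "TTGA"},  # distinct molecule
    {"chrom": "chrIV", "strand": "+", "start": 103, "umi": "ACGTAC", "seq": "TGAC"},  # different position
]
unique_reads = collapse_umi_duplicates(reads)
print(len(reads), "reads ->", len(unique_reads), "unique molecules")
```

      In this toy input, only the second read is treated as a PCR duplicate; reads that share a position but carry different UMIs are kept as separate molecules, which is why a sharp peak can still reflect many independent cross-linking events.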

      In a validation experiment, the presence of several mRNAs in a purified SFP1 fraction was measured at levels that reflect the relative levels of RNA in a total RNA extract. Negative controls showing that abundant mRNAs not found in the CRAC experiment were clearly depleted from the purified fraction with Sfp1 would be crucial to assess the specificity of the observed protein-RNA interactions (to complement Fig. 2D).

      GPP1, a highly expressed gene, is not pulled down by Sfp1 (Fig. 2D). GPP1 (alias RHR2) was included in our Table S2 as one of the 264 CRAC+ genes, having a low CRAC value. However, when we inspected the GPP1 results using the IGV browser, we realized that the few reads mapped to GPP1 are actually anti-sense to GPP1 (perhaps they belong to the neighboring RPL34B gene, which is transcribed convergently with GPP1) (see Fig. 1 at the bottom of the document). Thus, GPP1 is not a CRAC+ gene and now serves as a control. We changed the text accordingly (see page 11, blue sentences). In light of this observation, we checked the other CRAC+ genes and found that, except for ALG2, they all contain sense reads (some contain both sense and anti-sense reads). ALG2 and GPP1 were removed, leaving 262 CRAC+ genes.

      The CRAC-selected mRNAs were enriched for genes whose expression was previously shown to be upregulated upon Sfp1 overexpression (Albert et al., 2019). The presence of unspliced RPL30 pre-mRNA in the Sfp1 purification was interpreted as a sign of co-transcriptional assembly of Sfp1 into mRNA, but in the absence of valid negative controls, this hypothesis would require further experimental validation. Also, whether the fraction of mRNA bound by Sfp1 is nuclear or cytoplasmic is unclear.

      Further experimental validation was provided in some of our figures (e.g., Fig. 5C, Fig. 3B).

      We argue that Sfp1 binds RNA co-transcriptionally and accompanies the mRNA until its demise in the cytoplasm. Co-transcriptional binding is shown by: (I) a drop in the Sfp1 ChIP-exo signal that coincides with the position of the Sfp1 binding site in the RNA (Fig. 5C), demonstrating a movement of Sfp1 from chromatin to the transcript; (II) the dependence of Sfp1 RNA binding on the promoter (Fig. 3B); and (III) the binding of intron-containing RNA. Taken together, these three experiments demonstrate that Sfp1 binds Pol II transcripts co-transcriptionally. Association of Sfp1 with cytoplasmic mRNAs is shown in the following experiments: (I) Figure 2D shows that Sfp1 pulled down full-length RNA, strongly suggesting that these RNAs are mature cytoplasmic mRNAs. (II) mRNAs encoding ribosomal proteins, which belong to the CRAC+ mRNA group, are degraded by Xrn1 in the cytoplasm (Bresson et al., Mol Cell 2020). The capacity of Sfp1 to regulate this process (Fig. 4A-D) is therefore consistent with cytoplasmic activity of Sfp1. (III) The effect of Sfp1 on deadenylation (Fig. 4D), a cytoplasmic process, is also consistent with cytoplasmic activity of Sfp1.

      To address the important question of whether co-transcriptional assembly of Sfp1 with transcripts could alter their stability, the authors first used a reporter system in which the RPL30 transcription unit is transferred to vectors under different transcriptional contexts, as previously described by the Choder laboratory (Bregman et al. 2011). While RPL30 expressed under an ACT1 promoter was barely detectable, the highest levels of RNA were observed in the context of the native upstream RPL30 sequence when Rap1 binding sites were also present. Sfp1 showed better association with reporter mRNAs containing Rap1 binding sites in the promoter region. Removal of the Rap1 binding sites from the reporter vector also led to a drastic decrease in reporter mRNA levels. Co-purification of reporter RNA with Sfp1 was only observed when Rap1 binding sites were included in the reporter. Negative controls for all the purification experiments might be useful.

      In the swapping experiment, the plasmid lacking RapBS serves as the control for the one with RapBS and vice versa (see Bregman et al., 2011). Remember that all these constructs give rise to identical RNA. Indeed, RapBS affects both mRNA synthesis and decay; therefore the controls are not ideal. However, see the next section.

      More importantly, in Fig. 3B “Input” panel, one can see that the RNA level of “construct F” was higher than the level of “construct E”. Despite this difference, only the RNA encoded by construct E was detected in the IP panel. This clearly shows that the detection of the RNA was not merely a result of its expression level.

      To complement the biochemical data presented in the first part of the manuscript, the authors turned to the deletion or rapid depletion of SFP1 and used labelling experiments to assess changes in the rate of synthesis, abundance and decay of mRNAs under these conditions. An important observation was that in the absence of Sfp1, mRNAs encoding ribosomal protein genes not only had a reduced synthesis rate, but also an increased degradation rate. This important observation needs careful validation,

      Indeed, we do provide validations in Fig. 4C, Fig. 4D and Fig. S3A, and during the revision we included an additional validation as Fig. S3B. Of note, we strongly suspect that GRO is among the most reliable approaches to determine half-lives (see our response in the first revision letter).

      As genomic run-on experiments were used to measure half-lives, and this particular method was found to give results that correlated poorly with other measures of half-life in yeast (e.g. Chappelboim et al., 2022 for a comparison). As an additional validation, a temperature shift to 42°C was used to show that, for specific ribosomal protein mRNA, the degradation was faster, assuming that transcription stops at that temperature. It would be important to cite and discuss the work from the Tollervey laboratory showing that a temperature shift to 42°C leads to a strong and specific decrease in ribosomal protein mRNA levels, probably through an accelerated RNA degradation (Bresson et al., Mol Cell 2020, e.g. Fig 5E).

      This was cited. Thank you. 

      Finally, the conclusion that mRNA deadenylation rate is altered in the absence of Sfp1, is difficult to assess from the presented results (Fig. 3D).

      This type of experiment was popular in the past. The results in the literature are similar to ours (in fact, ours are nicer). Please check the papers cited in our MS and a number of papers by Roy Parker.

      The effects of SFP1 on transcription were investigated by chromatin purification with Rpb3, a subunit of RNA polymerase, and the results were compared with synthesis rates determined by genomic run-on experiments. The decrease in Pol II presence on transcripts in the absence of SFP1 was not accompanied by a marked decrease in transcript output, suggesting an effect of Sfp1 in ensuring robust transcription and avoiding RNA polymerase backtracking. To further investigate the phenotypes associated with the depletion or absence of Sfp1, the authors examined the presence of Rpb4 along transcription units compared to Rpb3. An effect of Sfp1 deficiency was that this ratio, which decreased from the start of transcription towards the end of transcripts, increased slightly. To what extent this result is important for the main message of the manuscript is unclear.

      Suggestions: a) please clearly indicate in the figures when they correspond to reanalyses of published results.

      This was done.

      b) In table S2, it would be important to mention what the results represent and what statistics were used for the selection of "positive" hits. 

      This was discussed in the text.

      Strengths:

      - Diversity of experimental approaches used.

      - Validation of large-scale results with appropriate reporters.

      Weaknesses:

      - Lack of controls for the CRAC results and lack of negative controls for the co-purification experiments that were used to validate specific mRNA targets potentially bound by Sfp1.

      - Several conclusions are derived from complex correlative analyses that fully depend on the validity of the aforementioned Sfp1-mRNA interactions.

      We hope that our responses to Reviewer 2's thoughtful comments have ruled out concerns regarding the lack of controls.

      Recommendations for the authors:

      Reviewer #2 (Recommendations For The Authors):

      Please review the text for spelling errors. While not mandatory, wig or bedGraph files for the CRAC results would be very useful for the readers.

      Author response image 1.

      A snapshot of the GPP1 locus in IGV showing that all the reads are anti-sense, i.e., pointing in the opposite direction to the gene: the gene arrows (white arrows over blue, at the bottom) point to the right, whereas the reads point to the left.

    1. Author Response

      The following is the authors’ response to the current reviews.

      We confirm that the “count-down” parameter, mentioned by Reviewer 1, is indeed counted from the first lockdown day and increases continuously, even on days for which we do not have any data, and that this is clearly stated in the manuscript.


      The following is the authors’ response to the original reviews.

      Reviewer 1:

      (Note, while these authors do reference Derryberry et al., I thought that there could have been much more direct comparison between the results of the two approaches).

      We added some more discussion of the differences between the papers.

      One important drawback of the approach, which potentially calls into question the authors' conclusions, is that the acoustic sampling only occurred during the pandemic: for several lockdown periods and then for a period of 10 days immediately after the end of the final lockdown period in May of 2020. Several relevant things changed from March to May of 2020, most notably the shift from spring to summer, and the accompanying shift into and through the breeding season (differing for each of the three focal species). Although the statistical methods included an attempt to address this, neither the inclusion of the "count down" variable nor the temperature variable could account for any non-linear effects of breeding phenology on vocal activity. I found the reliance on temperature particularly troubling, because despite the authors' claims that it was "a good proxy of seasonality", an examination of the temperature data revealed a considerable non-linear pattern across much of the study duration. In addition, using a period immediately after the lockdowns as a "no-lockdown" control meant that any lingering or delayed effects of human activity changes in the preceding two months could still have been relevant (not to mention the fact that despite the end of an official lockdown, the pandemic still had dramatic effects on human activity during late May 2020).

      In general, the reviewer is correct, and we reformulated some of the text to more carefully address these points. However, we would like to note two things: (1) Changes occurred rapidly, with birds quickly changing their behavior. This is one of the main conclusions of our study, i.e., that urban-dwelling animals are highly plastic in behavior, so lingering effects were unlikely. (2) Changes occurred in both directions, and thus seasonality (which is expected to have a uni-directional effect) cannot explain everything we observed. We are not sure what the reviewer means by ‘considerable non-linear patterns’ when referring to the temperature. Except for ~5 days with temperatures that exceeded the expected average by 3-4 degrees, the temperature increased approximately linearly during the period, as expected from seasonality (see Author response image 1). Following the reviewer’s comment, we tested whether exclusion of data from these days changes the results and found no change.

      We would like to note that in terms of breeding, all birds were within the same state during both the lockdown and the non-lockdown periods. Parakeets and crows have a long breeding season, from February to the end of June, with one cycle. They stay around the nest throughout this season, especially at its peak in March-May. Prinias start slightly later, at the beginning of March, with 2-3 cycles until the end of June.

      Regarding the comment about human activity, as we now also note in the manuscript, reality in Israel was actually the opposite of the reviewer’s suggestion with people returning to normal behavior towards the end of the lockdown (even before its official removal). We believe that this added noise to our results, and that the effect of the lockdown was probably higher than we observed.

      Author response image 1.

      Another weakness of the current version of the manuscript is the use of a supposed "contradiction" in the existing literature to create the context for the present study. Although the various studies cited do have many differences in their results, those other papers lay out many nuanced hypotheses for those differences. Almost none of the studies cited in this manuscript actually reported blanket increases or decreases in urban birds, as suggested here, and each of those papers includes examples of species that showed different responses. To suggest that they are on opposite sides of a supposed dichotomy is a misrepresentation. Many of those other studies also included a larger number of different species, whereas this study focused on three. Finally, this study was completed at a much finer spatial scale than most others and was examining micro-habitat differences rather than patterns apparent across landscapes. I believe that highlighting differences in scale to explain nuanced differences among studies is a much better approach that more accurately adds to the body of literature.

      We thank the reviewer for this good feedback and revised the manuscript accordingly, placing more emphasis on the micro-scale of this study.

      Finally a note on L244-247: I would recommend against discounting the possibility that lockdowns resulted in changes to the birds' vocal acoustics, as Derryberry et al. 2020 found, especially while suggesting that their results were the effects of signal processing artifacts. Audio analysis is not my area of expertise, but isn't it possible that the birds did increase call intensity, but were simply not willing (or able) to increase it to the same degree as the additional ambient noise?

      This is an important question. The fact is that when ambient noise increases (at the relevant frequency channels), then the measured vocalizations will also increase. There is no way to separate the two effects. Thus, as scientists, when we cannot measure an effect, it is safer not to suggest an effect. Unfortunately, most studies that claim an increase in vocalizations’ intensity in noise, do not account for this potential artifact (and most of them do not estimate noise at a species-specific level as we have done). This has created a lot of “noise” in the field. We do not want to criticize the Derryberry results without analyzing the data, but from reading their methods it does not seem like they took the noise into account in their acoustic measurements. But if you look at their figure 4A you will see a lot of variability in measuring the minimum frequency – which could be strongly affected by ambient noise.

      In light of the above, we thus prefer to be careful and not to state changes that are probably false. We added some of this information to the manuscript. We also added the linear equations to the graph (in the caption of figure 3) where it can be seen that the slope is always <=1.
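
      The measurement artifact described above can be illustrated with a short calculation. The sketch below assumes that a call and the ambient noise in the same frequency band are incoherent, so that their powers add; the function name, the 60 dB call level and the noise levels are hypothetical values chosen for illustration, not measurements from this study.

```python
import math


def combined_level_db(signal_db, noise_db):
    """Level of two incoherent sources measured together (power sum, in dB)."""
    total_power = 10 ** (signal_db / 10) + 10 ** (noise_db / 10)
    return 10 * math.log10(total_power)


call_db = 60.0  # hypothetical true call level, held constant throughout
for noise_db in (40.0, 50.0, 60.0):
    measured = combined_level_db(call_db, noise_db)
    print(f"noise {noise_db:.0f} dB -> measured {measured:.2f} dB")
```

      Under this assumption the measured level rises with the noise even though the call itself is unchanged, and it rises by less than 1 dB for every 1 dB of added noise.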

      Reviewer 2:

      The explanation of methods can be improved. For example, it is not clear if data were low-pass filtered before resampling to avoid aliasing.

      We edited the methods and hopefully they are clearer now. Regarding the specific question – yes, an LPF was applied to prevent aliasing before the resampling. This information was added to the manuscript.
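
      For readers unfamiliar with the step, low-pass filtering before downsampling can be sketched as follows. This is a generic, standard-library-only illustration and not the filter used in the manuscript: the windowed-sinc design, the 48 kHz sampling rate, the decimation factor of 4 and all function names are assumptions made for the example.

```python
import math


def lowpass_fir(num_taps, cutoff):
    """Windowed-sinc low-pass FIR; `cutoff` is a fraction of the sampling rate (0 to 0.5)."""
    mid = (num_taps - 1) / 2
    taps = []
    for n in range(num_taps):
        x = n - mid
        sinc = 2 * cutoff if x == 0 else math.sin(2 * math.pi * cutoff * x) / (math.pi * x)
        hamming = 0.54 - 0.46 * math.cos(2 * math.pi * n / (num_taps - 1))
        taps.append(sinc * hamming)
    gain = sum(taps)
    return [t / gain for t in taps]  # normalise to unity gain at DC


def filter_then_decimate(samples, factor, num_taps=101):
    """Low-pass below the new Nyquist frequency, then keep every `factor`-th sample."""
    taps = lowpass_fir(num_taps, cutoff=0.45 / factor)  # slightly below 0.5 / factor
    half = num_taps // 2
    out = []
    for i in range(half, len(samples) - half, factor):
        window = samples[i - half:i + half + 1]
        out.append(sum(w * t for w, t in zip(window, taps)))
    return out


def rms(values):
    return math.sqrt(sum(v * v for v in values) / len(values))


fs = 48_000   # hypothetical original sampling rate (Hz)
factor = 4    # hypothetical decimation factor -> 12 kHz, new Nyquist 6 kHz
t = [k / fs for k in range(4800)]
low_tone = [math.sin(2 * math.pi * 1_000 * x) for x in t]    # below the new Nyquist
high_tone = [math.sin(2 * math.pi * 10_000 * x) for x in t]  # would alias to 2 kHz if unfiltered
low_out = filter_then_decimate(low_tone, factor)
high_out = filter_then_decimate(high_tone, factor)
print(f"1 kHz RMS after decimation:  {rms(low_out):.3f}")
print(f"10 kHz RMS after decimation: {rms(high_out):.3f}")
```

      The 1 kHz tone passes essentially unchanged, whereas the 10 kHz tone, which lies above the new Nyquist frequency, is strongly attenuated before the samples are dropped and therefore cannot fold back into the analysed band.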

      It is quite possible that birds move into the trees and further from the recorders with human activity. Since sound level decreases by the square of the distance of the source from the recorders, this could significantly affect the data. As indicated in the Discussion, this is a significant parameter that could not be controlled.

      The reviewer is correct, and we addressed this point. Such biases could arise with any type of surveying, including manual transects (except, perhaps, for placing tags on the animals). We note that we only analyzed high-SNR signals and that the species we selected somewhat overcome this bias: both crows and parakeets are not shy, and Prinias are shy anyway and prefer not to be out in the open. We would also expect to see a stronger effect for human speech if this was a central phenomenon, and we did not see this, but of course this might have affected our results.
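
      The size of the distance effect raised by the reviewer can be gauged with a simple free-field estimate. The sketch assumes an idealised point source with spherical spreading and no reflections or absorption; the function name and the distances are hypothetical and are not taken from the recordings.

```python
import math


def level_change_db(d_near, d_far):
    """Change in received level (dB) when a point source moves from d_near to d_far."""
    # Intensity falls with the square of distance, i.e. 20*log10 of the distance ratio.
    return -20 * math.log10(d_far / d_near)


for d_far in (10.0, 20.0, 40.0):
    print(f"5 m -> {d_far:.0f} m: {level_change_db(5.0, d_far):.1f} dB")
```

      Under these assumptions each doubling of distance lowers the received level by about 6 dB, so a bird retreating from 5 m to 20 m would be recorded roughly 12 dB quieter.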

      In interpreting the data, the authors mention the effect of human activity on bird vocalizations in the context of inter-species predator-prey interactions; however, the presence of humans could also modify intraspecies interactions by acting as triggers for communication of warning and alarm, and/or food calls (as may sometimes be the case) to conspecifics. Along the same lines, it is important to have a better understanding of the behavioral significance of the syllables used to monitor animal activity in the present study.

      We agree with this point and added more discussion of both this potential bias and the type of syllables that were analyzed.

      Another potential effect that may influence the results but is difficult to study, relates to the examination of vocalizations near to the ambient noise level. This is the bandwidth of sound levels where most significant changes may occur, for example, due to the Lombard effect demonstrated in bird and bat species. However, as indicated, these are also more difficult to track and quantify. Moreover, human generated noise, other than speech, may be a more relevant factor in influencing acoustic activity of different bird species. Speech, per se, similar to the vocalizations of many other species, may simply enrich the acoustic environment so that the effects observed in the present study may be transient without significant long-term consequences.

      We note that we already included a noise parameter (in addition to human speech) in the original manuscript. Following the reviewer’s comment, we examined another factor: we replaced the previous ambient noise parameter with an estimate of ambient noise under 1 kHz, which should reflect most anthropogenic noise (not restricted to human speech). This model gave very similar results to the previous one (which is not very surprising, as noise is usually correlated). We added this information to the revised manuscript, and we have now also added examples of anthropogenic noise to the supplementary materials (Fig. S8). In general, we accept the comments made by the reviewer, but would like to emphasize that we only analyzed high-SNR vocalizations (and not vocalizations that were close to the noise level). This strategy should have overcome biases that resulted from slight changes in ambient noise.

      In general, the authors achieved their aim of illustrating the complexity of the effect of human activity on animal behavior. At the same time, their study also made it clear that estimating such effects is not simple given the dynamics of animal behavior. For example, seasonality, temperature changes, animal migration and movement, as well as interspecies interactions, such as related to predator-prey behavior, and inter/intra-species competition in other respects can all play into site-specific changes in the vocal activity of a particular species.

      We completely agree and tried to further emphasize this in the revised manuscript. This is one of the main conclusions of this study – we should be careful when reaching conclusions.

      Although the methods used in the present study are statistically rigorous, a multivariate approach and visualization techniques afforded by principal components analysis and multidimensional scaling methods may be more effective in communicating the overall results.

      Following this comment, we ran a discriminant function analysis (DFA) with the parameters of the best model (site category, ambient noise, human activity, temperature and lockdown state) with the task of classifying the level of bird activity. The DFA classified activity significantly above chance, and the weights of the parameters gave some insight into their relative importance. We added this information to the revised manuscript.

      Suggestions for improvement:

      In Figure 2, the labeling of the Y-axis in the right panel should be moved to the left, similar to A and C. This will provide clear separation between the two side-to-side panels.

      Revised

      In Figure 3, it will be good to see the regression lines (as dashed lines) separately for the lockdown and no-lockdown conditions in addition to the overall effect.

      Revised

      Editor:

      Limitations

      Scale: The study's limited spatial and temporal scale was not addressed by the authors, which contrasts with the broader scope of other cited studies. To enhance the significance of the study, acknowledging and clearly highlighting this limitation, along with its potential caveats, modifications in the language used throughout the text would be beneficial. Furthermore, although the authors examined slight variations in habitat, it is important to note that all sites were primarily located within an urban landscape.

      We revised the manuscript accordingly.

      Control period: The control period is significantly shorter than the lockdown treatment period and occurs at a different time of year, potentially impacting the vocalization patterns of birds due to different annual cycle stages. It is crucial to consider that the control period falls within the pandemic timeframe despite being shortly after the lockdowns ended.

      Revised – we included a control comparison to periods of equal length within the lockdown. People gradually stopped obeying the lockdown regulations before its removal so in fact, the official removal date is probably an overestimate for the effect of the lockdown. We now explain this.

      Recommendations

      Human-generated noise, beyond speech, might have a greater influence on the acoustic activity of various bird species, but previous studies lacked detailed human activity data. Instead of solely noting the number of human talkers, the authors could quantify other aspects of human activity such as vehicles or overall anthropogenic noise volume. Exploring the relationships between these factors and bird activity at a fine scale, while disentangling them from bird detection, would be compelling. It is important to consider the potential difficulty in resolving other anthropogenic sounds within a specific bandwidth, which could be demonstrated to readers through spectrograms and potential post-pandemic changes. Such information, including daily coefficient of variation/fluctuation rather than absolute frequency spectra, could provide valuable insights.

      We note that we had already included an ambient noise factor (in addition to human speech) in the previous version. Following the reviewers’ comments, we examined another factor: we replaced the current ambient noise parameter with the ambient noise under 1 kHz, which should reflect most anthropogenic noise (not restricted to human speech). This model gave very similar results to the previous one (which is not surprising, as noise is usually correlated). We also added several spectrograms to the Supplementary material that show examples of different types of noise.

      Authors should limit their data interpretation to the impact of lockdown on behavioral responses within small-scale variations in habitat. A key critique is the assumption that activity changes solely resulted from the lockdown, disregarding other environmental factors and phenology.

      Following the editor’s comment, we realized that our conclusions and assertions were not clear. We never claimed that activity changes resulted solely from the lockdown. While revising the manuscript, we ensured that we show a significant effect of temperature, ambient noise and human activity, none of which depends on the lockdown. We made an effort to emphasize the complexity of the system. We show that the lockdown seemed to have an additional impact, but we never claimed it was the only factor.

      To address this, the authors could compare acoustic monitoring data within a shorter timeframe before and after the lockdown (20 days), while also controlling for temperature effects, to strengthen the validity of their claims. They would need to explain in their discussion, however, that such a comparison may still be confounded by any carry-over effects from the 10 days of treatment.

      This analysis would be difficult because although the lockdown was officially removed at a specific date, it was gradually less respected by the citizens and thus the last period of the lockdown was somewhere between lockdown and no-lockdown. This is why we chose the approach of taking 10 days randomly from within the lockdown period and comparing them with the 10 post-lockdown days. We now clarify the reason better.
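
      One way such an equal-length comparison could be set up is sketched below: random 10-day subsets are drawn repeatedly from the lockdown period and each is compared with the 10 post-lockdown days. This is a minimal sketch under stated assumptions, not the analysis used in the manuscript; the function name, the number of draws and the daily activity values are all made up for illustration.

```python
import random
from statistics import mean

def subsample_differences(lockdown, post_lockdown, n_days=10, n_draws=1000, seed=0):
    """Mean-activity differences between random lockdown subsets and the post-lockdown days.

    Each draw picks n_days lockdown days without replacement, so both
    periods are compared at equal length.
    """
    rng = random.Random(seed)  # fixed seed keeps the sketch reproducible
    post_mean = mean(post_lockdown)
    return [mean(rng.sample(lockdown, n_days)) - post_mean for _ in range(n_draws)]

# Made-up daily call counts, for illustration only (not data from the study).
lockdown_days = [42, 38, 51, 47, 40, 44, 39, 50, 46, 43,
                 41, 48, 45, 37, 49, 52, 36, 44, 47, 42]
post_days = [35, 33, 40, 31, 38, 36, 34, 39, 32, 37]

diffs = subsample_differences(lockdown_days, post_days)
print(f"median difference across draws: {sorted(diffs)[len(diffs) // 2]:.1f}")
```

      Repeating the draw shows how much the estimated difference depends on which lockdown days happen to be selected, which is relevant here because compliance declined gradually towards the end of the lockdown.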

      An option is that authors could frame their analysis as a study of the behavior of wildlife coming out of a lockdown, to draw a distinction from other studies that compared pre-pandemic data to pandemic data.

      Good idea – revised.

    1. Author Response

      The following is the authors’ response to the previous reviews.

      Point to point response for the editors

      We are deeply grateful for the time you have devoted to reviewing this manuscript, and we sincerely thank you. Your insightful feedback has been instrumental in enhancing the quality of our work.

      In the revised version of the manuscript, we have carefully addressed each of the concerns you raised. Below, you will find a detailed summary of how your feedback has been incorporated to improve the overall content and clarity of the document.

      1. P2RX7 effects: In Figure 2, the vehicle-treated P2RX7 knockout (panel M) shows an Ashcroft score of about 1.5 after BLM. Comparing this to the Ashcroft score of 3 after BLM in the wildtype (panel C) suggests that P2RX7 deletion is an effective way to reduce fibrosis by half!

      The argument that HEI3090 also reduces fibrosis by activating P2RX7 is of course very difficult to convey and it seems contradictory that P2RX7 deletion and P2RX7 activation can be both anti-fibrotic. This is an unusual claim and confuses the reviewers as well as the future readers.

      This has many important health implications because activating an inflammatory pathway via P2RX7 and IL-18 could be risky in terms of a fibrosis treatment as inflammatory activation can also worsen fibrosis. The authors' own P2RX7 KO data (untreated vehicle groups) indeed confirms that P2RX7 can be pro-fibrotic.

      We thank the editors for their comment highlighting the lack of clarity in our message. Indeed, we verified whether the antifibrotic action of HEI3090 depends on the expression of P2RX7 by inducing lung fibrosis in P2RX7 KO mice. In doing so, we initially observed that P2RX7 plays a role in the development of BLM-induced lung fibrosis. This is illustrated by a decrease of 50% in the Ashcroft score, as shown in Figure 2M and Supplemental Figure 2C of the revised manuscript.

      To increase the clarity of our message, we added the following paragraph to the text:

      "We further verified whether the antifibrotic action of HEI3090 depends on the expression of P2RX7 by inducing lung fibrosis in p2rx7 knockout (KO) mice. In doing so, we initially observed that P2RX7 plays a role in the development of BLM-induced lung fibrosis. This is illustrated by a decrease of 50% in the Ashcroft score, with a mean value of 1.7 in P2RX7 knockout mice compared to 3 in wild-type mice (Figure 2M and Supplemental Figure 2C). It is important to note that p2rx7 -/- mice still exhibit signs of lung fibrosis, such as thickening of the alveolar wall and a reduction in free air space, in comparison to naïve mice that received PBS instead of BLM (see Supplemental Figure 2A). This result confirms a previous report indicating that BLM-induced lung fibrosis partially depends on the activation of the P2RX7/pannexin-1 axis, leading to the production of IL-1β in the lung. Additionally, in contrast to the observations in WT mice, HEI3090 failed to attenuate the remaining lung fibrosis in p2rx7 -/- mice, as measured by the Ashcroft score (Figure 2M), the percentage of lung tissue with fibrotic lesions, or the intensity of collagen fibers (Supplemental Figure 2D). These results show that P2RX7 alone participates in fibrosis and that HEI3090 exerts a specific antifibrotic effect through this receptor (see Supplemental Figure 2C)."

      Since we used the HEI3090 compound in this study, and to stay closer to the results, we have replaced the titles of two chapters in the Results section as follows:

      “HEI3090 inhibits the onset of pulmonary fibrosis in the bleomycin mouse model” instead of “P2RX7 activation inhibits the onset of pulmonary fibrosis in the bleomycin mouse model”, and “HEI3090 shapes immune cell infiltration in the lungs” instead of “P2RX7 activation shapes immune cell infiltration in the lungs”.

      We concur that the observation of both anti-fibrotic effects following P2RX7 deletion and P2RX7 activation appears contradictory. This specific aspect has been thoroughly addressed and extensively discussed in the revised manuscript.

      “A major unmet need in the field of IPF is a new treatment to fight this incurable disease. In this preclinical study, we demonstrate the ability of immune cells to limit lung fibrosis progression. Based on the hypothesis that a local activation of a T cell immune response and upregulation of IFN-γ production has antifibrotic properties, we used the HEI3090 positive modulator of the purinergic receptor P2RX7, previously developed in our laboratory (Douguet et al., 2021), to demonstrate that activation of the P2RX7/IL-18 pathway attenuates lung fibrosis in the bleomycin mouse model. We have demonstrated that lung fibrosis progression is inhibited by HEI3090 in the fibrotic phase but also in the acute phase of the BLM fibrosis mouse model, i.e. during the period of inflammation. This lung fibrosis mouse model, commonly employed in preclinical investigations, has recently been recognized as the optimal model for studying IPF (Jenkins et al., 2017). In this model, the intrapulmonary administration of BLM induces DNA damage in alveolar epithelial type 1 cells, triggering cellular demise and the release of ATP. The extracellular release of ATP from injured cells activates the P2RX7/pannexin 1 axis, initiating the maturation of IL1β and subsequent induction of inflammation and fibrosis. In line with this, mice lacking P2RX7 exhibited reduced neutrophil counts in their bronchoalveolar fluids and decreased levels of IL1β in their lungs compared to WT mice (Riteau et al., 2010). Based on these findings, Riteau and colleagues postulated that the inhibition of P2RX7 activity may offer a potential strategy for the therapeutic control of fibrosis in lung injury. In the present study we provided strong evidence showing that selective activation of P2RX7 on immune cells, through the use of HEI3090, can dampen inflammation and fibrosis by releasing IL-18.
      The efficacy of HEI3090 in inhibiting lung fibrosis was evaluated histologically over the whole lung surface using three independent approaches: the Ashcroft score, quantification of fibroblasts/myofibroblasts (CD140a) and polarized-light microscopy of Sirius Red staining to quantify collagen fibers. All these methods of fibrosis assessment revealed that HEI3090 exerts an inhibitory effect on lung fibrosis, underscoring the necessity for a thorough pre-clinical assessment of HEI3090's mode of action. Notably, HEI3090 functions as an activator, rather than an inhibitor, of P2RX7, further emphasizing the importance of elucidating its intricate mechanisms.”

      We trust that the detailed explanation provided therein will adequately persuade both the reviewers and future readers.

      1. The statistical concerns are based on the phrasing of "the experiment was stopped when significantly statistical results were observed". This is different from the power analysis approach that the authors describe in their latest rebuttal. However, it raises the question why the power analysis was performed using "on a one-way ANOVA analysis comparing in each experiment the vehicle and the treated group". The analyses in the manuscript use the Mann-Whitney test for several comparisons, which has the assumption that the samples do NOT have a normal distribution. An ANOVA and t-tests have the assumption that samples are normally distributed. If the power analysis and "statistical forecasting" assumed a normal distribution and used an ANOVA, then shouldn't all the analyses also use a statistical test appropriate for normally distributed samples such as ANOVA and t-tests?

      Several of the data points in the figures seem to be normally distributed and therefore t-test for two group comparisons would be more appropriate. The most rigorous approach would be to check for normal distribution before choosing the correct statistical test and using the t-test/ANOVA in normally distributed data as well as Mann-Whitney for non-normally distributed data.

      In the Material and Method section of the revised manuscript, we describe our approach to determining the size of the experimental groups.

      “The determination of experimental group sizes involved conducting a pilot experiment with four mice in each group. Subsequently, a power analysis, based on the pilot experiment's findings (which revealed a 40% difference with a standard error of 0.9, α risk of 0.05, and power of 0.8), was performed to ascertain the appropriate group size for studying the effects of HEI3090 on BLM-induced lung fibrosis. The results of the pilot experiment and power analysis indicated that a group size of four mice was sufficient to characterize the observed effects. For each full-scale experiment, we initiated the study with 6 to 8 mice per group, ensuring a minimum of 5 mice in each group for robust statistical analysis. Additionally, we systematically employed the ROUT method to identify and subsequently exclude any outliers present in each experiment before conducting statistical analyses.”
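
      For readers unfamiliar with this kind of calculation, the sketch below shows a generic normal-approximation formula for the per-group sample size of a two-sided, two-group comparison. It is an illustration only and not the computation performed for the manuscript: the function name is hypothetical, the effect sizes in the example are arbitrary, and the normal approximation tends to slightly underestimate the sample size needed for small groups compared with a t-based calculation.

```python
import math
from statistics import NormalDist

def n_per_group(effect_size_d: float, alpha: float = 0.05, power: float = 0.8) -> int:
    """Approximate per-group sample size for a two-sided, two-group comparison.

    Uses the normal approximation n = 2 * (z_{1-alpha/2} + z_{power})^2 / d^2,
    where d is the standardized difference (mean difference / SD).
    """
    if effect_size_d <= 0:
        raise ValueError("effect size must be positive")
    z = NormalDist()  # standard normal distribution
    z_alpha = z.inv_cdf(1.0 - alpha / 2.0)
    z_power = z.inv_cdf(power)
    return math.ceil(2.0 * (z_alpha + z_power) ** 2 / effect_size_d ** 2)

# Arbitrary standardized effect sizes, not values taken from the pilot data.
for d in (1.0, 1.5):
    print(d, n_per_group(d))
```

      The required group size scales with the inverse square of the standardized effect, so the result is very sensitive to the variability estimated from a small pilot experiment.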

      We now describe in the Material and Method section how we carried out the statistical analyses.

      “Quantitative data were described and presented graphically as medians and interquartiles or means and standard deviations. The distribution normality was tested with the Shapiro's test and homoscedasticity with a Bartlett's test. For two categories, statistical comparisons were performed using the Student's t-test or the Mann–Whitney's test. For three or more categories, analysis of variance (ANOVA) or, for non-parametric data, the Kruskal–Wallis test was performed to test variables expressed as categories versus continuous variables. If this test was significant, we used the Tukey's test to compare these categories and the Bonferroni’s test to adjust the significance threshold. For the Gene Set Enrichment Analyses (GSEA), bilateral Kolmogorov–Smirnov test and false discovery rate (FDR) were used. All statistical analyses were performed by a biostatistician using Prism8 program from GraphPad software. Tests of significance were two-tailed and considered significant with an alpha level of P < 0.05 (graphically: * for P < 0.05, ** for P < 0.01, *** for P < 0.001).”

      We also added to the legend of each figure the statistical analysis used to determine each p-value.

      1. Adoptive transfer: The concerns of the reviewers include an unclear analysis of the effects of adoptive transfer itself and the approaches used to analyze the data independent of the HEI3090 effect. For example, in Figure 4, the adoptive transfer IL18-/- cells (vehicle group) leads to an Ashcroft score of about 1 and among the lowest of the BLM exposed mice. Does that mean that IL18 is pro-fibrotic and that its absence is beneficial? If yes, it would go against the core premise of the study that IL18 is beneficial. Statistical comparisons of the all the vehicle conditions in the adoptive transfer would help clarify whether adoptive transfer of NLRP3-/-, IL18-/- in wild-type and P2RX7-/- mice reduces or increases fibrosis. Such multiple comparisons are necessary to fully understand the adoptive transfer studies and would also require the appropriate statistical test with corrections for multiple comparisons such as Kruskal-Wallis for data without normal distribution and ANOVA with post hoc correction for normal distribution.

      We added a new paragraph in the revised version of the manuscript to explain the adoptive transfer approach.

      “We wanted to further investigate the mechanism of action of HEI3090 by identifying the cellular compartment and signaling pathway required for its activity. Since the expression of P2RX7 and the P2RX7-dependent release of IL-18 are mostly associated with immune cells (Ferrari et al., 2006), and since HEI3090 shapes the lung immune landscape (Figure 3), we investigated whether immune cells were required for the antifibrotic effect of HEI3090. To do so, we conducted adoptive transfer experiments wherein immune cells from a donor mouse were intravenously injected one day before BLM administration into an acceptor mouse. The intravenous injection route was chosen as it is a standard method for targeting the lungs, as previously documented (Wei and Zhao, 2014). This approach was previously used with success in our laboratory (Douguet et al., 2021). It is noteworthy that this adoptive transfer approach did not influence the response to HEI3090. This was observed consistently in both p2rx7 -/- mice and p2rx7 -/- mice that received splenocytes of the same genetic background. In both cases, HEI3090 failed to mitigate lung fibrosis, as depicted in Figure 2M and Supplemental Figures 2D and 6A and B.”

      We added Supplemental Figure 7, showing that the genetic background does not impact lung fibrosis at steady-state levels, where p-values were analyzed by one-way ANOVA, with Kruskal-Wallis test for multiple comparisons.

      Author response image 1.

      Supplemental Figure 7: The genetic background does not impact lung fibrosis at steady-state levels. p2rx7-/- mice were given 3×10^6 WT, nlrp3-/-, il18-/- or il1b-/- splenocytes i.v. one day prior to BLM delivery (i.n., 2.5 U/kg). p2rx7-/- mice or p2rx7-/- mice adoptively transferred with splenocytes from the indicated genetic background were treated daily i.p. with mg/kg HEI3090 or vehicle for 14 days. Fibrosis score was assessed by the Ashcroft method. P-values were analyzed on all treated and non-treated groups by one-way ANOVA, with Kruskal-Wallis test for multiple comparisons. The violin plot illustrates the distribution of Ashcroft scores across the indicated experimental groups. The width of the violin at each point represents the density of data, and the central line indicates the median expression level. Each point represents one biological replicate. ns, not significant.

    1. Author response:

      The following is the authors’ response to the previous reviews

      Public Reviews:

      Reviewer #1 (Public review):

      In this study, Marocco and colleagues perform a deep characterization of the complex molecular mechanism guiding the recognition of a particular CELLmotif previously identified in hepatocytes in another publication. Having miR-155-3p with or without this CELLmotif as initial focus, the authors identify 21 proteins differentially binding to these two miRNA versions. From these, they decided to focus on PCBP2. They elegantly demonstrate PCBP2 binding to the miR-155-3p WT version but not to the CELLmotif-mutated version. miR-155-3p contains a hEXOmotif identified in a different report, whose recognition is largely mediated by another RNA-binding protein called SYNCRIP. Interestingly, mutation of the hEXOmotif contained in miR-155-3p did not only blunt SYNCRIP binding, but also PCBP2 binding despite the maintenance of the CELLmotif. This indicates that somehow SYNCRIP binding is a pre-requisite for PCBP2 binding. EMSA assay confirms that SYNCRIP is necessary for PCBP2 binding to miR-155-3p, while PCBP2 is not needed for SYNCRIP binding. The authors then aim to extend these findings to other miRNAs containing both motifs. For that, they perform a small-RNA-Seq of EVs released from cells knocked down for PCBP2 versus control cells, identifying a subset of miRNAs whose expression either increases or decreases. The assumption is that those miRNAs containing the PCBP2-binding CELLmotif should now be less retained in the cell and go more to extracellular vesicles, thus reflecting a higher EV expression. The specific subset of miRNAs having both the CELLmotif and hEXOmotif (9 miRNAs) whose expression increases in EVs due to PCBP2 reduction is also affected by knocking down SYNCRIP, in the sense that reduction of SYNCRIP leads to lower EV sorting. Further experiments confirm that PCBP2 and SYNCRIP bind to these 9 miRNAs and that knocking down SYNCRIP impairs their EV sorting.

      In the revised manuscript, the authors have addressed most of my concerns and questions. I believe the new experiments provide stronger support for their claims. My only remaining concern is the lack of clarity in the replicates for the EMSA experiment. The one shown in the manuscript is clear; however, the other three replicates hardly show that knocking down SYNCRIP has an effect on PCBP2 binding. Even worse is the fact that these replicates do not support at all that PCBP2 silencing has no effect on SYNCRIP binding, as the bands for those types of samples are, in most of the cases, not visible. I think the authors should work on repeating a couple of times EMSA experiment.

      We thank this Reviewer for having appreciated the novelty and the robustness of our data. In accordance with the Reviewer’s concern, we repeated the EMSA assay, specifically to address the PCBP2-independent SYNCRIP binding. In Author response image 1, we report the new EMSA replicates (top), the quantification of each signal (bottom) and the mean of EMSA signals relative to the three independent experiments (right). We hope that the new evidence will meet the required standards.

      Author response image 1.

      Reviewer #2 (Public review):

      Summary:

      The author of this manuscript aimed to uncover the mechanisms behind miRNA retention within cells. They identified PCBP2 as a crucial factor in this process, revealing a novel role for RNA-binding proteins. Additionally, the study discovered that SYNCRIP is essential for PCBP2's function, demonstrating the cooperative interaction between these two proteins. This research not only sheds light on the intricate dynamics of miRNA retention but also emphasizes the importance of protein interactions in regulating miRNA behavior within cells.

      Strengths:

      This paper makes important progress in understanding how miRNAs are kept inside cells. It identifies PCBP2 as a key player in this process, showing a new role for proteins that bind RNA. The study also finds that SYNCRIP is needed for PCBP2 to work, highlighting how these proteins work together. These discoveries not only improve our knowledge of miRNA behavior but also suggest new ways to develop treatments by controlling miRNA locations to influence cell communication in diseases. The use of liver cell models and thorough experiments ensures the results are reliable and show their potential for RNA-based therapies.

      Weaknesses:

      The manuscript is well-structured and presents compelling data, but I noticed a few minor corrections that could further enhance its clarity:

      Figure References: In the response to Reviewer 1, the comment states, "It's not Panel C, it's Panel A of Figure 1"-this should be cross-checked for consistency.

      Supplementary Figure 2 is labeled as "Panel A"-please verify if additional panels (B, C, etc.) are intended.

      Western Blot Quality: The Alix WB shows some background noise. A repeat with optimized conditions (or inclusion of a cleaner replicate) would strengthen the data. Adding statistical analysis for all WBs would also reinforce robustness.

      These are relatively small refinements, and the manuscript is already in excellent shape. With these adjustments, it will be even stronger.

      We deeply thank this Reviewer for having considered this new version of the manuscript and for having described its shape as excellent. To address the Reviewer’s concerns, we cross-checked the figure panels cited in the text for consistency and corrected them accordingly. Regarding the qualitative analysis of EV markers, we repeated the western blot analysis with optimized conditions as suggested and included the new panel (Author response image 2) in Supplementary Figure 2, which makes the signal for ALIX expression easier to appreciate.

      Author response image 2.

       

      Recommendations for the authors:

      Reviewer #2 (Recommendations for the authors):

      Careful reading is required to rectify typo errors.

      We thank the Reviewer for this suggestion. We amended the text to correct typographical errors.

    1. Author response:

      The following is the authors’ response to the previous reviews.

      Reviewer #1 (Public Review): 

      The reviewer retained most of their comments from the previous reviewing round. In order to meet these comments and to further examine the dynamic nature of threat omission-related fMRI responses, we now re-analyzed our fMRI results using the single trial estimates. The results of these additional analyses are added below in our response to the recommendations for the authors of reviewer 1. However, we do want to reiterate that there was a factually incorrect statement concerning our design in the reviewer’s initial comments. Specifically, the reviewer wrote that “25% of shocks are omitted, regardless of whether subjects are told that the probability is 100%, 75%, 50%, 25%, or 0%.” We want to repeat that this is not what we did. 100% trials were always reinforced (100% reinforcement rate); 0% trials were never reinforced (0% reinforcement rate). For all other instructed probability levels (25%, 50%, 75%), the stimulation was delivered in 25% of the trials (25% reinforcement rate). We have elaborated on this misconception in our previous letter and have added this information more explicitly in the previous revision of the manuscript (e.g., lines 125-129; 223-224; 486-492).   

      Reviewer #1 (Recommendations For The Authors): 

      I do not have any further recommendations, although I believe an analysis of learning-related changes is still possible with the trial-wise estimates from unreinforced trials. The authors' response does not clarify whether they tested for interactions with run, and thus the fact that there are main effects does not preclude learning. I kept my original comments regarding limitations, with the exception of the suggestion to modify the title. 

      We thank the reviewer for this recommendation. In line with their suggestion, we have now reanalyzed our main ROI results using the trial-by-trial estimates we obtained from the first-level omission>baseline contrasts. Specifically, we extracted beta-estimates from each ROI and entered them into the same Probability x Intensity x Run LMM we used for the relief and SCR analyses. Results from these analyses (in the full sample) were similar to our main results. For the VTA/SN model, we found main effects of Probability (F = 3.12, p = .04), and Intensity (F = 7.15, p < .001) (in the model where influential outliers were rescored to 2SD from mean). There was no main effect of Run (F = 0.92, p = .43) and no Probability x Run interaction (F = 1.24, p = .28). If the experienced contingency had interfered with the instructions, there should have been a Probability x Run interaction (with the effect of Probability only being present in the first runs). Since we did not observe such an interaction, our results indicate that even though some learning might still have taken place, the main effect of Probability remained present throughout the task.

      There is an important side note regarding these analyses: For the first level GLM estimation, we concatenated the functional runs and accounted for baseline differences between runs by adding run-specific intercepts as regressors of no-interest. Hence, any potential main effect of run was likely modeled out at first level. This might explain why, in contrast to the rating and SCR results (see Supplemental Figure 5), we found no main effect of Run. Nevertheless, interaction effects should not be affected by including these run-specific intercepts.

      Note that when we ran the single-trial analysis for the ventral putamen ROI, the effect of intensity became significant (F = 3.89, p = .02). Results neither changed for the NAc, nor the vmPFC ROIs.  

      Reviewer #2 (Public Review): 

      Comments on revised version: 

      I want to thank the authors for their thorough and comprehensive work in revising this manuscript. I agree with the authors that learning paradigms might not be a necessity when it comes to study the PE signals, but I don't particularly agree with some of the responses in the rebuttal letter ("Furthermore, conditioning paradigms generally only include one level of aversive outcome: the electrical stimulation is either delivered or omitted."). This is of course correct description for the conditioning paradigm, but the same can be said for an instructed design: the aversive outcome was either delivered or not. That being said, adopting the instructed design itself is legitimate in my opinion. 

      We thank the reviewer for this comment. We have now modified the phrasing of this argument to clarify our reasoning (see lines 102-104: “First, these only included one level of aversive outcome: the electrical stimulation was either delivered at a fixed intensity, or omitted; but the intensity of the stimulation was never experimentally manipulated within the same task.”).  

      The reason why we mentioned that “the aversive outcome is either delivered or omitted” is because in most contemporary conditioning paradigms only one level of aversive US is used. In these cases, it is therefore not possible to investigate the effect of US Intensity. In our paradigm, we included multiple levels of aversive US, allowing us to assess how the level of aversiveness influences threat omission responding. It is indeed true that each level was delivered or not. However, our data clearly (and robustly across experiments, see Willems & Vervliet, 2021) demonstrate that the effects of the instructed and perceived unpleasantness of the US (as operationalized by the mean reported US unpleasantness during the task) on the reported relief and the omission fMRI responses are stronger than the effect of instructed probability.  

      My main concern, which the authors spent quite some length in the rebuttal letter to address, still remains about the validity for different instructed probabilities. Although subjects were told that the trials were independent, the big difference between 75% and 25% would more than likely confuse the subjects, especially given that most of us would fall prey to the Gambler's fallacy (or the law of small numbers) to some degree. When the instruction and subjective experience collides, some form of inference or learning must have occurred, making the otherwise straightforward analysis more complex. Therefore, I believe that a more rigorous/quantitative learning modeling work can dramatically improve the validity of the results. Of course, I also realize how much extra work is needed to append the computational part but without it there is always a theoretical loophole in the current experimental design. 

      We agree with the reviewer that some learning may have occurred in our task. However, we believe the most important question in relation to our study is: to what extent did this learning influence our manipulations of interest?  

      In our reply to reviewer 1, we already showed that a re-analysis of the fMRI results using the trial-by-trial estimates of the omission contrasts revealed no Probability x Run interaction, suggesting that – overall – the probability effect remained stable over the course of the experiment. However, inspired by the alternative explanation that was proposed by this reviewer, we now also assessed the role of the Gambler’s fallacy in a separate set of analyses. Indeed, it is possible that participants start to expect a stimulation more after more time has passed since the last stimulation was experienced. To test this alternative hypothesis, we specified two new regressors that calculated for each trial of each participant how many trials had passed since the last stimulation (or since the beginning of the experiment) either overall (across all trials of all probability types; hence called the overall-lag regressor) or per probability level (across trials of each probability type separately; hence called the lag-per-probability regressor). For both regressors a value of 0 indicates that the previous trial was either a stimulation trial or the start of experiment, a value of 1 means that the last stimulation trial was 2 trials ago, etc.  

      The results of these additional analyses are added in a supplemental note (see supplemental note 6), and referred to in the main text (see lines 231-236: “Likewise, a post-hoc trial-by-trial analysis of the omission-related fMRI activations confirmed that the Probability effect for the VTA/SN activations was stable over the course of the experiment (no Probability x Run interaction) and remained present when accounting for the Gambler’s fallacy (i.e., the possibility that participants start to expect a stimulation more when more time has passed since the last stimulation was experienced) (see supplemental note 6). Overall, these post-hoc analyses further confirm the PE-profile of omission-related VTA/SN responses”).

      Addition to supplemental material (pages 16-18)

      Supplemental Note 6: The effect of Run and the Gambler’s Fallacy 

      A question that was raised by the reviewers was whether omission-related responses could be influenced by dynamical learning or the Gambler’s Fallacy, which might have affected the effectiveness of the Probability manipulation.  

      Inspired by this question, we exploratorily assessed the role of the Gambler’s Fallacy and the effects of Run in a separate set of analyses. Indeed, it is possible that participants start to expect a stimulation more when more time has passed since the last stimulation was experienced. To test this alternative hypothesis, we specified two new regressors that calculated for each trial of each participant how many trials had passed since the last stimulation (or since the beginning of the experiment) either overall (across all trials of all probability types; hence called the overall-lag regressor) or per probability level (across trials of each probability type separately; hence called the lag-per-probability regressor). For both regressors a value of 0 indicates that the previous trial was either a stimulation trial or the start of experiment, a value of 1 means that the last stimulation trial was 2 trials ago, etc.  
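      The two lag regressors described above can be sketched in a few lines of Python. This is only an illustration of the stated definition (0 when the previous trial was a stimulation trial or the start of the experiment, 1 when the last stimulation was 2 trials ago, and so on); the function names `overall_lag` and `lag_per_probability` and the list-based trial representation are assumptions made for this sketch, not the code used for the reported analyses.

```python
from typing import Dict, Hashable, List, Sequence


def overall_lag(stimulated: Sequence[bool]) -> List[int]:
    """Number of trials since the last stimulation (or since the start),
    counted across all trials regardless of probability level."""
    lags: List[int] = []
    trials_since_stimulation = 0  # hypothetical running counter
    for was_stimulated in stimulated:
        # The lag describes the history *before* the current trial.
        lags.append(trials_since_stimulation)
        # A stimulation resets the counter; an omission increments it.
        trials_since_stimulation = 0 if was_stimulated else trials_since_stimulation + 1
    return lags


def lag_per_probability(
    stimulated: Sequence[bool], probability: Sequence[Hashable]
) -> List[int]:
    """Same lag definition, but counted separately within the trials
    of each instructed probability level."""
    counters: Dict[Hashable, int] = {}
    lags: List[int] = []
    for was_stimulated, level in zip(stimulated, probability):
        current = counters.get(level, 0)
        lags.append(current)
        counters[level] = 0 if was_stimulated else current + 1
    return lags


if __name__ == "__main__":
    # Toy sequence: only the second trial is a stimulation trial.
    stim = [False, True, False, False, False]
    prob = [25, 75, 25, 75, 25]
    print(overall_lag(stim))                # [0, 1, 0, 1, 2]
    print(lag_per_probability(stim, prob))  # [0, 0, 1, 0, 2]
```

      In the toy sequence, the last trial has an overall lag of 2 (the last stimulation was 3 trials ago) and also a lag of 2 within the 25% trials, because no 25% trial had been reinforced since the start of the experiment.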

      The new models including these regressors for each omission response type (i.e., omission-related activations for each ROI, relief, and omission-SCR) were specified as follows:   

      (1) For the overall lag:

      Omission response ~ Probability * Intensity * Run + US-unpleasantness + Overall-lag + (1|Subject).  

      (2) For the lag per probability level:

      Omission response ~ Probability * Intensity * Run + US-unpleasantness + Lag-per-probability : Probability + (1|Subject).  

      Where US-unpleasantness scores were mean-centered across participants; “*” represents main effects and interactions, and “:” represents an interaction (without main effect). Note that we only included an interaction for the lag-per-probability model to estimate separate lag-parameters for each probability level.  

      The results of these analyses are presented in the tables below. Overall, we found that adding these lag-regressors to the model did not alter our main results. That is: for the VTA/SN, relief and omission-SCR, the main effects of Probability and Intensity remained. Interestingly, the overall-lag-effect itself was significant for VTA/SN activations and omission SCR, indicating that VTA/SN activations were larger when more time had passed since the last stimulation (beta = 0.19), whereas SCR were smaller when more time had passed (beta = -0.03). This pattern is reminiscent of the Perruchet effect, namely that the explicit expectancy of a US increases over a run of non-reinforced trials (in line with the gambler’s fallacy effect) whereas the conditioned physiological response to the conditional stimulus declines (in line with an extinction effect, Perruchet, 1985; McAndrew, Jones, McLaren, & McLaren, 2012). Thus, the observed dissociation between the VTA/SN activations and omission SCR might similarly point to two distinctive processes where VTA/SN activations are more dependent on a consciously controlled process that is subjected to the gambler’s fallacy, whereas the strength of the omission SCR responses is more dependent on an automatic associative process that is subjected to extinction. Importantly, however, even though the temporal distance to the last stimulation had these opposing effects on VTA/SN activations and omission SCRs, the main effects of the probability manipulation remained significant for both outcome variables. This means that the core results of our study still hold.   

      Next to the overall-lag effect, the lag-per-probability regressor was only significant for the vmPFC. A follow-up of the beta estimates of the lag-per-probability regressors for each probability level revealed that vmPFC activations increased with increasing temporal distance from the stimulation, but only for the 50% trials (beta = 0.47, t = 2.75, p < .01), and not the 25% (beta = 0.25, t = 1.49, p = .14) or the 75% trials (beta = 0.28, t = 1.62, p = .10).

      Author response table 1.

      F-statistics and corresponding p-values from the overall lag model. (*) F-test and p-values were based on the model where outliers were rescored to 2SD from the mean. Note that when retaining the influential outliers for this model, the p-value of the probability effect was p = .06. For all other outcome variables, rescoring the outliers did not change the results. Significant effects are indicated in bold.

      Author response table 2.

      F-statistics and corresponding p-values from the lag per probability level model. (*) F-test and p-values were based on the model where outliers were rescored to 2SD from the mean. Note that when retaining the influential outliers for this model, the p-value of the Intensity x Run interaction was p = .05. For all other outcome variables, rescoring the outliers did not change the results. Significant effects are indicated in bold.

      As the authors mentioned in the rebuttal letter, "selecting participants only if their anticipatory SCR monotonically increased with each increase in instructed probability 0% < 25% < 50% < 75% < 100%, N = 11 participants", only ~1/3 of the subjects actually showed strong evidence for the validity of the instructions. This further raises the question of whether the instructed design, due to the interference of false instruction and the dynamic learning among trials, is solid enough to test the hypothesis.

      We agree with the reviewer that a monotonic increase in anticipatory SCR with increasing probability instructions would provide the strongest evidence that the manipulation worked. However, it is well known that SCR is a noisy measure, and so the chances to see this monotonic increase are rather small, even if the underlying threat anticipation increases monotonically. Furthermore, between-subject variation is substantial in physiological measures, and it is not uncommon to observe, e.g., differential fear conditioning in one measure, but not in another (Lonsdorf & Merz, 2017). It is therefore not so surprising that ‘only’ 1/3 of our participants showed the perfect pattern of monotonically increasing SCR with increasing probability instructions. That being said, it is also important to note that not all participants were considered for these follow-up analyses because valid SCR data was not always available.

      Specifically, N = 4 participants were identified as anticipation non-responders (i.e. participants with a smaller average SCR to the clock on 100% than on 0% trials; pre-registered criterion) and were excluded from the SCR-related analyses, and N = 1 participant had missing data due to technical difficulties. This means that only 26 (and not 31) participants were considered for the post hoc analyses. Taking this information into account, 21 out of 26 participants (approximately 80%) showed stronger anticipatory SCR following 75% instructions compared to 25% instructions, and 11 out of 26 participants (approximately 40%) even showed the monotonic increase in their anticipatory SCR (see supplemental figure 4). Furthermore, although anticipatory SCR gradually decreased over the course of the experiment, there was no Run x Probability interaction, indicating that the effect of the instructions remained stable throughout the task (see supplemental figure 3).

      Reviewer #2 (Recommendations For The Authors):

      A more operational approach might be to break the trials into different sections along the timeline and examine how much the results might have been affected across time. I expect the manipulation checks would hold for the first one or two runs and the authors then would have good reasons to focus on the behavioral and imaging results for those runs. 

      This recommendation resembles the recommendation by reviewer 1. In our reply to reviewer 1, we showed the results of a re-analysis of the fMRI data using the trial-by-trial estimates of the omission contrasts, which revealed no Probability x Run interaction, suggesting that – overall – the probability effect remained (more or less) stable over the course of the experiment. For a more in-depth discussion of the results of this additional analysis, we refer to our answer to reviewer 1.

      Reviewer #3 (Public Review): 

      Comments on revised version: 

      The authors were extremely responsive to the comments and provided a comprehensive rebuttal letter with a lot of detail to address the comments. The authors clarified their methodology, and rationale for their task design, which required some more explanation (at least for me) to understand. Some of the design elements were not clear to me in the original paper. 

      The initial framing for their study is still in the domain of learning. The paper starts off with a description of extinction as the prime example of when threat is omitted. This could lead a reader to think the paper would speak to the role of prediction errors in extinction learning processes. But this is not their goal, as they emphasize repeatedly in their rebuttal letter. The revision also now details how using a conditioning/extinction framework doesn't suit their experimental needs. 

      We thank the reviewer for pointing out this potential cause of confusion. We have now rewritten the starting paragraph of the introduction to more closely focus on prediction errors, and only discuss fear extinction as a potential paradigm that has been used to study the role of threat omission PE for fear extinction learning (see lines 40-55). We hope that these adaptations are sufficient to prevent any false expectations. However, as we have mentioned in our previous response letter, not talking about fear extinction at all would also not make sense in our opinion, since most of the knowledge we have gained about threat omission prediction errors to date is based on studies that employed these paradigms.  

      Adaptation in the revised manuscript (lines 40-55):  

      “We experience pleasurable relief when an expected threat stays away1. This relief indicates that the outcome we experienced (“nothing”) was better than we expected it to be (“threat”). Such a mismatch between expectation and outcome is generally regarded as the trigger for new learning, and is typically formalized as the prediction error (PE) that determines how much there can be learned in any given situation2. Over the last two decades, the PE elicited by the absence of expected threat (threat omission PE) has received increasing scientific interest, because it is thought to play a central role in learning of safety. Impaired safety learning is one of the core features of clinical anxiety4. A better understanding of how the threat omission PE is processed in the brain may therefore be key to optimizing therapeutic efforts to boost safety learning. Yet, despite its theoretical and clinical importance, research on how the threat omission PE is computed in the brain is only emerging.  

      To date, the threat omission PE has mainly been studied using fear extinction paradigms that mimic safety learning by repeatedly confronting a human or animal with a threat predicting cue (conditional stimulus, CS; e.g. a tone) in the absence of a previously associated aversive event (unconditional stimulus, US; e.g., an electrical stimulation). These (primarily non-human) studies have revealed that there are striking similarities between the PE elicited by unexpected threat omission and the PE elicited by unexpected reward.”

      It is reasonable to develop a new task to answer their experimental questions. By no means is there a requirement to use a conditioning/extinction paradigm to address their questions. As they say, "it is not necessary to adopt a learning paradigm to study omission responses", which I agree with.  But the authors seem to want to have it both ways: they frame their paper around how important prediction errors are to extinction processes, but then go out of their way to say how they can't test their hypotheses with a learning paradigm.

      Part of their argument that they needed to develop their own task "outside of a learning context" goes as follows: 

      (1) "...conditioning paradigms generally only include one level of aversive outcome: the electrical stimulation is either delivered or omitted. As a result, the magnitude-related axiom cannot be tested." 

      (2) "....in conditioning tasks people generally learn fast, rendering relatively few trials on which the prediction is violated. As a result, there is generally little intra-individual variability in the PE responses" 

      (3) "...because of the relatively low signal to noise ratio in fMRI measures, fear extinction studies often pool across trials to compare omission-related activity between early and late extinction, which further reduces the necessary variability to properly evaluate the probability axiom" 

      These points seem to hinge on how tasks are "generally" constructed. However, there are many adaptations to learning tasks:

      (1) There is no rule that conditioning can't include different levels of aversive outcomes following different cues. In fact, their own design uses multiple cues that signal different intensities and probabilities. Saying that conditioning "generally only include one level of aversive outcome" is not an explanation for why "these paradigms are not tailored" for their research purposes. There are also several conditioning studies that have used different cues to signal different outcome probabilities. This is not uncommon, and in fact is what they use in their study, only with an instruction rather than through learning through experience, per se.

      (2) Conditioning/extinction doesn't have to occur fast. Just because people "generally learn fast" doesn't mean this has to be the case. Experiments can be designed to make learning more challenging or take longer (e.g., partial reinforcement). And there can be intra-individual differences in conditioning and extinction, especially if some cues have a lower probability of predicting the US than others. Again, because most conditioning tasks are usually constructed in a fairly simplistic manner doesn't negate the utility of learning paradigms to address PE axioms.

      (3) Many studies have tracked trial-by-trial BOLD signal in learning studies (e.g., using parametric modulation). Again, just because other studies "often pool across trials" is not an explanation for these paradigms being ill-suited to study prediction errors. Indeed, most computational models used in fMRI are predicated on analyzing data at the trial level. 

      We thank the reviewer for these remarks. The “fear conditioning and extinction paradigms” that we were referring to in this paragraph were the ones that have been used to study threat omission PE responses in previous research (e.g., Raczka et al., 2011; Thiele et al., 2021; Lange et al., 2020; Esser et al., 2021; Papalini et al., 2021; Vervliet et al., 2017). These studies have mainly used differential/multiple-cue protocols where either one (or two) CS+ and one CS- are trained in an acquisition phase and extinguished in the next phase. Thus, in these paradigms: (1) only one level of aversive US is used; (2) as safety learning develops over the course of extinction, there are relatively few omission trials during which “large” threat omission PEs can be observed (e.g. from the 24 CS+ trials that were used during extinction in Esser et al., the steepest decreases in expectancy – and thus the largest PE – were found in the first 6 trials); and (3) there is never absolute certainty that the stimulation will no longer follow. Some of these studies have indeed estimated the threat omission PE during the extinction phase based on learning models, and have entered these estimates as parametric modulators to CS-offset regressors. This is very informative. However, the exact model that was used differed per study (e.g. Rescorla-Wagner in Raczka et al. and Thiele et al.; or a Rescorla-Wagner–Pearce-Hall hybrid model in Esser et al.). We wanted to analyze threat omission responses without commitment to a particular learning model. Thus, in order to examine how threat omission responses vary as a function of probability-related expectations, a paradigm that has multiple probability levels is recommended (e.g. Rutledge et al., 2010; Ojala et al., 2022).
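      To illustrate why large threat omission PEs are confined to the first extinction trials under a learning model, a minimal Rescorla-Wagner sketch is given below. It is a generic textbook formulation, not the model fitted in any of the cited studies; the learning rate of 0.3, the initial expectancy of 1.0, and the function name are hypothetical values chosen for illustration only.

```python
from typing import List


def rescorla_wagner_extinction(
    n_trials: int,
    learning_rate: float = 0.3,       # hypothetical value, for illustration
    initial_expectancy: float = 1.0,  # assumes full US expectancy after acquisition
) -> List[float]:
    """Prediction errors on successive unreinforced CS+ trials.

    On every extinction trial the outcome is 0 (US omitted), so the
    prediction error is PE = 0 - expectancy, and the expectancy is then
    updated by learning_rate * PE.
    """
    expectancy = initial_expectancy
    prediction_errors: List[float] = []
    for _ in range(n_trials):
        prediction_error = 0.0 - expectancy
        prediction_errors.append(prediction_error)
        expectancy += learning_rate * prediction_error
    return prediction_errors


if __name__ == "__main__":
    pes = rescorla_wagner_extinction(24)
    # Print the absolute PE on the first 8 trials, rounded for readability.
    print([round(abs(pe), 3) for pe in pes[:8]])
```

      Under these assumptions the absolute PE shrinks geometrically (by a factor of 1 - learning rate per trial), so only the first handful of omission trials carry a large PE, and how many do depends on each participant's learning rate.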

      The reviewer rightfully pointed out that conditioning paradigms (more generally) can be tailored to fit our purposes as well. Still, when doing so, the same adaptations as we outlined above need to be considered: i.e. include different levels of US intensity; different levels of probability; and conditions with full certainty about the US (non)occurrence. In our attempt to keep the experimental design as simple and straightforward as possible, we decided to rely on instructions for this purpose, rather than to train 3 (US levels) x 5 (reinforcement levels) = 15 different CSs. It is certainly possible to train multiple CSs of varying reinforcement rates (e.g. Grings et al., 1971; Ojala et al., 2022). However, given that US-expectation on each trial would primarily depend on the individual learning processes of the participants, using a conditioning task would make it more difficult to maintain experimental control over the level of US-expectation elicited by each CS. As a result, this would likely require more extensive training, and thus prolong the study procedure considerably. Furthermore, even though previous studies have trained different CSs for different reinforcement rates, most of these studies have only used one level of US. Thus, in order not to complicate our task too much, we decided to rely on instructions rather than to train CSs for multiple US levels (in addition to multiple reinforcement rates).

      We have tried to clarify our reasoning in the revised version of the manuscript (see introduction, lines 100-113):  

      “The previously discussed fear conditioning and extinction studies have been invaluable for clarifying the role of the threat omission PE within a learning context. However, these studies were not tailored to create the varying intensity and probability-related conditions that are required to systematically evaluate the threat omission PE in the light of the PE axioms. First, these only included one level of aversive outcome: the electrical stimulation was either delivered or omitted; but the intensity of the stimulation was never experimentally manipulated within the same task. As a result, the magnitude-related axiom could not be tested. Second, as safety learning progressively developed over the course of extinction learning, the most informative trials to evaluate the probability axiom (i.e. the trials with the largest PE) were restricted to the first few CS+ offsets of the extinction phase, and the exact number of these informative trials likely differed across participants as a result of individually varying learning rates. This limited the experimental control and necessary variability to systematically evaluate the probability axiom. Third, because CS-US contingencies changed over the course of the task (e.g. from acquisition to extinction), there was never complete certainty about whether the US would (not) follow. This precluded a direct comparison of fully predicted outcomes. Finally, within a learning context, it remains unclear whether brain responses to the threat omission are in fact responses to the violation of expectancy itself, or whether they are the result of subsequent expectancy updating.”

      Again, the authors are free to develop their own task design that they think is best suited to address their experimental questions. For instance, if they truly believe that omission-related responses should be studied independent of updating. The question I'm still left puzzling is why the paper is so strongly framed around extinction (the word appears several times in the main body of the paper), which is a learning process, and yet the authors go out of their way to say that they can only test their hypotheses outside of a learning paradigm. 

      As we have mentioned before, the reason why we refer to extinction studies is because most evidence on threat omission PE to date comes from fear extinction paradigms.  

      The authors did address other areas of concern, to varying extents. Some of these issues were somewhat glossed over in the rebuttal letter by noting them as limitations. For example, the issue with comparing 100% stimulation to 0% stimulation, when the shock contaminates the fMRI signal. This was noted as a limitation that should be addressed in future studies, bypassing the critical point. 

      It is unclear to us what the reviewer means by “bypassing the critical point”. We argued in the manuscript that the contrast we initially specified and preregistered to study axiom 3 (fully predicted outcomes elicit equivalent activation) could not be used for this purpose, as it was confounded by the delivery of the stimulation. Because 100% trials always included the stimulation and 0% trials never included the stimulation, there was no way to disentangle activations related to full predictability from activations related to the stimulation as such.

      Reviewer #3 (Recommendations For The Authors): 

      I'm not sure the new paragraph explaining why they can't use a learning task to test their hypotheses is very convincing, as I noted in my review. Again, it is not a problem to develop a new task to address their questions. They can justify why they want to use their task without describing (incorrectly in my opinion) that other tasks "generally" are constructed in a way that doesn't suit their needs. 

      For an overview of the changes we made in response to this recommendation, we refer to our reply to the public review.   

      We look forward to your reply and are happy to provide answers to any further questions or comments you may have.

    1. Author response:

      The following is the authors’ response to the previous reviews.

      Reviewer #3:

      Comments on current version:

      As mentioned in my first review, this work is significantly underpowered for the following reasons: 1) n=4 for each treatment group.; 2) no randomization of the surgical sites receiving treatments; 3) implants surgically inserted without precision/guided surgery. The authors have not addressed these concerns.

      On a minor note: not sure why the authors present a methodology to evaluate the dynamic bone formation (line 272) but do not present results (i.e. by means of histomorphometrical analyses) utilizing this methodology.

      We sincerely appreciate your thorough review and valuable feedback. We have carefully considered your comments and would like to address them as follows:

      As mentioned in my first review, this work is significantly underpowered for the following reasons:

      (1) n=4 for each treatment group;

      We acknowledge your concern regarding the limited sample size (n=4 per group). While we understand this may affect statistical power, our choice was influenced by ethical considerations in animal experimentation and resource constraints. Increasing the sample size would undoubtedly strengthen the statistical power of our study. However, the logistical and ethical constraints associated with using a larger number of animals in such invasive procedures were significant limiting factors. Specifically, increasing the number of medium to large experimental animals could raise ethical issues, so we used the minimum number possible. Additionally, our study design was reviewed and approved by the animal IRB, which dictated the minimum number of animals we could use. Nevertheless, we conducted a power analysis to ensure that our sample size, although limited, was sufficient to detect significant differences given the high variability typically observed in biological responses. The results obtained from our n=4 samples showed consistent trends and significant differences between groups, indicating the robustness of our findings. We will include this point in the limitations section of the discussion. Thank you.

      (2) no randomization of the surgical sites receiving treatments;

      Thank you for pointing out this issue. We agree that randomization is essential when considering individual differences and the anatomical variations of the jawbone, such as those found in humans. However, this study is an animal experiment where other conditions were controlled, and the interventions were applied after complete bone healing following tooth extraction. Therefore, the impact of randomization of surgical sites was likely minimal, and it is challenging to determine whether it significantly influenced the experimental results. Note that the twelve female OVX beagles were randomly assigned to three groups (Methods section, line 298). However, regarding your concern, we would like to present the robustness of histological results from different surgical sites as shown below. We will also include this point in the limitations section of the discussion.

      Histologic analysis of the different surgical sites showed significant differences in bone formation and osseointegration among the three treatment groups: vehicle control, rhPTH(1-34), and dimeric Cys25PTH(1-34). Goldner trichrome staining (Figure A-C) showed enhanced bone formation in both the rhPTH(1-34) and dimeric Cys25PTH(1-34) groups compared to the vehicle control group. The rhPTH(1-34) group showed the most pronounced bone mass gain around the implant. Both treatment groups showed improved bone-to-implant contact compared to the control group, as indicated by the red arrows.

      Masson trichrome staining (Figure D-F) further confirmed these results, showing an increase in bone matrix (blue staining) in the rhPTH(1-34) and dimeric Cys25PTH(1-34) groups, with the dimeric rhPTH(1-34) group showing the most extensive and dense bone formation.

      TRAP staining (Figure G-I and G'-I') was used to assess osteoclast activity. Interestingly, both the rhPTH(1-34) and dimeric Cys25PTH(1-34) groups showed an increase in TRAP-positive cells compared to the vehicle control, suggesting enhanced bone remodeling activity. The rhPTH(1-34) group showed both the highest number of TRAP-positive cells and the highest trabecular number, indicating the most active bone remodeling.

      To summarize the results, histological analyses revealed that both rhPTH(1-34) and dimeric Cys25PTH(1-34) treatments significantly enhanced osseointegration and bone formation around titanium implants in a postmenopausal osteoporosis model compared to the control. The rhPTH(1-34) group demonstrated superior outcomes, exhibiting the most substantial increase in bone volume, bone-to-implant contact, and osteoclastic activity, indicating its greater efficacy in promoting bone regeneration and implant integration in this experimental context.

      Author response image 1.

      Histological analysis using Goldner trichrome, Masson trichrome, and TRAP staining

      (3) implants surgically inserted without precision/guided surgery. The authors have not addressed these concerns.

      The primary purpose of precision guides is to prevent damage to various anatomical structures and to ensure accurate placement at the desired location. Even disregarding the potential inaccuracies of precision guides in actual clinical settings, the primary goal of this animal experiment was not to achieve perfect placement or to prevent damage to anatomical structures. Instead, the objective was to histologically measure the integrity of the bone surrounding the titanium fixture's platform after pharmacological intervention, ensuring that the fixture was fully seated in the alveolar bone. To this end, we secured sufficient visibility through periosteal dissection to confirm the placement of the implant and adhered to the principle of maintaining sufficient mesiodistal distance between fixtures. Using precision guides in this animal experiment, which is not an evaluation of 'implant precision guides,' could potentially introduce inaccuracies and contradict the experimental objectives. Furthermore, since this experiment was conducted on an edentulous ridge where all teeth had been extracted, achieving the same placement as in the presurgical simulation would be impossible, even with the use of precision guides. Thank you once again for your constructive feedback. We will include this point in the limitations section of the discussion.

      On a minor note: not sure why the authors present a methodology to evaluate the dynamic bone formation (line 272) but do not present results (i.e. by means of histomorphometrical analyses) utilizing this methodology.

      As the reviewer mentioned, we confirmed that the sentence was included in the Methods section despite the analysis not actually being performed. We sincerely apologize for this oversight and will make the necessary corrections immediately. Thank you very much for your keen observation.

    1. Author response:

      The following is the authors’ response to the previous reviews.

      Reviewer #1 (Public Review):

      Summary:

      Tian et al. describe how TIPE regulates melanoma progression, stemness, and glycolysis. The authors link high TIPE expression to increased melanoma cell proliferation and tumor growth. TIPE causes dimerization of PKM2, as well as translocation of PKM2 to the nucleus, thereby activating HIF-1alpha. TIPE promotes the phosphorylation of S37 on PKM2 in an ERK-dependent manner. TIPE is shown to increase stem-like phenotype markers. The expression of TIPE is positively correlated with the levels of PKM2 Ser37 phosphorylation in murine and clinical tissue samples. Taken together, the authors demonstrate how TIPE impacts melanoma progression, stemness, and glycolysis through dimeric PKM2 and HIF-1alpha crosstalk.

      Strengths:

      The authors manipulated TIPE expression using both shRNA and overexpression approaches throughout the manuscript. Using these models, they provide strong evidence of the involvement of TIPE in mediating PKM2 Ser37 phosphorylation and dimerization. The authors also used mutants of PKM2 at S37A to block its interaction with TIPE and HIF-1alpha. In addition, an ERK inhibitor (U0126) was used to block the phosphorylation of Ser37 on PKM2. The authors show how dimerization of PKM2 by TIPE causes nuclear import of PKM2 and activation of HIF-1alpha and target genes. Pyridoxine was used to induce PKM2 dimer formation, while TEPP-46 was used to suppress PKM2 dimer formation. TIPE maintains stem cell phenotypes by increasing the expression of stem-like markers. Furthermore, the relationship between TIPE and Ser37 PKM2 was demonstrated in murine and clinical tissue samples.

      Weaknesses:

      The evaluation of how TIPE causes metabolic reprogramming can be better assessed using isotope tracing experiments and improved bioenergetic analysis.

      Thank you immensely for your invaluable suggestions. Regrettably, we encountered a significant obstacle in completing the isotope tracing experiments due to an unfortunate shortage of necessary instruments. Furthermore, despite our efforts to consult with several companies, we were unable to secure their assistance, which unfortunately hindered the completion of these experiments. We deeply apologize for this imperfection in our experimental design and have thoroughly discussed this limitation in our manuscript.

      Additionally, we acknowledge our oversight in the previous versions of our manuscripts, where only three metabolites were presented. To rectify this and provide a more comprehensive understanding of the metabolic reprogramming induced by TIPE, we have conducted routine untargeted metabolomics analysis. We are pleased to announce that we have incorporated the detailed results of this analysis into our work as a new supplementary figure, designated as Figure S3. This figure specifically highlights the notable decrease in the glycolysis pathway, particularly in pyruvate and lactic acid levels, following TIPE interference.

      Reviewer #2 (Public Review):

      In this article, Tian et al present a convincing analysis of the molecular mechanisms underpinning TIPE-mediated regulation of glycolysis and tumor growth in melanoma. The authors begin by confirming TIPE expression in melanoma cell lines and identify "high" and "low" expressing models for functional analysis. They show that TIPE depletion slows tumour growth in vivo, and using both knockdown and over-expression approaches, show that this is associated with changes in glycolysis in vitro. Compelling data using multiple independent approaches is presented to support an interaction between TIPE and the glycolysis regulator PKM2, and the over-expression of TIPE-promoted nuclear translocation of PKM2 dimers. Mechanistically, the authors also demonstrate that PKM2 is required for TIPE-mediated activation of HIF1a transcriptional activity, as assessed using an HRE-promoter reporter assay, and that TIPE-mediated PKM2 dimerization is p-ERK dependent. Finally, the dependence of TIPE activity on PKM2 dimerization was demonstrated on tumor growth in vivo and in the regulation of glycolysis in vitro, and ectopic expression of HIF1a could rescue the inhibition of PKM2 dimerization in TIPE overexpressing cells and reduced induction of general cancer stem cell markers, showing a clear role for HIF1a in this pathway. The main conclusions of this paper are well supported by data, but some aspects of the experiments need clarification and some data panels are difficult to read and interpret as currently presented.

      The detailed mechanistic analysis of TIPE-mediated regulation of PKM2 to control aerobic glycolysis and tumor growth is a major strength of the study and provides new insights into the molecular mechanisms that underpin the Warburg effect in cancer cells. However, despite these strengths, some weaknesses were noted, which if addressed will further strengthen the study.

      (1) The analysis of patient samples should be expanded to more directly measure the relationship between TIPE levels and melanoma patient outcome and progression (primary vs metastasis), to build on the association between TIPE levels and proliferation (Ki67) and hypoxia gene sets that are currently shown.

      Thank you for your suggestions. We have expanded the analysis to include the relationship between TIPE levels and melanoma progression, specifically distinguishing between non-lymph node metastasis and lymph node metastasis. In addition, we added the association between TIPE and Ki67 or LDH levels, as you advised, as shown in Figure 7.

      However, the relationship between TIPE levels and melanoma patient outcome is not presented in this article. One reason is that the tissue microarray lacks survival data. Interestingly, the TCGA dataset showed that higher TIPE expression is associated with a favorable prognosis in melanoma, which we also find intriguing. Our follow-up study indicates that TIPE might serve as a positive regulator of PD-L1. Higher TIPE expression may therefore confer greater sensitivity to immunotherapy, resulting in a favorable prognosis in melanoma. The detailed mechanisms will be discussed in a subsequent article, and we hope this will become a continuing research topic for TIPE in melanoma.

      Here we disclose only that TIPE shares a similar survival and immune signature with PD-L1 and PD-1 in melanoma, as follows:

      Author response image 1.

      (2) The duration of the in vivo experiments was not clearly defined in the figures, however, it was clear from the tumor volume measurements that they ended well before standard ethical endpoints in some of the experiments. A rationale for this should be provided because longer-duration experiments might significantly change the interpretation of the data. For example, does TIPE depletion transiently reduce or lead to sustained reductions in tumor growth?

      Thank you for your suggestions. We performed a pilot experiment before the formal experiments, and all time points were chosen on that basis. Furthermore, we have added the detailed time points to the figure legends, as you suggested.

      (3) The analysis of general cancer stem cell markers is solid and interesting, however inclusion of neural crest stem cell markers that are more relevant to melanoma biology would greatly strengthen this aspect of the study.

      Thank you for your advice. We selected two neural crest stem cell markers, Nestin and Sox10, and tested their expression after overexpression of TIPE in G361 cells or TIPE interference in A375 cells.

      (4) The authors should take care that all data panels are clearly readable in the figures to facilitate appropriate interpretation by the reader.

      Thank you for your suggestions. We have amended the data panels according to your advice to ensure that they are clear and professionally presented.

      Reviewer #1 (Recommendations for the authors):

      It would be suggested to improve the image quality of certain panels (please refer to Fig.1A and Fig.S3B-D).

      Thank you for your expert advice. We have optimized the quality of certain panels according to your suggestions.

      Reviewer #2 (Recommendations for the authors):

      Major comments:

      - TCGA survival/patient outcome data relative to TIPE levels should be provided in the supplementary figures, together with TIPE correlation with PKM2.

      - Suggest revising how this point is described in the discussion.

      We have added the TCGA results on TIPE expression and prognosis in melanoma patients, as requested by the reviewer, and discussed them appropriately in the article. In addition, the correlation between TIPE and PKM2 expression has already been described in Supplementary Figure 6.

    1. Author response:

      The following is the authors’ response to the previous reviews.

      We would first like to thank the Editor and the three reviewers for their enthusiasm and for conducting another careful evaluation of our manuscript. We appreciate their thoughtful and constructive comments and suggestions. Some concerns regarding experimental design, data analysis, and over-interpretation of our findings remain unresolved after the initial revision. Here we have endeavored to address these remaining concerns by further refining our writing and by including these concerns in the discussion section. We hope our response better explains the rationale for our experimental design and data interpretation. We also acknowledge the limitations of the present study, so that it can benefit future investigations into this topic. Our detailed responses are provided below.

      Reviewer #1 (Public Review):

      This study examines whether the human brain uses a hexagonal grid-like representation to navigate in a non-spatial space constructed by competence and trustworthiness. To test this, the authors asked human participants to learn the levels of competence and trustworthiness for six faces by associating them with specific lengths of bar graphs that indicate their levels in each trait. After learning, participants were asked to extrapolate the location from the partially observed morphing bar graphs. Using fMRI, the authors identified brain areas where activity is modulated by the angles of morphing trajectories in six-fold symmetry. The strength of this paper lies in the question it attempts to address. Specifically, the question of whether and how the human brain uses grid-like representations not only for spatial navigation but also for navigating abstract concepts, such as social space, and guiding everyday decision-making. This question is of emerging importance.

      I acknowledge the authors' efforts to address the comments received. However, my concerns persist:

      Thank you very much again for the re-evaluation and comments. Please find our revision plans for each comment below.

      (1) The authors contend that shorter reaction times correlated with increased distances between individuals in social space imply that participants construct and utilize two-dimensional representations. This method is adapted from a previous study by Park et al. Yet, there is a fundamental distinction between the two studies. In the prior work, participants learned relationships between adjacent individuals, receiving feedback on their decisions, akin to learning spatial locations during navigation. This setup leads to two different predictions: If participants rely on memory to infer relationships, recalling more pairs would be necessary for distant individuals than for closer ones. Conversely, if participants can directly gauge distances using a cognitive map, they would estimate distances between far individuals as quickly as for closer ones. Consequently, as the authors suggest, reaction times ought to decrease with increasing decision value, which, in this context, corresponds to distances. However, the current study allowed participants to compare all possible pairs without restricting learning experiences, rendering the application of the same methodology for testing two-dimensional representations inappropriate. In this study, the results could be interpreted as participants not forming and utilizing two-dimensional representations.

      We apologize for not being clear enough about our task design. We have made relevant changes to the methodology section of the manuscript to make it clearer. The reviewer’s concern is that participants learned about all the pairs in the comparison task, which would make the distance effect invalid. We would like to clarify that during all the memory test tasks (the comparison task, the collect task, and the recall task outside and inside the scanner), participants never received feedback on whether their responses were correct. Therefore, the comparison task in our study is similar to that of the previous study by Park et al. (2021). Participants did not have access to the correct responses for all possible comparison pairs prior to or during this task, so they needed to make inferences based on memory retrieval.

      (2) The confounding of visual features with the value of social decision-making complicates the interpretation of this study's results. It remains unclear whether the observed grid-like effects are due to visual features or are genuinely indicative of value-based decision-making, as argued by the authors. Contrary to the authors' argument, this issue was not present in the previous study (Constantinescu et al.). In that study, participants associated specific stimuli with the identities of hidden items, but these stimuli were not linked to decision-making values (i.e., no image was considered superior to another). The current study's paradigm is more akin to that of Bao et al., which the authors mention in the context of RSA analysis. Indeed, Bao et al. controlled the length of the bars specifically to address the problem highlighted here. Regrettably, in the current paradigm, this conflation remains inseparable.

      We would like to thank the reviewer for facilitating the discussion of ‘social space’ vs. ‘sensory space’. The task in the scanner did not require value-based decision-making. It is akin to both the Bao et al. (2019) and Constantinescu et al. (2016) studies in the sense that all three tasks ask participants to imagine moving along a trajectory in an abstract, non-physical space, with the trajectory grounded in sensory cues. Participants were trained to associate the sensory cues with abstract (social/nonsocial) concepts. We think that the paradigm is a relatively faithful replication of the study by Constantinescu et al. Nonetheless, we agree that a design similar to that of Bao et al. (2019), which controls for sensory confounds, or a value-based decision-making task in the scanner similar to that of Park et al. (2021), would be better suited to address this concern, and we have included this limitation in the discussion section.

      (3) While the authors have responded to comments in the public review, my concerns noted in the Recommendation section remain unaddressed. As indicated in my recommendations, there are aspects of the authors' methodology and results that I find difficult to comprehend. Resolving these issues is imperative to facilitate an appropriate review in subsequent stages.

      Considering that the issues raised in the previous comments remain unresolved, I have retained my earlier comments below for review.

      We apologize for not addressing the recommendations properly. Please find our detailed responses and plans for revision below.

      I have some comments. I hope that these can help.

      (1) While the explanation of Fig.4A-C is lacking in both the main text and figure legend, I am not sure if I understand this finding correctly. Did the authors find the effects of hexagonal modulation in the medial temporal gyrus and lingual gyrus correlate with the individual differences in the extent to which their reaction times were associated with the distances between faces when choosing a better collaborator? If so, I am not sure what argument the authors try to draw from these findings. Do the authors argue that these brain areas show hexagonal modulation, which was not supported in the previous analysis (Fig.3)? What is the level of correlation between these behavioral measures and the grid consistency effects in the vmPFC and EC, where the authors found actual grid-like activity? How do the authors interpret this finding? More importantly, how does this finding associate with other findings and the argument of the study?

      We apologize for not being clear enough in the manuscript, and we will improve the clarity in our revision. The exploratory analysis reported in Figure 4 uses whole-brain analysis to examine: 1) whether there is any correlation between the strength of the grid-like representation of the social value map and behavioral indicators of map-like representation; and 2) whether there is any correlation between the strength of the grid-like representation of this social value map and participants’ social traits.

      To be more specific, for the behavioral indicator, we used the distance effect in the reaction times of the comparison task outside the scanner. We interpreted a stronger distance effect as a behavioral index of a better internal map-like representation, and a stronger grid consistency effect as a neural index of a better representation of the 2D social space. We therefore wanted to see whether the behavioral and neural indices of map-like representation are correlated.
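      One simple way to quantify a per-participant distance effect is the least-squares slope of reaction time on the Euclidean distance between the two compared avatars. The sketch below is illustrative only: the function names, the trial format, and the choice of an ordinary least-squares slope are assumptions made here, and the manuscript may define the index differently.

```python
import math


def euclidean(p, q):
    """Euclidean distance between two points in the 2D social space."""
    return math.hypot(p[0] - q[0], p[1] - q[1])


def distance_effect_slope(trials):
    """Least-squares slope of reaction time on pairwise distance.

    trials: list of (position_a, position_b, reaction_time) tuples
    (hypothetical format). A more negative slope means faster responses
    for avatar pairs that lie further apart, i.e. a stronger distance effect.
    """
    dists = [euclidean(a, b) for a, b, _ in trials]
    rts = [rt for _, _, rt in trials]
    mean_d = sum(dists) / len(dists)
    mean_rt = sum(rts) / len(rts)
    sxx = sum((d - mean_d) ** 2 for d in dists)
    if sxx == 0:
        raise ValueError("all pairs have the same distance; slope undefined")
    sxy = sum((d - mean_d) * (rt - mean_rt) for d, rt in zip(dists, rts))
    return sxy / sxx


if __name__ == "__main__":
    toy_trials = [((0, 0), (0, 1), 1.9), ((0, 0), (3, 4), 1.5), ((0, 0), (6, 8), 1.0)]
    print(distance_effect_slope(toy_trials))
```

      A slope of this kind, computed per participant, is the sort of value that could be entered as a second-level covariate.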

      To achieve this goal, behavioral indicators were entered as covariates in the second-level analysis of the GLM testing the grid consistency effect (GLM2). Figure 3 shows results from GLM2 without the covariates. Figure 4 shows clusters whose neural indices of map-like representation covaried with the behavioral indices and survived multiple-comparison correction. Indeed, in these regions, the grid consistency effect was not significant at the group level (and is therefore not shown in Figure 3). We tried to interpret this finding in our discussion (line 374-289 for temporal lobe correlation, line 395-404 for precuneus correlation).

      Finally, we would like to point out that including the covariates in GLM2 did not change the results in Figure 3; the clusters in Figure 3 still survive correction. Meanwhile, these clusters did not show a correlation with behavioral indicators of map-like representation.

      Author response image 1.

      (2) There are no behavioral results provided. How accurately did participants perform each of the tasks? How are the effects of grid consistency associated with the level of accuracy in the map test?

      Why did participants perform the recall task again outside the scanner?

      We will endeavor to improve the signposting of the corresponding figures in the main text. The behavioral statistics are reported in the section “Participants construct social value map after associative learning of avatars and corresponding characteristics” in the main text, and the plots are shown in Figure 1. In particular, Figure 1F shows accuracy in the training tasks as well as in the recall task in the scanner. We did not find a significant correlation between behavioral accuracy and the grid consistency effect, and we will make this clearer in the results section.

      (3) The methods did not explain how the grid orientation was estimated and what the regressors were in GLM2. I don't think equations 2 and 3 are quite right.

      For the grid orientation estimation method, we provide a detailed description in Supplementary methods 2.2.2. We will add links to this section in the main text.

      Equations 2 and 3 describe how the parametric regressors entered into GLM2 were formed and provide the prerequisites for calculating grid orientations. Equation 2 results from directly applying the angle addition and subtraction theorems, so it should be correct. We will try to make the rationale clearer in the supplementary text.
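      For readers unfamiliar with this step, the sketch below shows the standard form of the calculation used in hexagonal-modulation analyses. It is a generic illustration rather than a reproduction of Equations 2 and 3, and the function names are assumptions made here. The angle subtraction identity cos(6(θ − φ)) = cos(6θ)cos(6φ) + sin(6θ)sin(6φ) lets a signal with unknown grid orientation φ be fitted with two regressors, cos(6θ) and sin(6θ), and φ is then recovered from the two fitted weights.

```python
import math


def sixfold_regressors(theta_deg):
    """Return the pair (cos 6*theta, sin 6*theta) for a trajectory angle in degrees."""
    t = math.radians(theta_deg)
    return math.cos(6 * t), math.sin(6 * t)


def grid_orientation(beta_cos, beta_sin):
    """Estimate the grid orientation in degrees, in [0, 60), from the two weights."""
    phi = math.degrees(math.atan2(beta_sin, beta_cos)) / 6.0
    return phi % 60.0


def alignment_regressor(theta_deg, phi_deg):
    """cos(6*(theta - phi)): +1 for aligned trajectories, -1 for misaligned ones."""
    return math.cos(6 * math.radians(theta_deg - phi_deg))


if __name__ == "__main__":
    true_phi = 20.0
    # Weights that a noiseless signal with unit amplitude would produce
    b_cos = math.cos(math.radians(6 * true_phi))
    b_sin = math.sin(math.radians(6 * true_phi))
    print(grid_orientation(b_cos, b_sin))
    print(alignment_regressor(50.0, true_phi))
```

      Because the six-fold pattern repeats every 60 degrees, the orientation is only defined modulo 60, which is why the estimate is wrapped into that range.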

      (4) With the increase in navigation distances, more grid cells would activate. Therefore, in theory, the activity in the entorhinal cortex should increase with the Euclidean distances, which has not been found here. I wonder if there was enough variability in the Euclidean distances that can be captured by neural correlates. This would require including the distributions of Euclidean distances according to their trajectory angles. Regarding how Fig.1E is generated, I don't understand what this heat map indicates. Additionally, it needs to be confirmed if the grid effects remain while controlling for the Euclidean distances of navigation trajectories.

      We did not specifically control for trajectory length; we only ensured that the distribution of trajectory directions was uniform. We have included the distribution of Euclidean distances in Figure S9 and the distribution of trajectory directions in Figure S8.

      Author response image 2.

      As for Figure 1E, we aimed to reproduce the finding in Figure 1F of Constantinescu et al. (2016), which showed that participants progressively refined the locations of the outcomes through training. We divided the space into 15×15 subregions, computed the amount of time spent in each subregion, and plotted the result as Figure 1E. Brighter colors in Figure 1E indicate more time spent in the corresponding subregion. Note that all these timing indices were computed as a percentage of the total time spent in the explore task in a given session. If participants were well acquainted with the space and the avatars, they would spend more time at the avatars (brighter colors at avatar locations) in the review session than in the learning session.

      As for the effect of distance on the grid-like representation, we did not include distance as a parametric modulator in the grid consistency GLM (GLM2) because there were too few trials in each bin (6-8 trials). There is, however, indirect evidence that could rule out this confound: in the distance representation analysis, we did not find distance representation in any of the clusters that show significant grid-like representation (regions in Figure 2).
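      The occupancy computation described above can be sketched as follows. This is a minimal illustration under assumptions made here: the function name, the sample format (position plus dwell time), and the unit-square coordinate range are hypothetical, and only the 15×15 binning and the percentage-of-total-time normalization come from the description.

```python
def occupancy_map(samples, n_bins=15, lo=0.0, hi=1.0):
    """Percentage of total time spent in each cell of an n_bins x n_bins grid.

    samples: iterable of (x, y, dwell_seconds) tuples (hypothetical format),
    with x and y assumed to lie in [lo, hi].
    Returns a list of rows; grid[row][col] is the share of time in percent.
    """
    grid = [[0.0] * n_bins for _ in range(n_bins)]
    width = (hi - lo) / n_bins
    total = 0.0
    for x, y, dwell in samples:
        # Clamp so that points on the upper or lower edge fall in a valid cell
        col = max(0, min(int((x - lo) / width), n_bins - 1))
        row = max(0, min(int((y - lo) / width), n_bins - 1))
        grid[row][col] += dwell
        total += dwell
    if total == 0:
        return grid
    return [[100.0 * cell / total for cell in row] for row in grid]


if __name__ == "__main__":
    toy = [(0.05, 0.05, 3.0), (0.05, 0.05, 1.0), (0.99, 0.5, 4.0)]
    heat = occupancy_map(toy)
    print(heat[0][0], heat[7][14])
```

      Normalizing by the session total makes maps from the learning and review sessions comparable even when the sessions differ in length.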

      Reviewer #2 (Public Review):

      Summary:

      In this work, Liang et al. investigate whether an abstract social space is neurally represented by a grid-like code. They trained participants to 'navigate' around a two-dimensional space of social agents characterized by the traits warmth and competence, then measured neural activity as participants imagined navigating through this space. The primary neural analysis consisted of three procedures: 1) identifying brain regions exhibiting the hexagonal modulation characteristic of a grid-like code, 2) estimating the orientation of each region's grid, and 3) testing whether the strength of the univariate neural signal increases when a participant is navigating in a direction aligned with the grid, compared to a direction that is misaligned with the grid. From these analyses, the authors find the clearest evidence of a grid-like code in the prefrontal cortex and weaker evidence in the entorhinal cortex.

      Strengths:

      The work demonstrates the existence of a grid-like neural code for a socially-relevant task, providing evidence that such coding schemes may be relevant for a variety of two-dimensional task spaces.

      Weaknesses:

      In the revised manuscript, the authors soften their claims about finding a grid code in the entorhinal cortex and provide additional caveats about limitations in their findings. It seems that the authors and reviewers are in agreement about the following weaknesses, which were part of my original review: Claims about a grid code in the entorhinal cortex are not well-supported by the analyses presented. The whole-brain analysis does not suggest that the entorhinal cortex exhibits hexagonal modulation; the strength of the entorhinal BOLD signal does not track the putative alignment of the grid code there; multivariate analyses do not reveal any evidence of a grid-like representational geometry.

      In the authors' response to reviews, they provide additional clarification about their exploratory analyses examining whether behavior (i.e., reaction times) and individual difference measures (i.e., social anxiety and avoidance) can be predicted by the hexagonal modulation strength in some region X, conditional on region X having a similar estimated grid alignment with some other region Y. My guess is that readers would find it useful if some of this language were included in the main text, especially with regard to an explanation regarding the rationale for these exploratory studies.

      Thank you very much again for your careful re-evaluation and suggestions. We have tried to improve our writing and incorporate the suggestions in the new revision.

      Reviewer #3 (Public Review):

      Liang and colleagues set out to test whether the human brain uses distance and grid-like codes in social knowledge using a design where participants had to navigate in a two-dimensional social space based on competence and warmth during an fMRI scan. They showed that participants were able to navigate the social space and found distance-based codes as well as grid-like codes in various brain regions, and the grid-like code correlated with behavior (reaction times).

      On the whole, the experiment is designed appropriately for testing for distance-based and grid-like codes, and is relatively well powered for this type of study, with a large amount of behavioral training per participant. They revealed that a number of brain regions correlated positively or negatively with distance in the social space, and found grid-like codes in the frontal polar cortex and posterior medial entorhinal cortex, the latter in line with prior findings on grid-like activity in entorhinal cortex. The current paper seems quite similar conceptually and in design to previous work, most notably Park et al., 2021, Nature Neuroscience.

      (1) The authors claim that this study provides evidence that humans use a spatial / grid code for abstract knowledge like social knowledge.

      These data do not specifically add anything new to this argument. As with almost all studies that test for a grid code in a similar "conceptual" space (not only the current study), the problem is that, when the space is not a uniform, square/circular, 2-dimensional space, there is no reason the code will be perfectly grid-like, i.e., show six-fold symmetry. In real-world scenarios, social space (as well as navigation and semantic concepts) must be higher dimensional, or at least more than two dimensional. It is unclear whether this generalizes to larger spaces where not all parts of the space are relevant. Modelling work from Tim Behrens' lab (e.g., Whittington et al., 2020) and Bradley Love's lab (e.g., Mok & Love, 2019) has shown/argued this to be the case. In experimental work, such as in mazes from the Mosers' labs (e.g., Derdikman et al., 2009) or trapezoid environments from the O'Keefe lab (Krupic et al., 2015), there are distortions in mEC cells, which would not pass as grid cells in terms of the six-fold symmetry criterion.

      The authors briefly discuss the limitations of this at the very end but do not really say how this speaks to the goal of their study and the claim that social space or knowledge is organized as a grid code and if it is in fact used in the brain in their study and beyond. This issue deserves to be discussed in more depth, possibly referring to prior work that addressed this, and raise the issue for future work to address the problem - or if the authors think it is a problem at all.

      Thank you very much again for your careful re-evaluation and comments. We have tried to incorporate some of the suggested papers into our discussion. In summary, we agree that codes other than a six-fold symmetric code can be used to represent “conceptual space”. We think that the next step toward a stronger claim would be to find the representation of more spontaneous non-spatial maps.

      References

      Bao, X., Gjorgieva, E., Shanahan, L. K., Howard, J. D., Kahnt, T., & Gottfried, J. A. (2019). Grid-like Neural Representations Support Olfactory Navigation of a Two-Dimensional Odor Space. Neuron, 102(5), 1066-1075 e1065. https://doi.org/10.1016/j.neuron.2019.03.034

      Constantinescu, A. O., O'Reilly, J. X., & Behrens, T. E. J. (2016). Organizing conceptual knowledge in humans with a gridlike code. Science, 352(6292), 1464-1468. https://doi.org/10.1126/science.aaf0941

      Park, S. A., Miller, D. S., & Boorman, E. D. (2021). Inferences on a multidimensional social hierarchy use a grid-like code. Nat Neurosci, 24(9), 1292-1301. https://doi.org/10.1038/s41593-021-00916-3

    1. Author response:

      The following is the authors’ response to the previous reviews.

      Reviewer #1 (Public Review):

      Summary:

      This paper presents a compelling and comprehensive study of decision-making under uncertainty. It addresses a fundamental distinction between belief-based (cognitive neuroscience) formulations of choice behavior and reward-based (behavioral psychology) accounts. Specifically, it asks whether active inference provides a better account of planning and decision making, relative to reinforcement learning. To do this, the authors use a simple but elegant paradigm that includes choices about whether to seek both information and rewards. They then assess the evidence for active inference and reinforcement learning models of choice behavior, respectively. After demonstrating that active inference provides a better explanation of behavioral responses, the neuronal correlates of epistemic and instrumental value (under an optimized active inference model) are characterized using EEG. Significant neuronal correlates of both kinds of value were found in sensor and source space. The source space correlates are then discussed sensibly, in relation to the existing literature on the functional anatomy of perceptual and instrumental decision-making under uncertainty.

      We are deeply grateful for your careful review of our work and your suggestions. Your insights have helped us identify areas where we can strengthen the arguments and clarify the methodology. We hope to apply the idea of active inference to our future work, emphasizing the integrity of perception and action.

      Reviewer #1 (Recommendations For The Authors):

      Many thanks for attending to my previous suggestions. I think your presentation is now much clearer and nicely aligned with the active inference literature.

      There is one outstanding issue. I think you have overinterpreted the two components of epistemic value in Equation 8. The two components that you have called the value of reducing risk and the value of reducing ambiguity are not consistent with the normal interpretation. These two components are KL divergences that measure the expected information gain about parameters and states respectively.

      If you read the Schwartenbeck et al paper carefully, you will see that the first (expected information gain about parameters) is usually called novelty, while the second (expected information gain about states) is usually called salience.

      This means you can replace "the value of reducing ambiguity" with "novelty" and "the value of reducing risk" with "salience".

      For your interest, "risk" and "ambiguity" are alternative ways of decomposing expected free energy. In other words, you can decompose expected free energy into (negative) expected information gain and expected value (as you have done). Alternatively, you can rearrange the terms and express expected free energy as risk and ambiguity. Look at the top panel of Figure 4 in:

      https://www.sciencedirect.com/science/article/pii/S0022249620300857
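
      The two decompositions mentioned above can be sketched as follows. This is a minimal illustration under stated assumptions, not the manuscript's Equation 8: it covers hidden states only (the parameter term, i.e. novelty, is omitted), a single future time step, a predictive distribution Q(o, s | π) = P(o | s) Q(s | π), and prior preferences P(o). The notation is chosen here for illustration.

```latex
\begin{aligned}
G(\pi)
&= \mathbb{E}_{Q(o,s\mid\pi)}\big[\ln Q(s\mid\pi) - \ln Q(s\mid o,\pi)\big]
   - \mathbb{E}_{Q(o\mid\pi)}\big[\ln P(o)\big] \\
&= -\underbrace{\mathbb{E}_{Q(o\mid\pi)}\Big[D_{\mathrm{KL}}\big[Q(s\mid o,\pi)\,\|\,Q(s\mid\pi)\big]\Big]}_{\text{expected information gain (salience)}}
   \;-\; \underbrace{\mathbb{E}_{Q(o\mid\pi)}\big[\ln P(o)\big]}_{\text{expected value}} \\
&= \underbrace{D_{\mathrm{KL}}\big[Q(o\mid\pi)\,\|\,P(o)\big]}_{\text{risk}}
   \;+\; \underbrace{\mathbb{E}_{Q(s\mid\pi)}\Big[\mathrm{H}\big[P(o\mid s)\big]\Big]}_{\text{ambiguity}}
\end{aligned}
```

      The last line follows from Bayes' rule under Q, which gives ln Q(s | π) - ln Q(s | o, π) = ln Q(o | π) - ln P(o | s), so the same quantity is simply regrouped into different terms.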

      I hope that this helps.

      We deeply thank you for your recommendations about the interpretation of the epistemic value in Equation 8. We have now corrected them to Novelty and Salience:

      In addition, in order to avoid terminology conflicts with active inference and to describe these two different uncertainties, we replaced Ambiguity in the article with Novelty, referring to the uncertainty that can be reduced by sampling, and replaced Risk with Variability, referring to the uncertainty inherent in the environment (variance).

      Reviewer # 2 (Public Review):

      Summary:

      Zhang and colleagues use a combination of behavioral, neural, and computational analyses to test an active inference model of exploration in a novel reinforcement learning task.

      Strengths:

      The paper addresses an important question (validation of active inference models of exploration). The combination of behavior, neuroimaging, and modeling is potentially powerful for answering this question.

      I appreciate the addition of details about model fitting, comparison, and recovery, as well as the change in some of the methods.

      We are deeply grateful for your careful review of our work and your suggestions. We are also very sorry that, in our last response, we did not address a few of your suggestions appropriately in the manuscript. We hope to address them properly in this revision. Thank you for your contribution to ensuring the scientific rigor and reproducibility of the work.

      The authors do not cite what is probably the most relevant contextual bandit study, by Collins & Frank (2018, PNAS), which uses EEG.

      The authors cite Collins & Molinaro as a form of contextual bandit, but that's not the case (what they call "context" is just the choice set). They should look at the earlier work from Collins, starting with Collins & Frank (2012, EJN).

      We deeply thank you for your comments. We have now added the relevant citations to the manuscript (line 46):

      “These studies utilized different forms of multi-armed bandit tasks, e.g., the restless multi-armed bandit tasks (Daw et al., 2006; Guha et al., 2010), risky/safe bandit tasks (Tomov et al., 2020; Fan et al., 2023; Payzan-LeNestour et al., 2013), contextual multi-armed bandit tasks (Collins & Frank, 2018; Schulz et al., 2015; Collins & Frank, 2012)”

      Daw, N. D., O'doherty, J. P., Dayan, P., Seymour, B., & Dolan, R. J. (2006). Cortical substrates for exploratory decisions in humans. Nature, 441(7095), 876-879.

      Guha, S., Munagala, K., & Shi, P. (2010). Approximation algorithms for restless bandit problems. Journal of the ACM (JACM), 58(1), 1-50.

      Tomov, M. S., Truong, V. Q., Hundia, R. A., & Gershman, S. J. (2020). Dissociable neural correlates of uncertainty underlie different exploration strategies. Nature communications, 11(1), 2371.

      Fan, H., Gershman, S. J., & Phelps, E. A. (2023). Trait somatic anxiety is associated with reduced directed exploration and underestimation of uncertainty. Nature Human Behaviour, 7(1), 102-113.

      Payzan-LeNestour, E., Dunne, S., Bossaerts, P., & O’Doherty, J. P. (2013). The neural representation of unexpected uncertainty during value-based decision making. Neuron, 79(1), 191-201.

      Collins, A. G., & Frank, M. J. (2018). Within-and across-trial dynamics of human EEG reveal cooperative interplay between reinforcement learning and working memory. Proceedings of the National Academy of Sciences, 115(10), 2502-2507.

      Schulz, E., Konstantinidis, E., & Speekenbrink, M. (2015, April). Exploration-exploitation in a contextual multi-armed bandit task. In International conference on cognitive modeling (pp. 118-123).

      Collins, A. G., & Frank, M. J. (2012). How much of reinforcement learning is working memory, not reinforcement learning? A behavioral, computational, and neurogenetic analysis. European Journal of Neuroscience, 35(7), 1024-1035.

      Placing statistical information in a GitHub repository is not appropriate. This needs to be in the main text of the paper. I don't understand why the authors refer to space limitations; there are none for eLife, as far as I'm aware.

      We deeply thank you for your comments. We calculated the average t-value of the brain regions with significant results over the significant time, and added the t-value results to the main text and supplementary materials.

      In answer to my question about multiple comparisons, the authors have added the following: "Note that we did not attempt to correct for multiple comparisons; largely, because the correlations observed were sustained over considerable time periods, which would be almost impossible under the null hypothesis of no correlations." I'm sorry, but this does not make sense. Either the authors are doing multiple comparisons, in which case multiple comparison correction is relevant, or they are doing a single test on the extended timeseries, in which case they need to report that. There exist tools for this kind of analysis (e.g., Gershman et al., 2014, NeuroImage). I'm not suggesting that the authors should necessarily do this, only that their statistical approach should be coherent. As a reference point, the authors might look at the aforementioned Collins & Frank (2018) study.

      We deeply thank you for your comments. We have now replaced all our results with the results after false discovery rate correction and added relevant descriptions (lines 357-358):

      “The significant results after false discovery rate (FDR) (Benjamini & Hochberg, 1995; Gershman et al., 2014) correction are shown in shaded regions. Additional regression results can be found in Supplementary Materials.”

      Benjamini, Y., & Hochberg, Y. (1995). Controlling the false discovery rate: a practical and powerful approach to multiple testing. Journal of the Royal statistical society: series B (Methodological), 57(1), 289-300.

      Gershman, S. J., Blei, D. M., Norman, K. A., & Sederberg, P. B. (2014). Decomposing spatiotemporal brain patterns into topographic latent sources. NeuroImage, 98, 91-102.
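
      For reference, the Benjamini-Hochberg step-up procedure cited above can be sketched in a few lines. This is a minimal illustration, not the analysis code used for the EEG regressions: the function name, the threshold q = 0.05 and the example p-values are hypothetical.

```python
def benjamini_hochberg(p_values, q=0.05):
    """Return a list of booleans marking the tests that survive
    Benjamini-Hochberg FDR control at level q (illustrative sketch)."""
    m = len(p_values)
    if m == 0:
        return []
    # Indices of the p-values sorted in ascending order.
    order = sorted(range(m), key=lambda i: p_values[i])
    # Step-up rule: find the largest rank k (1-based) with p_(k) <= (k / m) * q.
    max_rank = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / m * q:
            max_rank = rank
    # Reject every hypothesis whose rank is at most max_rank.
    rejected = [False] * m
    for idx in order[:max_rank]:
        rejected[idx] = True
    return rejected


# Hypothetical p-values, e.g. one regression test per time bin.
p_per_time_bin = [0.001, 0.008, 0.039, 0.041, 0.0415, 0.60]
print(benjamini_hochberg(p_per_time_bin))
# -> [True, True, True, True, True, False]
```

      Note that the third and fourth p-values exceed their own rank thresholds but are still rejected, because the fifth one passes; this is what distinguishes the step-up rule from testing each p-value against its threshold in isolation.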

      After FDR correction, our results have changed slightly. We have updated our Results and Discussion section.

      It should be acknowledged that the changes in these results may represent a certain degree of error in our data (perhaps because the EEG data is too noisy or because of the average template we used, ‘fsaverage’). Therefore, we added relevant discussion in the Discussion section (lines 527-529):

      “It should be acknowledged that our EEG-based regression results are somewhat unstable, and the brain regions with significant regression are inconsistent before and after FDR correction. In future work, we should collect more precise neural data to reduce this instability.”

      I asked the authors to show more descriptive comparison between the model and the data. Their response was that this is not possible, which I find odd given that they are able to use the model to define a probability distribution on choices. All I'm asking about here is to show predictive checks which build confidence in the model fit. The additional simulations do not address this. The authors refer to figures 3 and 4, but these do not show any direct comparison between human data and the model beyond model comparison metrics.

      We deeply thank you for your comments. We now compare the participants’ behavioral data and the model’s predictions trial by trial (Figure 5). We can clearly see the participants’ behavioral strategies in different states and trials and the model’s prediction accuracy. We have added the discussion related to Figure 5 (lines 309-318):

      “Figure 5 shows the comparison between the active inference model and the behavioral data, where we can see that the model fits the participants’ behavioral strategies well. In the “Stay-Cue” choice, participants tended to ask the ranger and rarely chose not to ask. When the context was unknown, participants chose between the “Safe” option and the “Risky” option largely at random, and they did not show any aversion to variability. When given “Context 1”, where the “Risky” option gave participants a high average reward, participants almost exclusively chose the “Risky” option, which provided more information in the early trials and was found to provide more rewards in the later rounds. When given “Context 2”, where the “Risky” option gave participants a low average reward, participants initially chose the “Risky” option and then tended to choose the “Safe” option. We can see that participants still occasionally chose the “Risky” option in the later trials of the experiment, which the model does not capture. This may be due to the influence of forgetting: participants may have chosen the “Risky” option again to re-establish an estimate of the reward distribution.”
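
      A trial-by-trial comparison of this kind can be sketched as below. This is only an illustration of the general idea of a predictive check, not the code behind Figure 5: the function name, the data layout (one row per participant, one column per trial) and the toy numbers are all hypothetical.

```python
def trialwise_predictive_check(observed_choices, predicted_probs):
    """Compare, for each trial, the observed proportion of a given choice
    with the model's mean predicted probability of that choice.

    observed_choices[p][t] is 1 if participant p chose the option on trial t, else 0.
    predicted_probs[p][t] is the fitted model's probability of that choice.
    Returns a list of (trial, observed_proportion, predicted_mean) tuples.
    """
    n_participants = len(observed_choices)
    n_trials = len(observed_choices[0])
    rows = []
    for t in range(n_trials):
        observed = sum(observed_choices[p][t] for p in range(n_participants)) / n_participants
        predicted = sum(predicted_probs[p][t] for p in range(n_participants)) / n_participants
        rows.append((t, observed, predicted))
    return rows


# Toy data: two participants, three trials, choice coded as 1 = "Risky".
observed = [[1, 1, 0], [1, 0, 0]]
predicted = [[0.9, 0.6, 0.2], [0.7, 0.4, 0.2]]
for trial, obs, pred in trialwise_predictive_check(observed, predicted):
    print(f"trial {trial}: observed {obs:.2f}, model {pred:.2f}")
```

      Plotting the two columns against trial number, separately for each context, gives a direct visual check of where the model tracks behavior and where it does not.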

      Reviewer # 2 (Recommendations For The Authors):

      In the supplement, there are missing references ("[?]").

      Thank you very much for pointing this out. We have now fixed this error.

      Reviewer # 3 (Public review):

      Summary:

      This paper aims to investigate how the human brain represents different forms of value and uncertainty that participate in active inference within a free-energy framework, in a two-stage decision task involving contextual information sampling, and choices between safe and risky rewards, which promotes shifting between exploration and exploitation. They examine neural correlates by recording EEG and comparing activity in the first vs second half of trials and between trials in which subjects did and did not sample contextual information, and perform a regression with free-energy-related regressors against data "mapped to source space."

      Strengths:

      This two-stage paradigm is cleverly designed to incorporate several important processes of learning, exploration/exploitation and information sampling that pertain to active inference. Although scalp/brain regions showing sensitivity to the active-inference related quantities do not necessarily suggest what role they play, they are illuminating and useful as candidate regions for further investigation. The aims are ambitious, and the methodologies impressive. The paper lays out an extensive introduction to the free energy principle and active inference to make the findings accessible to a broad readership.

      Weaknesses:

      In its revised form the paper is complete in providing the important details. Though not a serious weakness, it is important to note that the high lower-cutoff of 1 Hz in the bandpass filter, included to reduce the impact of EEG noise, would remove from the EEG any sustained, iteratively updated representation that evolves with learning across trials, or choice-related processes that unfold slowly over the course of the 2-second task windows.

      We are deeply grateful for your careful review of our work and your suggestions. We are very sorry that we did not modify our filter frequency (it would be a lot of work to modify it). Thank you very much for pointing this out. We noticed the shortcoming of the high lower-cutoff of 1 Hz in the bandpass filter. We will carefully consider the filter frequency when preprocessing data in future work. Thank you very much!

    1. Author Response

      The following is the authors’ response to the previous reviews.

      Public Reviews:

      Reviewer #1 (Public Review):

      The proposed study provides an innovative framework for the identification of muscle synergies taking into account their task relevance. State-of-the-art techniques for extracting muscle interactions use unsupervised machine-learning algorithms applied to the envelopes of the electromyographic signals without taking into account the information related to the task being performed. In this work, the authors suggest including the task parameters in extracting muscle synergies using a network information framework previously proposed. This allows the identification of muscle interactions that are relevant, irrelevant, or redundant to the parameters of the task executed.

      The proposed framework is a powerful tool to understand and identify muscle interactions for specific task parameters and it may be used to improve man-machine interfaces for the control of prostheses and robotic exoskeletons.

      With respect to the network information framework recently published, this work added an important part to estimate the relevance of specific muscle interactions to the parameters of the task executed. However, the authors should better explain what is the added value of this contribution with respect to the previous one, also in terms of computational methods.

      It is not clear how the well-known phenomenon of cross-talk during the recording of electromyographic muscle activity may affect the performance of the proposed technique and how it may bias the overall outcomes of the framework.

      We thank reviewer 1 for their useful commentary on this manuscript.

      Reviewer #2 (Public Review):

      This paper is an attempt to extend or augment muscle synergy and motor primitive ideas with task measures. The authors' idea is to use information metrics (mutual information, co-information) in 'synergy' constraint creation that includes task information directly. By using task related information and muscle information sources and then sparsification, the methods construct task relevant network communities among muscles, together with task redundant communities, and task irrelevant communities. This process of creating network communities may then constrain and help to guide subsequent synergy identification using the authors' published sNM3F algorithm to detect spatial and temporal synergies.

      The revised paper is much clearer and examples are helpful in various ways. However, figure 2 as presented does not convincingly show why task muscle mutual information helps in separating synergies, though it is helpful in defining the various network communities used in the toy example.

      The impact of the information theoretic constraints developed as network communities on subsequent synergy separation are posited to be benign and to improve over other methods (e.g., NNMF). However, not fully addressed are the possible impacts of the methods on compositionality links with physiological bases, and the possibility remains of the methods sometimes instead leading to modules that represent more descriptive ML frameworks that may not support physiological work easily. Accordingly, there is a caveat. This is recognized and acknowledged by the authors in their rebuttal of the prior review. It will remain for other work to explore this issue, likely through testing on detailed high degree of freedom artificial neuromechanical models and tasks. This possible issue with the strategy here likely needs to be fully acknowledged in the paper.

      The approach of the methods seeks to identify task relevant coordinative couplings. This is a meta problem for more classical synergy analyses. Classical analyses seek compositional elements stable across tasks. These elements may then be explored in causal experiments and generative simulations of coupling and control strategies. However, task-based understanding of synergy roles and functional uses is significant and is clearly likely to be aided by methods in this study.

      Information based separation has been used in muscle synergy analyses using infomax ICA, which is information based at core. Though linear mixing of sources is assumed in ICA, minimized mutual information among source (synergy) drives is the basis of the separation and detects low variance synergy contributions (e.g., see Yang, Logan, Giszter, 2019). In the work in this paper, instead, mutual information approaches are used to cluster muscles and task features into network communities preceding the sNM3F algorithm use for separation, rather than using minimized information in separation. This contrast of an accretive or agglomerative mutual information strategy here used to cluster into networks, versus a minimizing mutual information source separation used in infomax ICA epitomizes a key difference in approach here.

      Physiological causal testing of synergy ideas is neglected in the literature reviews in the paper. Although these are only in animal work (Hart and Giszter, 2010; Takei and Seki, 2017), the clear connection of muscle synergy analysis choices to physiology is important, and eventually these issues need to be better managed and understood in relation to the new methods proposed here, even if not in this paper.

      Analyses of synergies using the methods the paper has proposed will likely be very much dependent on the number and quality of task variables included and how these are managed, and the impacts of these on the ensuing sparsification and network communities used prior to sNM3F. The authors acknowledge this in their response. This caveat should likely be made very explicit in the paper.

      It would be useful in the future to explore the approach described with a range of simulated data to better understand the caveats, and optimizations for best practices in this approach.

      A key component of the reviewers’ arguments here is their reductionist view of muscle synergies vs the emergentist view presented in our work here. In the reductionist lens, muscle groupings are the units (‘building blocks’) of coordinated movement and thus the space of intermuscular interactions is of particular interest for understanding movement construction. On the other hand, the emergentist view suggests that muscle groupings emerge from interactions between constituent parts (as quantified here using information theory, synergistic information is the information found when both activities are observed together). This is in line with recent work in the field showing modular control at the intramuscular level, exemplifying a scale-free phenomenon. Nonetheless, we consider these approaches to muscle synergy research as complementary and beneficial for the field overall going forward.

      Reviewer #3 (Public Review):

      In this study, the authors developed and tested a novel framework for extracting muscle synergies. The approach aims at removing some limitations and constraints typical of previous approaches used in the field. In particular, the authors propose a mathematical formulation that removes constraints of linearity and couples the synergies to their motor outcome, supporting the concept of functional synergies and distinguishing the task-related performance related to each synergy. While some concepts behind this work were already introduced in recent work in the field, the methodology provided here encapsulates all these features in an original formulation providing a step forward with respect to the currently available algorithms. The authors also successfully demonstrated the applicability of their method to previously available datasets of multi-joint movements.

      Preliminary results positively support the scientific soundness of the presented approach and its potential. The added values of the method should be documented more in future work to understand how the presented formulation relates to previous approaches and what novel insights can be achieved in practical scenarios and confirm/exploit the potential of the theoretical findings.

      In their revision, the authors have implemented major revisions and improved their paper. The work was already of good quality and now it has improved further. The authors were able to successfully:

      • improve the clarity of the writing (e.g.: better explaining the rationale and the aims of the paper);

      • extend the clarification of some of the key novel concepts introduced in their work, like the redundant synergies;

      • show a scenario in which their approach might be useful for increasing the understanding of motor control in patients with respect to traditional algorithms such as NMF. In particular, their example illustrates why considering the task space is a fundamental step forward when extracting muscle synergies, improving the practical and physiological interpretation of the results.

      We thank reviewer 3 for their constructive commentary on this manuscript.

      Recommendations for the authors:

      Reviewer #1 (Recommendations For The Authors):

      Figure 3 should report the distances between reaching points in panel A and the actual length distances of the walking paths in panel C.

      The caption of fig.3 concerning the experimental setup of the datasets analysed has been updated with the following for dataset 1: “(A) Dataset 1 consisted of participants executing table-top point-to-point reaching movements (40cm distance from starting point P0) across four targets in forward (P1-P4) and backwards (P5-P8) directions at both fast and slow speeds (40 repetitions per task) [25]. The muscles recorded included the finger extensors (FE), brachioradialis (BR), biceps brachii (BI), medial-triceps (TM), lateral-triceps (TL), anterior deltoid (AD), posterior deltoid (PD), pectoralis major (PE), latissimus dorsi (LD) of the right, reaching arm.”. For dataset 3, to the best of the authors’ knowledge, this information was not given in the original paper.

      Figure 4, what is the unit of the data shown?

      The unit of bits is now mentioned in the toy example figure caption and in the caption of fig.5.

      Figure 4, the characteristics of the interactions are not fully clear, and the graphical representation should be improved.

      We have made steps to improve the clarity of the figures presented.

      For dataset 3, τ was the movement kinematics, but it is not specified how the task parameters were formulated. Did the authors use the data from all 32 kinematic markers, 4 IMUs, and force plates? If yes, it should be specified why all these signals were used. For sure, there will be signals included that are not relevant to the specific task. Did the authors select specific signals based on their relevance to the task (e.g., ankle kinematics)?

      We have now clarified this in the text as follows: “For datasets 1 and 2, we determine the MI between vectors with respect to several discrete task parameters representing specific task attributes (e.g. reaching direction, speed etc.), while for dataset 3 we determined the task-relevant and -irrelevant muscle couplings in an unassuming way by quantifying them with respect to all available kinematic, dynamic and inertial measurement unit (IMU) features.”
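
      The idea of quantifying, in bits, how much a muscle signal tells us about a discrete task parameter can be illustrated with a simple plug-in estimate of mutual information. The sketch below is only an illustration: the function name, the binarised activity values and the toy task labels are hypothetical, and the estimator used in the manuscript may well differ.

```python
from collections import Counter
from math import log2


def mutual_information_bits(x, y):
    """Plug-in estimate of the mutual information I(X; Y) in bits
    between two equally long sequences of discrete symbols."""
    n = len(x)
    if n == 0 or n != len(y):
        raise ValueError("x and y must be non-empty and of equal length")
    joint = Counter(zip(x, y))  # joint counts of (x, y) pairs
    count_x = Counter(x)        # marginal counts of x
    count_y = Counter(y)        # marginal counts of y
    mi = 0.0
    for (xi, yi), c in joint.items():
        p_xy = c / n
        # p(x, y) / (p(x) p(y)) written in terms of raw counts.
        ratio = c * n / (count_x[xi] * count_y[yi])
        mi += p_xy * log2(ratio)
    return mi


# Toy example: binarised muscle activity (0 = low, 1 = high) and movement speed.
activity = [0, 0, 1, 1]
speed = ["slow", "slow", "fast", "fast"]
print(mutual_information_bits(activity, speed))  # activity fully determines speed: 1.0 bit
```

      With continuous EMG envelopes, a binning step or a different estimator would be needed before a count-based estimate like this one applies.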

      How did the authors ensure that crosstalk did not affect their analysis, particularly between, e.g., finger extensors and brachioradialis and posterior deltoid and anterior deltoid (dataset 1)?

      We have addressed this point in the previous round of reviews and made an explicit statement regarding cross-talk in the discussion section: “Although distinguishing task-irrelevant muscle couplings may capture artifacts such as EMG crosstalk, our results convey several physiological objectives of muscles including gross motor functions [66], the maintenance of internal joint mechanics and reciprocal inhibition of contralateral limbs [19,51].”

      It would be informative to add some examples of not trivial/obvious task-related synergistic muscle combinations that have been extracted in the three datasets. Most of the examples reported in the manuscript are well-known biomechanically and quite intuitive, so they do not improve our understanding of synergistic muscle control in humans.

      Our framework improves our understanding of synergistic motor control by enabling the formal quantification of synergistic muscle interactions, a capability not present among current approaches. Regarding the implications of this advance in terms of concrete examples, we have further clarified our examples presented in the results section, for example:

      “Across datasets, many of the muscle networks could be characterised by the transmission of complementary task information between functionally specialised muscle groups, many of which were identified among the task-redundant representations (Fig.9-10 and Supp. Fig.2). The most obvious example of this is the S3 synergist muscle network of dataset 2 (Fig.11), which captures the complementary interaction between task-redundant submodules identified previously (S3 (Fig.9)).”

      The description shows how our framework can extract the cross-module interactions that align with the higher-level objectives of the system, here the synergistic connectivity between the upper and lower body modules. Current approaches can only capture redundant and task-irrelevant interactions. Thus our framework provides additional insight into movement control.

      The number of participants in dataset 2 is very limited and should be increased.

      We appreciate the reviewer's comment and would like to point out that for dataset 2 our aim was to increase the number of muscles (30), tasks (72) and trials for each task (30), which produced a very large dataset for each participant. This came at the expense of a low number of participants; however, all our statistical analyses here can be performed at the single-participant level. Furthermore, dataset 3 includes 25 participants, which enables us to demonstrate the reliability of the findings across participants.

      Reviewer #2 (Recommendations For The Authors):

      I believe it is important in the future to explore the approach proposed with a range of simulation data and neuromechanical models, to explore the issues I have raised and that you have acknowledged, though I agree it is likely out of scope for the paper here.

      We agree with the reviewer that this would be valuable future work and indeed plan to do this in our future research.

      The Github code for this paper should likely include the various data sets used in the paper and figures, appropriately anonymized, in order to allow the data to be explored and analyses replicated and package demonstrated to be exercised fully by a new user.

      We thank the reviewer for this suggestion. Dataset 3 is already available online at https://doi.org/10.1016/j.jbiomech.2021.110320. We will also make the other two datasets publicly available on our lab website very soon. Until then, as stated in the manuscript, we will make them available to anyone upon reasonable request.

      Reviewer #3 (Recommendations For The Authors):

      I have the following open points to suggest to the authors:

      First, I recommend improving the quality of the figures: in the pdf version I downloaded, some writings are impossible to read.

      We fully agree with the reviewer and note that in the pdf version of the paper, the figures are a lot worse than in the submitted Word document. Nevertheless, we will make further improvements on the figures as requested.

      Even though the manuscript has improved, I still feel that some points were not addressed or were only partially addressed. In particular:

      • The proposed comparison with NMF helps understanding why incorporating the task space is useful (and I fully agree with the authors about this point as the main reason to propose their contribution). However, the comparison does not help the reader to understand whether the synergies incorporating the task space are biased by the introduction of the task variables.

      This question can be also reformulated as: are muscle synergies modified when task space variables are incorporated? Is the "weight" on task coefficients affecting the composition of muscle synergies? If so, the added interpretational power is achieved at the cost of losing the information regarding the neural substrate of synergies? I understand this point is not immediate to show, but it would increase the quality of the work.

      • References to previous approaches that aimed to include task variables in synergy extraction are still missing in the paper. Even though it is not required to provide quantitative comparisons with other available approaches, there are at most 2-3 available algorithms in the literature (kinematics-EMG; force-EMG) that should not be neglected in this work. What did previous approaches achieve? What was improved with this approach? What was not improved?

      Previous attempts of extracting synergies with non-linear approaches could also be described more.

      In the latest version of the manuscript, we have referenced both the mixed NMF and autoencoder-based algorithms. In both the introduction and discussion sections of the manuscript, we also specify that our framework quantifies and decomposes muscle interactions in a novel way that cannot be achieved by other current approaches. In the results section we use examples from 3 different datasets to make this point clear, providing intuition on the use cases of our framework.

    1. Author response:

      The following is the authors’ response to the previous reviews.

      Reviewer 1:

      Comments on revisions:

      This manuscript is in some ways improved - mainly by toning down the conclusions - but a few major weaknesses have not been addressed. I do not agree that it is not justified to perform experiments to investigate the sterility of single CDK8 knockout mice since this could be important and given that the new data show that while there is some overlap in expression of the two paralogues, there are also significant differences in the testis. At the least, it would have been interesting and easy to do to show the expression of CDK8 and CDK19 in the single cell transcriptomics, since this might help to identify the different populations.

      Certainly, we tried to analyse Cdk8/Cdk19 in the single-cell transcriptomics data. However, we were unable to draw a clear conclusion. Owing to the limited sensitivity of single-cell sequencing, especially for low-abundance transcripts such as transcription factors (with the 10x technology used in our study) (Chuang et al., 2024), it is challenging to establish with certainty which tissues are CDK8/19-positive and which are negative from single-cell data, because both transcripts are minor. Nevertheless, the majority of cell types showed some expression of CDK8/19, with maximum expression in pachytene/diplotene spermatocytes. We did not include these data in the manuscript, particularly as we were able to assess Cdk8/19 expression patterns using IF approaches.

      Author response image 1.

      The only definitive way of concluding a kinase-independent phenotype is to rescue with a kinase dead mutant. While I agree that the inhibitors have been well validated, since they did not have any effects, it is hard to be sure that they actually reached their targets in the tissue concerned. This could have been done by cell thermal shift assay. In the absence of any data on this, the conclusion of a kinase-independent effect is weak.

      We totally agree with this point, but it takes several years to produce mice with inducible expression of kinase-dead (KD) CDK8 on the DKO background. These experiments are already underway in our lab; however, their results will be published in future work.

      Figure 2 legend includes (G) between (B) and (C), and appears to, in fact, refer to Fig 1E, for which the legend is missing the description.

      Thank you, we corrected this.

      Finally, Figure S1C appears wrong. Goblet cells are not in the crypt but on the villi (so the graph axis label is wrong), and there are normally between 5 and 15 per villus, so the iDKO figure is normal, but there are a surprisingly high number of goblet cells in the controls. And normally there are 10-15 Paneth cells/crypt, so it looks like these have been underestimated everywhere. I wonder how the counting was done - if it is from images such as those shown here then I am not surprised as the quality is insufficient for quantification. How many crypts and villi were counted? Given the difficulty in counting and the variability per crypt/villus, with quantitative differences like this it is important to do quantifications blind. I personally wouldn't conclude anything from this data and I would recommend to either improve it or not include it. If these data are shown, then data showing efficient double knockout in this tissue should also accompany it, by IF, Western or PCR. Otherwise, given a potentially strong phenotype, repopulation of the intestine by unrecombined crypts might have occurred - this is quite common (see Ganuza et al, EMBO J. 2012).

      We added fig. S1C with a Western blot showing the presence of CDK8 and CCNC in WT intestine and their absence in the DKO intestine. We also corrected the text to state that the part of the intestine analyzed was the duodenum, not the ileum. We also replaced the intestine section photos with ones of better quality and higher magnification (200X) and corrected the Y-axis label. We apologize for the confusion, and thank the reviewer for the careful analysis of our data, which allowed us to make this correction. The cells were counted at 600x magnification, and the magnification given in the article is for presentation purposes only. Our number of goblet cells was indeed calculated per villus, not per crypt, and the resulting number is similar to those reported by Dannappel et al. (Dannappel et al., 2022). As for Paneth cells, their numbers correspond to those in several articles that use the C57BL/6 strain (Brischetto et al., 2021; King et al., 2013), as the number of Paneth cells differs between different parts of the intestine and different mouse strains (Nakamura et al., 2020).

      Reviewer 2:

      This reviewer appreciated the authors' effort in improving the quality of this manuscript during their revision. While some concerns remain, the revision is a much improved work and the authors addressed most of my major concerns.

      Figure 2E CDK8 and CDK19 immunofluorescent staining images seem to show CDK8 and CDK19 location are completely distinct and in different cells, the authors need to elaborate on this results and discuss what such a distinct location means in line of their double knockout data.

      We thank the reviewer for this suggestion. We have expanded the discussion in lines 518 and 529 and included a better-quality picture at 200x magnification. Our main line of reasoning is that, despite distinct expression in different cell types, high magnification shows a certain level of expression of both proteins in most cells, so single knockouts will not demonstrate more than a slight phenotype, while the double knockout will have the full effect. This is especially true if our hypothesis that CCNC stabilization is important here is correct, as both kinases can stabilize the protein.

      Minor comments:

      Supplemental figure 1(C) legend typo : (C) Periodic acid-Schiff stained sections of ilea of tamoxifen treated R26/Cre/ERT2 and DKO mice.

      Thank you, we corrected this.

      While the effort to identify and generate new antibodies is appreciated, the specificity of the antibodies used should be examined and presented if available.

      The specificity of the antibodies for the western blot is confirmed in figure S1F. We added fig. S1G with IF staining of CDK19 KO testes proving our CDK19 antibody specificity.

      References:

      Brischetto C., Krieger K., Klotz C., et al. 2021. NF-κB determines Paneth versus goblet cell fate decision in the small intestine. Development 148. doi:10.1242/dev.199683

      Chuang H.-C., Li R., Huang H., et al. 2024. Single-cell sequencing of full-length transcripts and T-cell receptors with automated high-throughput Smart-seq3. BMC Genomics 25:1127. doi:10.1186/s12864-024-11036-0

      Dannappel M.V., Zhu D., Sun X., et al. 2022. CDK8 and CDK19 regulate intestinal differentiation and homeostasis via the chromatin remodeling complex SWI/SNF. J Clin Invest 132. doi:10.1172/JCI158593

      King S.L., Mohiuddin J.J., Dekaney C.M. 2013. Paneth cells expand from newly created and preexisting cells during repair after doxorubicin-induced damage. Am J Physiol Gastrointest Liver Physiol 305:G151–62. doi:10.1152/ajpgi.00441.2012

      Nakamura K., Yokoi Y., Fukaya R., et al. 2020. Expression and localization of Paneth cells and their α-defensins in the small intestine of adult mouse. Front Immunol 11:570296. doi:10.3389/fimmu.2020.570296

    1. Author response:

      The following is the authors’ response to the current reviews.

      Public Reviews:

      Reviewer #1 (Public review):

      Summary:

      This study by Wang et al. identifies a new type of deacetylase, CobQ, in Aeromonas hydrophila. Notably, the identification of this deacetylase reveals a lack of homology with eukaryotic counterparts, thus underscoring its unique evolutionary trajectory within the bacterial domain.

      Strengths:

      The manuscript convincingly illustrates CobQ's deacetylase activity through robust in vitro experiments, establishing its distinctiveness from known prokaryotic deacetylases. Additionally, the authors elucidate CobQ's potential cooperation with other deacetylases in vivo to regulate bacterial cellular processes. Furthermore, the study highlights CobQ's significance in the regulation of acetylation within prokaryotic cells.

      Weaknesses:

      The problem I raised has been well resolved. I have no further questions.

      Thank you very much for your valuable comments.

      Reviewer #2 (Public review):

      In recent years, lots of researchers tried to explore the existence of new acetyltransferase and deacetylase by using specific antibody enrichment technologies and high resolution mass spectrometry. Here is an example for this effort. Yuqian Wang et al. studied a novel Zn2+- and NAD+-independent KDAC protein, AhCobQ, in Aeromonas hydrophila. They studied the biological function of AhCobQ by using biochemistry method and MS identification technology to confirm it. These results extended our understanding of the regulatory mechanism of bacterial lysine acetylation modifications. However, I find this conclusion is a little speculative, and unfortunately it also doesn't totally support the conclusion as the authors provided.

      Major concerns:

      - It is a little arbitrary to come to the title "Aeromonas hydrophila CobQ is a new type of NAD+- and Zn2+-independent protein lysine deacetylase in prokaryotes." It should be modified to delete the "in the prokaryotes" except that the authors get new more evidence in the other prokaryotes for the existence of the AhCobQ.

      Thank you for your suggestion. However, I believe there has been some confusion regarding the title. In the revised manuscript we have already updated the title to: "Aeromonas hydrophila CobQ is a new type of NAD+- and Zn2+-independent protein lysine deacetylase."

      This title does not include the phrase "in prokaryotes," as you mentioned. We kindly suggest verifying the version of the manuscript that was reviewed to ensure you are reviewing the most recent changes.

      - I was confused about the arrangement of the supplementary results. Because there are no citations for Figures S9-S19.

      Thank you for your feedback. It appears there may have been a misunderstanding, possibly due to reviewing an outdated version of the manuscript. In the revised manuscript we revised the supplementary figures and now have only 12 figures, all of which are correctly cited in the manuscript on pages 8 to 15. Below is a detailed list of the updated figure citations:

      Figure S1: page 8, line 148;

      Figure S2: page 9, line 168;

      Figures S3 and S4: page 10, line 178;

      Figure S5: page 10, line 186;

      Figure S6: page 10, line 189;

      Figure S7: page 12, line 221;

      Figures S8-S10: page 13, line 245;

      Figure S11: page 11, line 282;

      Figure S12: page 15, line 286.

      - Same to the above, there are no data about Tables S1-S6.

      Thank you for your attention to the supplementary materials. As with the figures, we have already uploaded the data for Tables S1-S6 in the revised manuscript on November 19, 2024, and properly cited Tables S1 – S6 in the manuscript. Below is the citation information:

      Table S1: page 10, line 194;

      Table S2: page 13, line 245;

      Table S3: page 21, line 438;

      Table S4: page 22, line 439;

      Table S5: page 22, line 445;

      Table S6: page 27, line 564.

      Please note that Tables S3 – S4 include the chemical reagents, primers, and other experimental materials, which are not intended to be cited in the results section.

      - All the load control is not integrated. Please provide all of the load controls with whole PAGE gel or whole membrane western blot results. Without these whole results, it is not convincing to come the conclusion as the authors mentioned in the context.

      Thank you for your comment. Please note that the full membrane western blot results were included in the revised manuscript. We hope this satisfies your request. If you need further clarification or additional data, please do not hesitate to let us know.

      - Thoroughly review the materials & methods section. It is unclear to me what exactly the authors describe in the method. All the experimental designs and protocols should be described in detail, including growth conditions, assay conditions, and purification conditions, etc.

      Thank you for your valuable suggestion. In response to your comment and previous feedback, we have already thoroughly revised the Materials & Methods section in the revised manuscript. The experimental details, including growth conditions, assay protocols, and purification procedures, are described in full on pages 22 to 30 of the revised manuscript.

      - Include relevant information about the experiments performed in the figure legends, such as experimental conditions, replicates, etc. Often it is not clear what was done based on the figure legend description.

      Thank you very much for your detailed feedback and suggestions. We have made sure to describe what each data point represents in the figure legends, as per the previous feedback. However, we would like to clarify that while we have provided detailed descriptions in the legends, the inclusion of every specific experimental condition in the figure legends could result in redundancy, as these details are already thoroughly outlined in the Materials & Methods section.

      We hope this explanation addresses your concern.

      Recommendations for the authors:

      Reviewer #1 (Recommendations for the authors):

      I have no further revision comments.

      Thank you very much.

      Reviewer #2 (Recommendations for the authors):

      I carefully read the point-to-point response from the author. Although they listed lots of the reasons for the ugly results, it still can not persuade me to accept their conclusions. While, as I know, it is impossible to reject their work in eLife as it was sent out for peer-review. I also can't accuse them of being wrong, but I have my opinion on this point. That is not the results, but the attitude.

      Thank you for your feedback. However, I must express some concerns regarding the nature of your comments. Based on the issues you've raised, it seems that you may have reviewed an outdated version of the manuscript. In the updated revision we addressed all the points you've raised, including the figure and table citations, experimental methods, and data integration.

      We understand that differing opinions are part of the peer-review process, but we respectfully believe that your conclusion regarding our attitude is based on a misunderstanding, possibly caused by reviewing an incorrect version of the manuscript. We have always strived to approach this manuscript with utmost professionalism and have diligently responded to each of your concerns.

      We sincerely suggest reviewing the latest version of our manuscript, and we welcome any further constructive feedback. We hope this clarifies any misunderstandings and look forward to your continued support.

      Thank you for your time and thoughtful consideration.


      The following is the authors’ response to the original reviews.

      Public Reviews:

      Reviewer #1 (Public review):

      Summary:

      This study by Wang et al. identifies a new type of deacetylase, CobQ, in Aeromonas hydrophila. Notably, the identification of this deacetylase reveals a lack of homology with eukaryotic counterparts, thus underscoring its unique evolutionary trajectory within the bacterial domain.

      Strengths:

      The manuscript convincingly illustrates CobQ's deacetylase activity through robust in vitro experiments, establishing its distinctiveness from known prokaryotic deacetylases. Additionally, the authors elucidate CobQ's potential cooperation with other deacetylases in vivo to regulate bacterial cellular processes. Furthermore, the study highlights CobQ's significance in the regulation of acetylation within prokaryotic cells.

      Weaknesses:

      The problem I raised has been well resolved. I have no further questions.

      Reviewer #2 (Public review):

      In recent years, lots of researchers tried to explore the existence of new acetyltransferase and deacetylase by using specific antibody enrichment technologies and high resolution mass spectrometry. Here is an example for this effort. Yuqian Wang et al. studied a novel Zn2+- and NAD+-independent KDAC protein, AhCobQ, in Aeromonas hydrophila. They studied the biological function of AhCobQ by using biochemistry method and MS identification technology to confirm it. These results extended our understanding of the regulatory mechanism of bacterial lysine acetylation modifications. However, I find this conclusion is a little speculative, and unfortunately, it also doesn't totally support the conclusion as the authors provided.

      Reviewer #3 (Public review):

      Summary:

      This study reports on a novel NAD+ and Zn2+-independent protein lysine deacetylase (KDAC) in Aeromonas hydrophila, termed as AhCobQ (AHA_1389). This protein is annotated as a CobQ/CobB/MinD/ParA family protein and does not show similarity with known NAD+-dependent or Zn2+-dependent KDACs. The authors showed that AhCobQ has NAD+ and Zn2+-independent deacetylase activity with acetylated BSA by western blot and MS analyses. They also provided evidence that the 195-245 aa region of AhCobQ is responsible for the deacetylase activity, which is conserved in some marine prokaryotes and has no similarity with eukaryotic proteins. They identified target proteins of AhCobQ deacetylase by proteomic analysis and verified the deacetylase activity using site-specific Kac proteins. Finally, they showed that AhCobQ activates isocitrate dehydrogenase by deacetylation at K388.

      Strengths:

      The finding of a new type of KDAC has a valuable impact on the field of protein acetylation. The characters (NAD+ and Zn2+-independent deacetylase activity in an unknown domain) shown in this study are very unexpected.

      Weaknesses:

      (1) The characters (NAD+ and Zn2+-independent deacetylase activity in an unknown domain) shown in this study are very unexpected. To convince readers, MSMS data must be necessary to accurately detect (de)acetylation at the target site in the deacetylase activity assay. The authors showed the MSMS data in assays with acetylated BSA, but other assays only rely on western blot.

      (2) They prepared site-specific Kac proteins and used them in deacetylase activity assays. Incorporation of acetyllysine at the target site should be confirmed by MSMS and shown as supplementary data.

      (3) The authors imply that the 195-245 aa region of AhCobQ may represent a new domain responsible for deacetylase activity. The feature of the region would be of interest but is not sufficiently described in Figure 5. The amino acid sequence alignments with representative proteins with conserved residues would be informative. It would be also informative if the modeled structure predicted by AlphaFold is shown and the structural similarity with known deacetylases is discussed.

      Recommendations for the authors:

      Reviewer #1 (Recommendations for the authors):

      The problem I raised has been well resolved. I have no further questions.

      Reviewer #2 (Recommendations for the authors):

      Questions to response of"-The load control is not all integrated. All of the load controls with whole PAGE gel or whole membrane western blot results should be provided. Without these whole results, it is not convincing to come to the conclusion that the authors have."

      Just as the Authors answered. The Coomassie Blue R-350 staining outcomes from the PVDF membranes. That is a good control for the experiment. However, I still have several questions about it:

      (1) The first is the quality of these Western blot. Why all the bands of these Western blot is so ugly? To tell the truth, it is very difficult to come to a conclusion from these poor western blots.

      We appreciate your feedback regarding the quality of the Western blots presented in Figure 7. We believe the “ugly bands” you referred to reflect our results validating the functions of CobQ through the use of recombinant site-specific Kac protein substrates.

      In our study, we meticulously engineered these recombinant site-specific Kac proteins using a two-plasmid system, based on foundational research published in Nature Chemical Biology (2017, 13(12): 1253-1260), which introduced the genetic encoding of Nε-acetyllysine into recombinant proteins. However, we faced a common challenge: protein truncation due to premature translation termination at the reassigned codon. This issue not only hampers protein yields, as discussed in ChemBioChem (2017, 18(20): 1973-1983), but also contributes to the suboptimal appearance of the Western blot results.

      Despite conducting at least two independent repetitions for the Western blot analysis of the site-specific Kac proteins, which yielded consistent results, we recognize that the overall quality remains less than ideal. This variability is inherently related to the characteristics of the target proteins. Nevertheless, the primary aim of our manuscript is to validate the novel deacetylase activity of CobQ. We have provided multiple lines of evidence, including mass spectrometry (MS/MS) and Western blot analyses, to substantiate this claim. In response to your comments, we have decided to remove the ambiguous Western blot results from Figure 7, retaining only four figures that demonstrate significant differences across at least two independent replicates (Author response images 1-5). Additionally, we have included four biological replicates of the Western blot results for ICD Kac388 + CobQ in the supplementary materials (Author response image 5) to further validate the deacetylase function of CobQ.

      Author response image 1.

      Western blot validation of the Kac26 AcrA-2 protein substrates regulated by the three KDACs in two biological replicates.

      Author response image 2.

      Western blot validation of the Kac48 Sun protein substrates regulated by the three KDACs in two biological replicates.

      Author response image 3.

      Western blot validation of the Kac103 Sun protein substrates regulated by the three KDACs in two biological replicates.

      Author response image 4.

      Western blot validation of the Kac195 Eno protein substrates regulated by the three KDACs in three biological replicates.

      Author response image 5.

      Western blot validation of the Kac388 ICD protein substrates regulated by AhCobQ in this study. Each sample was independently repeated at least three times.

      (2) The second is why some of the results are not from the same PVDF by comparing the Coomassie staining with the WB results just as authors responded. For example, the HrpA-K816 (ac), Eno-K195 (ac), ArcA-2-K26 (ac), ArcA-2-K26 (ac), IscS-K93(ac), A0KJ75-K81(ac), GyrB-K331(ac), GyrB-K449(ac), FtsA-K320(ac), FtsA-K409(ac), RecA-K279(ac), and the RecA-K306(ac). All of them are clearly not from the same staining results of PVDF membrane but from a new PVDF membrane.

      We assure you that the R-350 stained PVDF membranes originate from the same Western blot membranes. However, we acknowledge that visual discrepancies may arise due to differences in imaging techniques. The Western blot results were scanned using a ChemiDoc MP (Bio-Rad, Hercules, CA, USA), while the Coomassie R-350 stained PVDF membranes were captured using a standard camera. These differences can create a misleading appearance, making it seem as though they come from different membranes.

      It is also important to note that the intensity of the protein marker cannot be directly compared between the two imaging methods. As illustrated in Author response image 6, the protein marker at 70 kDa is clearly detectable in the Coomassie R-350 image, whereas it may not be as apparent in the Western blot result due to inherent differences in detection sensitivity.

      Author response image 6.

      Comparison of the Western blot and Coomassie R-350 stained images of the same protein marker on the same PVDF membrane. The protein marker at 70 kDa is easily detected with Coomassie R-350 staining but is difficult to see in the Western blot image.

      Additionally, we have removed some of the so-called "ugly" Western blot results in the updated manuscript and provided the original full film of the relevant images as an attachment. This documentation demonstrates that all the data you referenced originate from the same film, as shown in Figures 1-5.

      (3) The third is why there is no replication for all these WB results. We should draw a conclusion with serious attitude, but not from the only one repeat, even say nothing about the poor results.

      Thank you for your valuable suggestion. In the second version of the manuscript, we have included the original full film of the relevant images. While we previously explained the reasons behind the "ugly" Western blot results, we have decided to remove some, or even all, of these results from Figure 7 in the updated version. The related images will be updated in the supplementary materials (Figures 1-5 in the response letter and Figure 7 in the revised manuscript).

      Furthermore, we have provided a more detailed discussion regarding the poor results in the updated manuscript to ensure clarity and transparency. We appreciate your understanding and hope these changes meet your expectations.

      Questions to response of " L174-187, L795 (Please show the whole membrane (or PAGE gel) of the loading control of CobB, and CobQ, except for the Kac-BSA)".

      (1) As we all educated that there is no control, and no biology. Where is the band of CobQ? Why do not stain the same PVDF membrane with R-350 staining but with a new membrane?

      Thank you for your insightful feedback. As noted in our previous response, the absence of visible bands for AhCobQ and AhCobB on the Coomassie R-350 stained PVDF membrane is primarily due to the low loading amounts and protein loss during the Western blotting process.

      To reinforce our findings, we repeated the analysis of the protein samples via SDS-PAGE, using the same loading quantity as in the previous Western blot shown in Figure 2 of the manuscript. As illustrated in Author response image 7, the bands for CobB and CobQ are discernible, albeit with significantly lower intensities compared to the Kac-BSA bands. Upon examining the full Coomassie R-350 stained PVDF membranes provided in Supplementary Material 1, we observe that the CobB and CobQ bands are not easily visible. This aligns with your observations and can be attributed to potential protein loss during the transfer from SDS-PAGE to the PVDF membrane.

      Author response image 7.

      The SDS-PAGE gel displayed the loading amounts of Kac-BSA and CobB/CobQ.

      To enhance the visibility of the CobQ/CobB bands, we increased the loading of CobQ/CobB in a new Western blot experiment, using 2 µg of Kac-BSA in combination with 0.8 µg of CobQ/CobB. As shown in Author response image 8, while the increased amount of Kac-BSA resulted in a more blurred signal, the bands for the recombinant CobQ and CobB proteins were clearly detectable. This indicates that both proteins were indeed present in the in vitro protein deacetylation assay.

      Author response image 8.

      Western blot verified the deacetylase activity assay of AhCobQ and AhCobB on Kac-BSA.

      Furthermore, we conducted a mass spectrometry analysis comparing Kac-BSA and Kac-BSA incubated with CobQ, as well as BSA without acetylation, against the A. hydrophila database with a cut-off of unique matched peptides >1. It is challenging to completely avoid contaminant detection during protein purification, especially when using high-resolution mass spectrometry. Our findings revealed that CobQ has the highest number of unique matched peptides (Author response table 1), while contaminants such as AHA_3036, AHA_0497, AHA_1279, and valS could be excluded, as they were present in Kac-BSA or BSA samples. Additionally, Tuf1, RplQ, GroEL, RpsF, RpsU, RpsB, RpsO, and RpsJ are known ribosomal subunits or chaperonins that are abundantly expressed in cells and may interact with various proteins, leading to contaminant detection.

      Author response table 1.

      LC MS/MS results of selected peptide quantification among Kac-BSA and Kac-BSA incubated with CobQ and BSA without acetylation against A. hydrophila database (unique matched peptides>1).
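
      The exclusion logic described above (keep proteins with more than one unique matched peptide, then discard any that also appear in the Kac-BSA or BSA control samples) can be sketched as follows. This is only an illustrative sketch, not the authors' actual pipeline: the function name `candidate_proteins`, the dictionary layout, the entry `hypothetical_protein_X`, and all peptide counts are assumptions made up for the example.

```python
def candidate_proteins(treated, controls, min_unique=1):
    """Return proteins detected in the treated sample but in none of the controls.

    Each sample is a dict mapping protein name -> number of unique matched
    peptides. A protein counts as detected only if it has MORE than
    `min_unique` unique peptides (the ">1" cut-off quoted in the response).
    """
    seen_in_controls = set()
    for sample in controls:
        seen_in_controls.update(
            name for name, n_unique in sample.items() if n_unique > min_unique
        )
    return {
        name: n_unique
        for name, n_unique in treated.items()
        if n_unique > min_unique and name not in seen_in_controls
    }


# Made-up unique-peptide counts, for illustration only (NOT the reported data).
kac_bsa_plus_cobq = {
    "CobQ": 20,
    "AHA_3036": 4,
    "valS": 3,
    "GroEL": 2,
    "hypothetical_protein_X": 1,  # fails the >1 cut-off
}
kac_bsa = {"AHA_3036": 5, "valS": 2}
bsa = {"AHA_3036": 3}

candidates = candidate_proteins(kac_bsa_plus_cobq, [kac_bsa, bsa])
top_hit = max(candidates, key=candidates.get)
print(candidates, top_hit)
```

      Note that a filter of this kind does not by itself remove abundant co-purifying proteins such as chaperonins or ribosomal subunits; those still need the separate argument given in the response.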

      Although AceE, a pyruvate dehydrogenase E1 component, could theoretically possess deacetylase activity, we consider this unlikely. First, in the SDS-PAGE gel of the purified recombinant protein, CobQ is the major band, with other proteins present at very low levels (less than 1/10 of CobQ). This suggests that significant deacetylation by contaminants is unlikely. Second, we purified His-tagged AhCobQ and GST-fused AhCobQ separately and tested their deacetylase activities. As shown in Figure S4 of the updated manuscript, both purified AhCobQ proteins exhibited deacetylase activity, while the negative control (purified GST protein only) did not, further supporting our conclusion that enzyme activity is not attributable to contaminating proteins (Figure S5).

      (2) Without the CobB and CobQ bands, it is impossible to say the function of CobQ is a new deacetylase. To avoid this confusion, it is easy to run a new gel and stain it with anti-His antibody to show these deacetylases.

      Thank you very much for your suggestion. We have performed this experiment as you suggested; please see our response to comment (1).

      (3) The explanation about the CobB/CobQ bands are not visible is not acceptable. Because the molecular weight of the CobB and CobQ is smaller than that of BSA, it is impossible that these bands will be loss during membrane transfer.

      Thank you for your valuable feedback. I completely agree that the loss of CobB and CobQ proteins during membrane transfer is unlikely due to their smaller molecular weight compared to BSA. As shown in Author response image 7, the bands for CobB and CobQ are detectable in the SDS-PAGE gel but not visible on the Coomassie R-350 stained PVDF membrane.

      Several factors could contribute to this issue. One possibility is that the detection sensitivity of Coomassie R-350 may be lower than that of the Coomassie R-250 used for the gel. Additionally, the Western blot results using an anti-His antibody further indicate low loading amounts of CobB and CobQ proteins on the PVDF membrane (Author response image 8). This suggests that the observed low levels may indeed be due to protein loss during the membrane transfer process, despite their relatively small size.

      Reviewer #3 (Recommendations for the authors):

      (1) I found Tables S1 and S2 in the revised manuscript. It is strange to me that the intensity of Kac-BSA+CobQ is zero, completely nothing. Typically, a portion of the acetylated peptide remains after the deacetylation reaction.

      Thank you for your observation. When we report an intensity of zero, it does not imply a complete absence of signal; rather, it indicates that the signal for the target peptide is below the detectable threshold. This is likely due to the minimum cut-off setting in the MaxQuant (MQ) software, which is determined by parameters like "peptide_mass_tolerance" (as discussed in MQ user groups online, though it may not be explicitly listed in the parameters file).

      In our study, we performed a deacetylase assay that demonstrated CobQ's rapid activity; for instance, it can deacetylate ICD-K388ac within just four minutes. This leads me to hypothesize that the CobQ + Kac-BSA sample may have undergone near-complete enzymatic hydrolysis during the reaction.

      Furthermore, Table S1 in the manuscript presents only a selection of the mass spectrometry results to illustrate CobQ's activity. In addition to the 15 acetylated peptides shown, 27 more peptides exhibit significantly reduced acetylation levels without reaching zero intensity. The overall acetylation level of BSA peptides incubated with CobQ is calculated to be only 0.13 times that of Kac-BSA (Diagnostic peak: yes, peptide score: >100, Localization probability: >0.95) (Author response image 9).

      Based on these findings, we believe our mass spectrometry results are reliable and effectively support our conclusions. Thank you for your understanding.

      Author response image 9.

      The intensities of all Kac peptides of Kac-BSA with or without AhCobQ incubation in LC MS/MS.

      (2) It would be better to provide the information about ArcA and ArcA-2 as mentioned in the authors' response. It would be helpful for readers to understand that they are different proteins.

      Thank you for your suggestion. In the A. hydrophila ATCC 7966 dataset, there are indeed two distinct proteins referred to as ArcA: ArcA-1, which functions as an aerobic respiration control protein, and ArcA-2, which acts as an arginine deiminase. Importantly, these two proteins do not share any sequence homology; they are only similarly named due to their acronyms. While we believe this distinction does not require extensive explanation in the current study, we appreciate your input. Additionally, in response to Reviewer 2’s feedback, we have decided to remove the Western blot result for ArcA-2 due to its poor quality in the updated manuscript.

      (3) Line 409-416. Despite my comment, the citation of related papers on ICD acetylation in E. coli is still missing.

      Thank you for your suggestion. It has been added and highlighted in red. (Venkat S, et al, 2018, 430(13): 1901-1911)

      (4) The image resolution of Figure 3C and 3D is still bad. I could not evaluate that Kac was exactly incorporated at the target site.

      Thank you for your feedback regarding the image resolution of Figures 3C and 3D. We have now displayed these figures with improved clarity, as you suggested.

      To further validate the reliability of our MS2 data, we employed Proteome Discoverer 2.4 (Thermo) to analyze the raw data and provide theoretical mass information. As shown in Author response images 10-13, the MS2 spectra and fragment match lists for both unmodified and acetylated peptides offer additional confirmation of the reliability of our mass spectrometry results.

      Author response image 10.

      MS2 spectrum of unmodified peptide using PD v2.4 software.

      Author response image 11.

      The theoretical mass of unmodified peptide by PD 2.4

      Author response image 12.

      MS2 spectrum of acetylated peptide using PD v2.4 software.

      Author response image 13.

      The theoretical mass of acetylated peptide by PD 2.4.

      (5) Again, in Figure 8D, the significance between ICD-Kac388 and ICD-Kac388+AhCobB should be shown to support the authors' conclusion that AhCobQ activates ICD by deacetylation at K388.

      Thank you for your suggestion. We have updated Figure 8D in the revised manuscript.

      (6) It was nice that the authors presented the mass spectrum data of ICD-K388 acetylation (Figure 2 in responding letter). However, the data did not convince me that K388 is acetylated. In the figure, two b-ion peaks are detected, 285.1557 and 386.2034, which may correspond to NK (theoretical mass, 260.15) and NKT (theoretical mass, 361.20) peptides, respectively. If K388 is acetylated, an increase in the mass of 42 should be observed, but the difference between the detected and theoretical mass is 25. I also could not understand what the peak of 126.0913 mass is, indicated with acK* in red.

      Thank you for your detailed observation. The data presented in the MS2 spectrum for ICD-K388 acetylation in Figure 2 of the previous response letter were generated using Proteome Discoverer 2.4 (PD, Thermo) to ensure accurate mass calculations. Similar to the results from MaxQuant, ICD-K388 was identified again (Author response image 14).

      Regarding the b-ion peaks you mentioned, the values 285.1557 and 386.2034 correspond to NK<sup>ac</sup> and NK<sup>ac</sup>T peptides, respectively. The theoretical masses for these peptides are as follows: NK<sup>ac</sup> (285.15 = 115.05020 + 128.095 + 42.01) and NK<sup>ac</sup>T (386.20 = NK<sup>ac</sup> + 101.04768). The differences between the theoretical and detected masses for the relevant b-ions (b2*-NK, b52+-NH3, and b3) are minimal, at 0.00 Da and 2.1 ppm, respectively, which is consistent with the incorporation of an NH3 group (Author response image 15).

      Author response image 14.

      The MS2 of ICD-K388 peptide by PD 2.4.

      Author response image 15.

      The theoretical mass of ICD-K388 peptide by PD 2.4.

      The peak at 126.0913 m/z, indicated as acK*, represents immonium ions of ε-N-acetyllysine, which are generated during the fragmentation of acetyllysine. This diagnostic ion is widely recognized as a marker for identifying acetylated peptides (Nakayasu et al., A method to determine lysine acetylation stoichiometries. International Journal of Proteomics. 2014;2014(1):730725; Trelle et al., Utility of immonium ions for assignment of ε-N-acetyllysine-containing peptides by tandem mass spectrometry. Analytical Chemistry. 2008;80(9):3422-30). Additionally, it is a default parameter in MaxQuant for identifying Kac peptides (Author response image 16).
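
      The arithmetic behind these fragment assignments can be checked independently. The following is a minimal sketch, not part of the Proteome Discoverer or MaxQuant workflow: it assumes standard monoisotopic residue masses and singly charged fragments, and the helper names (`b_ion_mz`, `ppm_error`) are hypothetical.

```python
# Monoisotopic masses in Da. These are standard textbook values assumed for
# this sketch; they are not read from the Proteome Discoverer output.
RESIDUE = {"N": 114.042927, "K": 128.094963, "T": 101.047679}
ACETYL = 42.010565   # mass added by lysine acetylation (C2H2O)
PROTON = 1.007276
CO = 27.994915
NH3 = 17.026549


def b_ion_mz(residues, n_acetyl=0):
    """m/z of a singly charged b-ion: residue masses + acetyl groups + one proton."""
    return sum(RESIDUE[r] for r in residues) + n_acetyl * ACETYL + PROTON


def ppm_error(observed, theoretical):
    """Relative mass error in parts per million."""
    return (observed - theoretical) / theoretical * 1e6


b2_plain = b_ion_mz("NK")              # b2 without acetylation
b2_ac = b_ion_mz("NK", n_acetyl=1)     # b2 with acetylated lysine
b3_ac = b_ion_mz("NKT", n_acetyl=1)    # b3 with acetylated lysine

# Immonium ion of acetyllysine (residue - CO + H+), then loss of NH3,
# which gives the diagnostic ion labelled acK*.
immonium_ack = RESIDUE["K"] + ACETYL - CO + PROTON
diagnostic_ion = immonium_ack - NH3

print(f"b2 NK (unmodified): {b2_plain:.4f}")       # 243.1452
print(f"b2 NK(ac):          {b2_ac:.4f}")          # 285.1557
print(f"b3 NK(ac)T:         {b3_ac:.4f}")          # 386.2034
print(f"acK* diagnostic:    {diagnostic_ion:.4f}")  # 126.0913
print(f"b2 error vs 285.1557: {ppm_error(285.1557, b2_ac):.2f} ppm")
```

      Under these assumptions, the 115.05020 term in the sum quoted above corresponds to the asparagine residue mass plus one proton, and the unmodified b2 ion would sit about 42.01 Da lower than the detected 285.1557 peak.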

      Based on these findings, we believe the evidence supporting ICD-K388 acetylation is robust.

      Author response image 16.

      The default parameter for Kac peptide identification in MaxQuant v1.6 software.

      (7) As mentioned by other reviewers, some of the figures and tables are incomplete. Some panels (ex. Figure 7C and 7D) and explanations (ex. What are lanes 1, 2, and 3 in Figure S3) are still missing.

      Thank you for your suggestion. It has been added.

    1. Author Response

      The following is the authors’ response to the previous reviews

      We would like to thank you again for your thorough review of the manuscript. We have taken all comments into account in the revised version of the manuscript. Please find below our detailed responses to your comments.

      Reviewing Editor

      The manuscript has been improved, but there are some remaining issues that need to be addressed, as outlined in the reviewers' comments. In particular, please pay attention to Figures 1A and 2A as they appear to be the same. Moreover, the original gel images for Western blots should be made available given the concerns raised by Reviewer #1.

      Thank you for your recommendations. We have carefully considered all comments and made the requested revisions to improve the manuscript.

      Reviewer #1 (Public Review):

      In this manuscript, the authors aimed to compare, from testis tissues at different ages from mice in vivo and after culture, multiple aspects of Leydig cells. These aspects included mRNA levels, proliferation, apoptosis, steroid levels, protein levels, etc. A lot of work was put into this manuscript in terms of experiments, systems, and approaches. The technical aspects of this work may be of interest to labs working on the specific topics of in vitro spermatogenesis for fertility preservation.

      Second review:

      The authors should be commended for substantial improvement in their manuscript for resubmission.

      Thank you very much for this second review and your help to improve this manuscript.

      Recommendations For The Authors:

      Going forward, the authors would be well-served to put a similar amount of effort on first drafts as well, which would both increase reviewer enthusiasm and reduce reviewer workload to document all the deficiencies! Abstract is much improved, and clearly articulates the point of the study.

      We are very grateful for all your constructive comments, which have greatly contributed to the improvement of our manuscript.

      1) 54 - replace "could be" with was

      “could be” was replaced by “was”

      2) 75 - delete "being"

      “being” was deleted.

      3) 103 - would say "indirectly promotes" since Rhox5 is a transcription factor that presumably activates genes in Sertoli cells whose products then affect neighboring germ cells, either by direct action or by influencing Sertoli cell behavior changes

      “indirectly” was added in the sentence.

      4) 139, 155, elsewhere - haven't seen dpp italicized before, certainly not the norm

      In dpp (days post-partum), “pp” is italicized as it is a Latin word.

      5) 265 - delete "found"

      “found” was deleted.

      6) 263-273 - Is the CYP19 protein referred to encoded by the Cyp19a1 gene (line 263)? Should standardize nomenclature...

      The CYP19 protein (aromatase) is indeed encoded by the Cyp19a1 gene. The nomenclature was standardized: “CYP19” was replaced by “CYP19A1” in the entire manuscript.

      7) 280 - "homolog" doesn't seem like the right word, as it has a very specific meaning with regards to the evolutionary genetic relatedness of genes. Maybe analog?

      “homolog” was replaced by “analog”.

      8) 306 - would reword to something like "proportions of seminiferous tubules containing round and elongating spermatids" - because the tubules don't reach spermatid stages

      This sentence was reworded as suggested.

      9) 310 - delete "resulted in", unnecessary

      “resulted in” was deleted.

      10) Why are the images shown in Figures 1A and 2A the same? That seems odd - was that intentional? Curious overall why the data is presented in such a way that it's done twice...

      We mistakenly presented the same immunofluorescence images twice. The duplicate images have been removed. In the revised manuscript, Figure 1A shows 3β-HSD immunofluorescence staining in cultures of fresh testicular tissues and in their in vivo counterparts, while Figure 1 – figure supplement 1A (not Figure 2A) shows 3β-HSD immunofluorescence staining in cultures of frozen/thawed testicular tissues.

      11) In all the western blots, the cropping is done awfully close to the bands - why is this? Can full gels be shown in a Supplement? And especially in the westerns in Fig. 5C, esp for CYP17A1, the cropping is unacceptable. This reviewer is wondering whether this is an oversight, or whether there is another band below that one that is being masked? Again, should show whole blot for transparency and to ensure Rigor and Reproducibility.

      Full gels are shown in Supplementary File 2. For CYP17A1, we have shown that only one band of the expected molecular weight is obtained with the antibody (see Author response image 1 below). After this verification, the nitrocellulose membranes were cut at the 55 kDa molecular weight band in order to reveal CYP17A1 expression in the upper part of the membranes and the protein used for normalization in the lower part of the membranes.

      Author response image 1.

      12) For all figures, wondering why the font sizes are so disparate? This will need to be addressed before publication so it looks more professional.

      All figures have been reworked as requested.

      Reviewer #3 (Public Review):

      Moutard, Laura, et al. investigated the gene expression and functional aspects of Leydig cells in a cryopreservation/long-term culture system. The authors found that critical genetic markers for Leydig cells were diminished when compared to the in-vivo testis. The testis also showed less androgen production and androgen responsiveness. Although they did not produce normal testosterone concentrations in basal media conditions, the cultured testis still remained highly responsive to gonadotrophin exposure, exhibiting a large increase in androgen production. Even after the hCG-dependent increase in testosterone, genetic markers of Leydig cells remained low, which means there is still a missing factor in the culture media that facilitates proper Leydig cell differentiation. Optimizing this testis culture protocol to help maintain proper Leydig cell differentiation could be useful for future human testis biopsy cultures, which would help preserve fertility in childhood cancer patients.

      Overall, the authors addressed most comments and questions from the previous review. The additional data regarding the necrotic area is helpful for interpreting the quality of the cultures. The authors did not conduct multiple-comparison tests even though multiple comparisons were made for a single dependent variable (Fig 2J, Fig 3F, among many others); however, adding a multiple-comparison correction is unlikely to change the conclusions of the paper or the figures and is thus a minor technical detail in this case.

      Thank you very much for this second review and your help to improve this manuscript.

    1. Author response:

      The following is the authors’ response to the previous reviews

      We have thoroughly addressed all the reviewers’ comments and meticulously revised the manuscript. Key modifications include the following:

      (a) Organizing the Logic and Highlighting Key Findings: We have revised the manuscript to emphasize key findings (especially the distinctions between the SEC and WOI groups) according to the following logic: constructing a receptive endometrial organoid, comparing its molecular characteristics with those of the receptive endometrium, highlighting its main features (hormone response, enhanced energy metabolism, ciliary assembly and motility, epithelial-mesenchymal transition), and exploring the function involved in embryo interaction.

      (b) Clarity and Better Description of Bioinformatic Analyses: We have revised the sections involving bioinformatic analyses to provide a more streamlined and comprehensible explanation. Instead of overwhelming the reader with excessive details, we focused on the most important findings, and performed additional experimental validation.

      (c) Rationale for Gene Selection: We have clarified the rationale for selecting certain genes and pathways for inclusion in the analysis and manuscript. The associated gene expression data for all figures have been provided in the attached Dataset.

      (d) In the response letter, we have provided a detailed presentation of the methodological optimization for constructing these endometrial assembloids, along with the optimization and comparison of endometrial organoid culture media. Furthermore, in the Limitations section, we have explicitly stated that stromal cells and immune cells gradually diminish with increasing passage numbers. Therefore, this study primarily utilized endometrial assembloids within the first three passages for all investigations.

      Below, we provide a point-by-point response to each comment, with all modifications highlighted in the revised manuscript. We respectfully hope that these revisions effectively address the concerns raised by the reviewers.

      Public Reviews:

      Reviewer #1 (Public Review):

      This study generated 3D cell constructs from endometrial cell mixtures that were seeded in the Matrigel scaffold. The cell assemblies were treated with hormones to induce a "window of implantation" (WOI) state. The authors did their best to revise their study according to the reviewers' comments. However, the study remains unconvincing and at the same time too dense and not focused enough.

      (1) The use of the term organoids is still confusing and should be avoided. Organoids are epithelial tissue-resembling structures. Hence, the multiple-cell aggregates developed here are rather "coculture models" (or "assembloids"). It is still unexpected (unlikely) that these structures containing epithelial, stromal and immune cells can be robustly passaged in the epithelial growth conditions used. All other research groups developing real organoids from endometrium have shown that only the epithelial compartment remains in culture at passaging (while the stromal compartment is lost). If the authors keep to their idea, they should perform scRNA-seq on both early and late (passage 6-10) "organoids". And they should provide details of culturing/passaging/plating etc. that differ from those of other groups and might explain why they keep stromal and immune cells in their culture for such a long time. In other words, they should then compare their method in detail to the standard method of all other researchers in the field, and show the differences in survival and growth of the stromal and immune cells.

      (1) We appreciate your feedback and have revised the term 'organoids' to 'assembloids'.

      (2) I. Due to budget constraints, this study did not perform scRNA-seq on both early and late passages (P6-P10). Instead, immunofluorescence staining confirmed the persistence of stromal cells at passage 6 (as shown below).

      Author response image 1.

      Whole-mount immunofluorescence showed that Vimentin+ F-actin+ cells (stromal cells) were arranged around the glandular spheres that were only F-actin+ (passage 6).

      II. Improvements in this study include the following.

      a. Optimization of endometrial tissue processing: The procedures for tissue collection, pretreatment, digestion, and culture were refined to maximize the retention of endometrial epithelial cells, stromal cells, and immune cells (detailed optimizations are provided in Response Table 1).

      b. Enhanced culture medium formulation: Based on previous protocols, WNT3A was added to promote organoid development and differentiation (PMID: 27315476), while FGF2 was supplemented to improve stromal cell survival (PMID: 35224622) (see Response Table 2 for medium comparisons). Representative culture outcomes are shown in the figure below.

      We acknowledge that the stromal and immune cells in this system still exhibit differences compared to their in vivo counterparts. In this study, we utilized the first three passages, which offer optimal cell diversity and viability, to meet experimental needs. However, replicating and maintaining the full complexity of endometrial cell types in vitro remains a major challenge in the field—one that we are actively working to address.

      Author response table 1.

      Methodological Optimization of Endometrial Organoids (Construction, Passaging, and Cryopreservation)

      Author response table 2.

      Optimization and comparison of endometrial organoid culture media

      Author response image 2.

      Bright-field microscopy captures the expansion of glands and surrounding stromal cells across passages 0 to 2 (scale bar = 200 μm) (Yellow arrows: stromal cells; White arrows: glands).

      (2) The paper is still much too dense, touching upon all kind of conclusions from the manifold bioinformatic analyses. The latter should be much clearer and better described, and then some interesting findings (pathways/genes) should be highlighted without mentioning every single aspect that is observed. The paper needs a lot of editing to better focus and extract take-home messages, not bombing the reader with a mass of pathways, genes etc which makes the manuscript just not readable or 'digest-able'. There is no explanation whatever and no clear rationale why certain genes are included in a list while others are not. There is the impression that mass bioinformatics is applied without enough focus.

      Thanks for your suggestions. We have made improvements and revisions in the following areas:

      (a) Clarity and Better Description of Bioinformatic Analyses: We have revised the sections involving bioinformatic analyses to provide a more streamlined and comprehensible explanation. Instead of overwhelming the reader with excessive details, we focused on the most important findings.

      (b) Organizing the Logic and Highlighting Key Findings: We have revised the manuscript to emphasize key findings according to the following logic: constructing a receptive endometrial organoid, comparing its molecular characteristics with those of the receptive endometrium, highlighting its main features (hormone response, enhanced energy metabolism, ciliary assembly and motility, epithelial-mesenchymal transition), and exploring the function involved in embryo interaction.

      (c) Rationale for Gene Selection: We have clarified the rationale for selecting certain genes and pathways for inclusion in the analysis and manuscript.

      We hope these revisions address your concerns and improve the overall quality and clarity of the manuscript. Thank you once again for your valuable input.

      (3) The study is much too descriptive and does not show functional validation or exploration (except glycogen production). Some interesting findings extracted from the bioinformatics must be functionally tested.

      Thanks for your suggestions. We have restructured the logic and revised the manuscript, incorporating functional validation. The focus is on the following points: highlighting its main features (hormone response, enhanced energy metabolism, ciliary assembly and motility, epithelial-mesenchymal transition), and exploring the functions involved in embryo interaction.

      (4) In contrast to what was found in vivo (Wang et al. 2020), no abrupt change in gene expression pattern is mentioned here from the (early-) secretory to the WoI phase. Should be discussed. Although the bioinformatic analyses point into this direction, there are major concerns which must be solved before the study can provide the needed reliability and credibility for revision.

      To further investigate the abrupt change, the Mfuzz algorithm was used to analyze gene expression across the three groups, focusing on gene clusters that were progressively upregulated or downregulated. Mitochondrial and cilia-related genes exhibited the highest expression levels in WOI endometrial organoids, whereas genes related to cell junctions and negative regulation of cell differentiation were downregulated (Figure 4A).
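
      Mfuzz is an R/Bioconductor package that performs soft clustering of standardized expression profiles with fuzzy c-means. The sketch below is a plain-Python illustration of that underlying idea, not the Mfuzz implementation and not the analysis run for the manuscript. The toy profiles over three conditions (CTRL, SEC, WOI) are invented, and the fuzzifier `m=2.0`, the iteration count, and the function names are assumptions made for illustration.

```python
import math
import random


def standardize(profile):
    """Scale one gene's profile to mean 0 and sample standard deviation 1."""
    mean = sum(profile) / len(profile)
    sd = math.sqrt(sum((x - mean) ** 2 for x in profile) / (len(profile) - 1))
    return [(x - mean) / sd for x in profile]


def fuzzy_cmeans(data, n_clusters, m=2.0, n_iter=50, seed=0):
    """Basic fuzzy c-means: returns cluster centers and per-gene memberships."""
    rng = random.Random(seed)
    # Random initial memberships, each row normalized to sum to 1.
    memberships = []
    for _ in data:
        row = [rng.random() + 1e-9 for _ in range(n_clusters)]
        memberships.append([v / sum(row) for v in row])
    centers = []
    for _ in range(n_iter):
        # Update centers as membership-weighted means of the profiles.
        centers = []
        for k in range(n_clusters):
            weights = [u[k] ** m for u in memberships]
            total = sum(weights)
            centers.append([
                sum(w * p[d] for w, p in zip(weights, data)) / total
                for d in range(len(data[0]))
            ])
        # Update memberships from inverse distances to each center.
        for i, p in enumerate(data):
            inv = [max(math.dist(p, c), 1e-12) ** (-2.0 / (m - 1.0)) for c in centers]
            memberships[i] = [v / sum(inv) for v in inv]
    return centers, memberships


# Invented toy profiles over three conditions (CTRL, SEC, WOI).
rising = [[1, 2, 3], [2, 4, 6], [5, 6, 7]]      # progressively upregulated
falling = [[3, 2, 1], [9, 6, 3], [7, 6, 5]]     # progressively downregulated
data = [standardize(p) for p in rising + falling]

centers, memberships = fuzzy_cmeans(data, n_clusters=2)
labels = [max(range(2), key=lambda k: u[k]) for u in memberships]
print(labels)  # rising genes share one label, falling genes the other
```

      Because every gene receives a graded membership in each cluster rather than a single hard label, genes with ambiguous trends can be filtered out by a membership threshold before interpreting the progressively up- or downregulated clusters.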

      (5) All data should be benchmarked to the Wang et al 2020 and Garcia-Alonso et al. 2021 papers reporting very detailed scRNA-seq data, and not only the Stephen R. Quake 2020 paper.

      We appreciate your suggestion. By integrating data from Garcia-Alonso et al. (2021) (shown in the figure below), we observed that both WOI organoids and SEC organoids exhibit increased glandular secretory epithelium and developed ciliated epithelium, mirroring features of mid-secretory endometrium. Comparison with the two papers yields parallel findings.

      Author response image 3.

      UMAP visualization of integrated scRNA-seq data (our dataset and Garcia-Alonso et al. 2021) showing: (A) cell types, (B) WOI-org, (C) CTRL-org, (D) SEC-org versus published mid-secretory samples.

      (6) Fig. 2B: Vimentin staining is not at all clear. F-actin could be used to show the typical morphology of the stromal cells?

      We appreciate your suggestion. We performed F-actin staining in addition to Vimentin staining and found that Vimentin+ F-actin+ cells (stromal cells) were arranged around the glandular spheres that were only F-actin+.

      (7) Where does the term "EMT-derived stromal cells" come from? On what basis has this term been coined?

      Within endometrial biology, stromal cells in the transition from epithelial to mesenchymal phenotype are specifically referred to as 'stromal EMT transition cells' (PMID: 39775038, PMID: 39968688).

      In certain cancers or fibrotic diseases, epithelial cells can transition into a mesenchymal phenotype, contributing to the stromal environment that supports tumor growth or tissue remodeling (PMID: 20572012).

      (8) CD44 is shown in Fig. 2D but the text mentions CD45 (line 159)?

      In Fig 2D, T cells are defined as a cluster of CD45+CD3+ cells, further subdivided into CD4+ and CD8+ T cells based on their expression of CD4 and CD8. This figure does not include data on CD44.

      (9) All quantification experiments (of stainings etc) should be in detail described how this was done. It looks very difficult (almost not feasible) when looking at the provided pictures to count the stained cells.

      a. Manual Measurement:

      For TEM-observed pinopodes, glycogen particles, microvilli, and cilia, manual region-of-interest (ROI) selection was performed using ImageJ software for quantitative analysis of counts, area, and length. Twenty randomly selected images per experimental group were analyzed for each morphological parameter.

      b. Automated Measurement:

      Fluorescence images were quantified using ImageJ. First, images were preprocessed by adjusting brightness and contrast and removing background noise with the “Subtract Background” feature.

      Second, a threshold was set to highlight the cells, and regions of interest (ROIs) were selected with the selection tools. Third, cells were counted via Analyze > Analyze Particles. To measure fluorescence intensity and area, the “Measurement” options were set to mean gray value. Parameters were adjusted as needed, results were read from the “Results” window, and the data were saved for further analysis, with the same settings applied to all measurements for consistency.

      For 3D fluorescence quantification, ZEN software (Carl Zeiss) was exclusively used, with 11 images analyzed per experimental group. This part has been incorporated into “Supporting Information” Line 94-100.

      c. Normalization Method:

      For fluorescence quantification, DAPI was used as an internal reference for normalization, where both DAPI and target fluorescence channel intensities were quantified simultaneously. The normalized target signal intensity (target/DAPI ratio) was then compared across experimental groups. A minimum of 15 images were analyzed for each parameter per group. This part has been incorporated into “Supporting Information” Line 101-104.
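
      The normalization described above amounts to dividing the mean gray value of the target channel by that of the DAPI channel for each image, then averaging the ratios within a group. A minimal sketch follows, assuming each channel is available as a 2-D grid of pixel intensities. The tiny arrays are invented placeholders rather than measured data, and the function names are hypothetical; they are not ImageJ or ZEN calls.

```python
from statistics import mean


def mean_gray_value(image):
    """Mean pixel intensity of a 2-D image given as rows of numbers."""
    return mean(px for row in image for px in row)


def normalized_intensity(target, dapi):
    """Target-channel mean gray value divided by the DAPI mean gray value."""
    dapi_mean = mean_gray_value(dapi)
    if dapi_mean == 0:
        raise ValueError("DAPI channel has zero mean intensity")
    return mean_gray_value(target) / dapi_mean


def group_mean(image_pairs):
    """Average normalized intensity over (target, dapi) pairs of one group."""
    return mean(normalized_intensity(t, d) for t, d in image_pairs)


# Invented placeholder images: one (target, DAPI) pair per group.
ctrl_pairs = [([[10, 20], [30, 40]], [[50, 50], [50, 50]])]
woi_pairs = [([[40, 60], [80, 100]], [[100, 100], [100, 100]])]

print(group_mean(ctrl_pairs))  # 0.5
print(group_mean(woi_pairs))   # 0.7
```

      Taking the ratio per image before averaging keeps differences in cell density or exposure between images from dominating the group comparison.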

      (10) Fig. 3C: it is unclear how quantification can be reliably done. Moreover, OLFM4 looks positive in all cells of Ctrl, but authors still see an increase?

      (a) Fluorescence images were quantitatively analyzed using ImageJ by measuring the mean gray values. For normalization, DAPI staining served as an internal reference, with simultaneous measurement of mean gray values in both the target fluorescence channel and the DAPI channel. The relative fluorescence intensity was then calculated as the ratio of target channel to DAPI signal for inter-group quantitative comparisons.

      (b) OLFM4 is an E2-responsive gene. Its expression in endometrial organoids of the CTRL group is physiologically normal (PMID: 31666317). However, its fluorescence intensity (quantified as mean gray value) was significantly stronger in both the SEC and WOI groups compared to the CTRL group (quantitative method as described above).

      (11) Fig. 3F: Met is downregulated which is not in accordance with the mentioned activation of the PI3K-AKT pathway.

      We appreciate your careful review. Our initial description was imprecise. In the revised manuscript, this statement has been removed entirely.

      (12) Lines 222-226: transcriptome and proteome differences are not significant; so, how meaningful are the results then? Then, it is very hard to conclude an evolution from secretory phase to WoI.

      We appreciate your feedback. The manuscript has been comprehensively revised, and the aforementioned content has been removed.

      (13) WoI organoids show an increased number of cilia. However, some literature shows the opposite, i.e. less ciliated cells in the endometrial lining at WoI (to keep the embryo in place). How to reconcile?

      Thank you for raising this question. We conducted a statistical analysis of the proportion of ciliated cells across endometrial phases.

      (a) Based on the 2020 study by Stephen R. Quake and Carlos Simon’s team published in Nature Medicine (PMID: 32929266), the mid-secretory phase (Days 19–23) exhibited a higher proportion of ciliated cells compared to the early-secretory (Days 15–18) and late-secretory phases (Days 24–28) (Fig. R13 A).

      (b) According to the 2021 study by Roser Vento-Tormo’s team in Nature Genetics, ciliated cell abundance peaked in the early-to-mid-secretory endometrium across all phases (Fig. R13 B-C).

      Data were sourced from the Reproductive Cell Atlas.

      (14) How are pinopodes distinguished from microvilli? Moreover, Fig. 3 does not show the typical EM structure of cilia.

      Thank you for this insightful question.

      (a) Pinopodes are large, bulbous protrusions with a smooth apical membrane. Under transmission electron microscopy (TEM), it can be observed that the pinopodes contain various small particles, which are typically extracellular fluid and dissolved substances.

      Microvilli are elongated, finger-like projections that typically exhibit a uniform and orderly arrangement, forming a "brush border" structure. Under transmission electron microscopy, dense components of the cytoskeleton, such as microfilaments and microtubules, can be seen at the base of the microvilli.

      (b) You may refer to the ciliated TEM structure shown in the current manuscript's Fig. 2E (originally labeled as Fig. 2H in the draft). The cilium is composed of microtubules. The cross-section shows that the periphery of the cilium is surrounded by nine pairs of microtubules arranged in a ring. The longitudinal section shows that the cilium has a long cylindrical structure, with the two central microtubules being quite prominent and located at the center of the cilium.

      (15) There is a recently published paper demonstrating another model for implantation. This paper should be referenced as well (Shibata et al. Science Advances, 2024).

      Thanks for your valuable comments. We have cited this reference in the manuscript at Line 77-78.

      (16) Line 78: two groups were the first here (Turco and Boretto) and should both be mentioned.

      Thanks for your valuable comments. We have cited these references in the manuscript at Line 74-76.

      (17) Line 554: "as an alternative platform" - alternative to what? Authors answer reviewers' comments by just changing one word, but this makes the text odd.

      Thank you for your review. Here, we propose that this WOI organoid serves as an alternative research platform for studying endometrial receptivity and maternal-fetal interactions, compared to current secretory-phase organoids. In the revised manuscript, we have supplemented the data by co-culturing this WOI organoid with blastoids, demonstrating its robust embryo implantation potential.

      Reviewer #2 (Public Review):

      In this research, Zhang et al. have pioneered the creation of an advanced organoid culture designed to emulate the intricate characteristics of endometrial tissue during the crucial Window of Implantation (WOI) phase. Their method involves the incorporation of three distinct hormones into the organoid culture, coupled with additives that replicate the dynamics of the menstrual cycle. Through a series of assays, they underscore the striking parallels between the endometrial tissue present during the WOI and their crafted organoids. Through a comparative analysis involving historical endometrial tissue data and control organoids, they establish a system that exhibits a capacity to simulate the intricate nuances of the WOI. The authors made a commendable effort to address the majority of the statements. Developing an endometrial organoid culture methodology that mimics the window of implantation is a game-changer for studying the implantation process. However, the authors should strive to enhance the results to demonstrate how different WOI organoids are from SEC organoids, ensuring whether they are worth using in implantation studies, or a proper demonstration using implantation experiments.

      Thank you for your valuable suggestions. The WOI organoids differ from SEC organoids in the following aspects.

      (1) Structurally, WOI endometrial organoids exhibit subcellular features characteristic of the implantation window: densely packed pinopodes on the luminal side of epithelial cells, abundant glycogen granules, elongated and tightly arranged microvilli, and increased cilia (Figure 2F).

      (2) At the molecular level, WOI organoids show enlarged and functionally active mitochondria, enhanced ciliary assembly and motility, and single-cell signatures resembling mid-secretory endometrium.

      Specifically, mitochondrial- and cilia-related genes/proteins are most highly expressed in WOI organoids (Figure 4A,B). TEM analysis revealed that WOI organoids have the largest average mitochondrial area (Figure 4C). Mitochondrial-related genes display an increasing trend across the three organoid groups, and WOI organoids produce more ATP and IL-8 (Figure 4D,E).

      For cilia, WOI organoids upregulate genes/proteins involved in ciliary assembly, basal bodies, and motile cilia, while downregulating non-motile cilia markers (Figure 5A-C).

      Single-cell analysis further confirms that WOI organoids recapitulate mid-secretory endometrial traits in mitochondrial metabolism and cell adhesion (Figure 2G).

      (3) Functionally, WOI organoids demonstrate superior embryo implantation potential. Given the scarcity and ethical constraints of human embryos, we used blastoids for implantation assays (Figure 6A). These blastoids successfully grew within endometrial organoids, established interactions (Figure 6B), and exhibited normal trilineage differentiation (epiblast: OCT4; hypoblast: GATA6; trophoblast: KRT18) (Figure 6C). WOI organoids achieved significantly higher blastoid survival (66% vs. 19% in CTRL and 28% in SEC) and interaction rates (90% vs. 47% in CTRL and 53% in SEC), confirming their robust embryo-receptive capacity (Figure 6D,E).

      Recommendations for the authors:

      Reviewer #1 (Recommendations For The Authors):

      In conclusion, it is needed to first meet all the concerns of the reviewers and then submit an appropriately adapted and comprehensive paper (also showing the robustness of the "organoids" and functionality of the findings) instead of this still fully descriptive paper. Further comments are included in the rebuttal document of the authors and will be provided by the editor as PDF.

      Reviewer #2 (Recommendations For The Authors):

      The authors made a good effort to reply to all the statements. However, there are some points that the authors need to address.

      • There is an inconsistency in the manuscript regarding the number of passages in which the organoids are used; in the response to the reviewers, it mentions 5 passages, while in the Materials and Methods section, it states 3 passages.

      We sincerely appreciate your thorough review. In this study, organoids within the first three passages were used. To address the reviewer's question comprehensively, we have now provided a detailed account of the organoid passage history in our response.

      • We agree that the difference between SEC and WOI organoids may be subtle, but in response to this, the authors should explain what they mean by "the most notable differences lie in the more comprehensive differentiation and varied cellular functions exhibited by WOI organoids..."

      In the original manuscript, this statement indicated that, at the single-cell level, WOI endometrial organoids exhibited more functionally mature and thoroughly differentiated characteristics compared to SEC endometrial organoids (See details below).

      In the revised version, we have restructured this section to focus on the following aspects: hormone response, energy metabolism, ciliary assembly and motility, epithelial-mesenchymal transition, and embryo implantation potential. Consequently, the statement "the most notable differences lie in the more comprehensive differentiation and varied cellular functions exhibited by WOI organoids..." has been removed.

      (1) Varied cellular functions:

      a. Secretory Epithelium: Compared to SEC organoids, WOI organoids exhibit enhanced peptide metabolism and mitochondrial energy metabolism in their secretory epithelium, supporting endometrial decidualization and embryo implantation (Figure 3F).

      b. Proliferative Epithelium: Compared to SEC organoids, WOI organoids demonstrate enhanced GTPase activity, angiogenesis, cytoskeletal assembly, cell differentiation, and RAS protein signaling in their proliferative epithelium (Figure S2G).

      c. Ciliated Epithelium: The ciliated epithelium of WOI endometrial organoids is associated with the regulation of vascular development and exhibits higher transcriptional activity compared to SEC organoids (Figure 5E).

      d. Stromal Cells: Compared to SEC organoids, WOI organoids exhibit enhanced cell junctions, cell migration, and cytoskeletal regulation in EMT-derived stromal cells (Figure S4A right panel). Similarly, cell junctions are also strengthened in stromal cells (Figure S4A left panel).

      (2) Comprehensive differentiation:

      a. Compared to SEC organoids, WOI organoids exhibit more complete differentiation from proliferative epithelium to secretory epithelium (Figure 3G).

      b. The WOI organoids demonstrate more robust ciliary differentiation compared to SEC organoids (Figure 5F).

      c. The proliferative epithelium progressively differentiates into EMT-derived cells. Compared to SEC organoids, WOI organoids are predominantly localized at the terminal end of the differentiation trajectory, indicating more complete differentiation (Figure S4B).

      • What do the authors mean by "average intensity" when referring to the extra reagents added to the WOI? The results that the authors show in response to Reviewer 2's Q1 must be included as part of the results and explain how it was done in the materials and methods section.

      This parameter indicates the growth status of organoids. It measures the gray value of organoids through long-term live-cell tracking. When organoids undergo apoptosis, they progressively condense into denser solid spheres, leading to an increase in gray value (average intensity). This content has been incorporated into the Results section (Line 129) and is further explained in the Supporting Information "Materials and Methods" (Lines 70-77).
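      The gray-value measure described above can be sketched as follows. This is only an illustrative sketch, assuming that each live-cell frame is available as a 2D grayscale array and that a boolean segmentation mask of the organoid has already been obtained; the function names `average_intensity` and `intensity_trace` are hypothetical and do not come from the manuscript or from any imaging software.

```python
import numpy as np


def average_intensity(frame, mask):
    """Mean gray value of the pixels inside one organoid mask.

    `frame` is a 2D grayscale image and `mask` a boolean array of the same
    shape (hypothetical inputs; the segmentation step is assumed to exist).
    """
    frame = np.asarray(frame, dtype=float)
    mask = np.asarray(mask, dtype=bool)
    if frame.shape != mask.shape:
        raise ValueError("frame and mask must have the same shape")
    if not mask.any():
        raise ValueError("mask selects no pixels")
    return float(frame[mask].mean())


def intensity_trace(frames, masks):
    """Average intensity of one tracked organoid across successive frames."""
    return [average_intensity(f, m) for f, m in zip(frames, masks)]
```

      Under the interpretation given in the response, a trace that rises over time would point towards condensation of the organoid; any threshold separating healthy from apoptotic organoids would have to be calibrated on real data.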

      • In panel 1C, it is not possible to see the stromal cells around because they are brightfield images.

      You are partly right. Bright-field images alone indeed make it difficult to distinguish stromal cells. However, by combining whole-mount immunofluorescence staining with the characteristic elongated spindle-shaped morphology of stromal cells, we were able to roughly determine their distribution in the bright-field images.

      • Responding to Reviewer 2's question Q7, the authors indicate how they establish the cluster. However, they do not specify whether they extrapolate the data from a database or create the cluster themselves based on the literature. It should be stated from which classification list (or classification database) the extrapolation has been made.

      Within endometrial biology, stromal cells in the transition from epithelial to mesenchymal phenotype are specifically referred to as 'stromal EMT transition cells' (PMID: 39775038, PMID: 39968688).

      In certain cancers or fibrotic diseases, epithelial cells can transition into a mesenchymal phenotype, contributing to the stromal environment that supports tumor growth or tissue remodeling (PMID: 20572012).

      • Regarding Reviewer 2's question Q8, if the authors have not been able to make comparisons with, at least, SEC organoids, unfortunately, the ERT loses much of its strength and should not serve as support.

      We agree with you at this point. These results have been moved to the supplementary figures.

      • If the differences in the transcriptome and proteome between SEC and WOI organoids are not significant, the result does not support the authors' model. If there are barely any differences at the proteome and transcriptome level between SEC and WOI organoids, why would anyone choose to use their model over SEC organoids?

      We sincerely appreciate your valuable feedback. In this revised manuscript, we have further integrated transcriptomic and proteomic analyses, revealing that WOI organoids exhibit enlarged and functionally active mitochondria, along with enhanced cilia assembly and motility compared to SEC organoids. Additionally, using a blastoid model, we demonstrated that WOI organoids possess superior embryo implantation potential, significantly outperforming SEC organoids. Our research group aims to develop an embryo co-culture model. Through systematic comparisons of structural, molecular, and co-culture characteristics between SEC and WOI organoids, we ultimately confirmed the superior performance of WOI organoids.

      • SEC and WOI organoids must be different enough to establish a new model, and the authors do not demonstrate that they are.

      Thank you for your valuable feedback. In the revised manuscript, we have emphasized the distinctions between SEC and WOI organoids in terms of structure, molecular characteristics, and functionality (co-culture with blastoid), as detailed below.

      (1) Structurally, WOI endometrial organoids exhibit subcellular features characteristic of the implantation window: densely packed pinopodes on the luminal side of epithelial cells, abundant glycogen granules, elongated and tightly arranged microvilli, and increased cilia (Figure 2F).

      (2) At the molecular level, WOI organoids show enlarged and functionally active mitochondria, enhanced ciliary assembly and motility, and single-cell signatures resembling mid-secretory endometrium.

      Specifically, mitochondrial- and cilia-related genes/proteins are most highly expressed in WOI organoids (Figure 4A,B). TEM analysis revealed that WOI organoids have the largest average mitochondrial area (Figure 4C). Mitochondrial-related genes display an increasing trend across the three organoid groups, and WOI organoids produce more ATP and IL-8 (Figure 4D,E).

      For cilia, WOI organoids upregulate genes/proteins involved in ciliary assembly, basal bodies, and motile cilia, while downregulating non-motile cilia markers (Figure 5A-C).

      Single-cell analysis further confirms that WOI organoids recapitulate mid-secretory endometrial traits in mitochondrial metabolism and cell adhesion (Figure 2G).

      (3) Functionally, WOI organoids demonstrate superior embryo implantation potential. Given the scarcity and ethical constraints of human embryos, we used blastoids for implantation assays (Figure 6A). These blastoids successfully grew within endometrial organoids, established interactions (Figure 6B), and exhibited normal trilineage differentiation (epiblast: OCT4; hypoblast: GATA6; trophoblast: KRT18) (Figure 6C). WOI organoids achieved significantly higher blastoid survival (66% vs. 19% in CTRL and 28% in SEC) and interaction rates (90% vs. 47% in CTRL and 53% in SEC), confirming their robust embryo-receptive capacity (Figure 6D,E).

      • Regarding Q16, Boretto et al. 2017 and Turco et al. 2017 also manage to isolate stromal cells, but they lose them between passages. It's not a matter of isolating them from the tissue or not, but rather how they justify their maintenance in culture. In the images added by the authors, it can be seen that the majority of stromal cells are lost from P0 to P1 after thawing. I still believe that the epithelial part can be passed and maintained, but the rest cannot, and that should be mentioned in the paper as a limitation. However, the authors can demonstrate the maintenance of stromal cells by performing immunostaining with vimentin from passages 4, 5, and 6.

      Thank you for your valuable comments. We have added the statement 'Stromal cells and immune cells are difficult to pass down stably and their proportion is lower than that in the in vivo endometrium' to the Limitations section (Lines 364-365). Additionally, we performed immunostaining with vimentin starting from passage 6 and confirmed the presence of Vimentin+ F-actin+ stromal cells (as shown in Author response image 1).

    1. Author response:

      The following is the authors’ response to the original reviews.

      eLife assessment

      This study presents a new and valuable theoretical account of spatial representational drift in the hippocampus. The evidence supporting the claims is convincing, with a clear and accessible explanation of the phenomenon. Overall, this study will likely attract researchers exploring learning and representation in both biological and artificial neural networks.

      We would like to ask the reviewers to consider elevating the assessment due to the following arguments. As noted in the original review, the study bridges two different fields (machine learning and neuroscience), and does not only touch a single subfield (representational drift in neuroscience). In the revision, we also analysed data from four different labs, strengthening the evidence and the generality of the conclusions.

      Public Reviews:

      Reviewer #1 (Public Review):

      The authors start from the premise that neural circuits exhibit "representational drift" -- i.e., slow and spontaneous changes in neural tuning despite constant network performance. While the extent to which biological systems exhibit drift is an active area of study and debate (as the authors acknowledge), there is enough interest in this topic to justify the development of theoretical models of drift.

      The contribution of this paper is to claim that drift can reflect a mixture of "directed random motion" as well as "steady state null drift." Thus far, most work within the computational neuroscience literature has focused on the latter. That is, drift is often viewed to be a harmless byproduct of continual learning under noise. In this view, drift does not affect the performance of the circuit nor does it change the nature of the network's solution or representation of the environment. The authors aim to challenge the latter viewpoint by showing that the statistics of neural representations can change (e.g. increase in sparsity) during early stages of drift. Further, they interpret this directed form of drift as "implicit regularization" on the network.

      The evidence presented in favor of these claims is concise. Nevertheless, on balance, I find their evidence persuasive on a theoretical level -- i.e., I am convinced that implicit regularization of noisy learning rules is a feature of most artificial network models. This paper does not seem to make strong claims about real biological systems. The authors do cite circumstantial experimental evidence in line with the expectations of their model (Khatib et al. 2022), but those experimental data are not carefully and quantitatively related to the authors' model.

      We thank the reviewer for pushing us to present stronger experimental evidence. We now analysed data from four different labs. Two of those are novel analyses of existing data (Karlsson et al, Jercog et al). All datasets show the same trend - increasing sparsity and increasing information per cell. We think that the results, presented in the new figure 3, allow us to make a stronger claim on real biological systems.

      To establish the possibility of implicit regularization in artificial networks, the authors cite convincing work from the machine-learning community (Blanc et al. 2020, Li et al., 2021). Here the authors make an important contribution by translating these findings into more biologically plausible models and showing that their core assumptions remain plausible. The authors also develop helpful intuition in Figure 4 by showing a minimal model that captures the essence of their result.

      We are glad that these translation efforts are appreciated.

      In Figure 2, the authors show a convincing example of the gradual sparsification of tuning curves during the early stages of drift in a model of 1D navigation. However, the evidence presented in Figure 3 could be improved. In particular, 3A shows a histogram displaying the fraction of active units over 1117 simulations. Although there is a spike near zero, a sizeable portion of simulations have greater than 60% active units at the end of the training, and critically the authors do not characterize the time course of the active fraction for every network, so it is difficult to evaluate their claim that "all [networks] demonstrated... [a] phase of directed random motion with the low-loss space." It would be useful to revise the manuscript to unpack these results more carefully. For example, a histogram of log(tau) computed in panel B on a subset of simulations may be more informative than the current histogram in panel A.

      The previous figure 3A was indeed confusing. In particular, it lumped together many simulations without proper curation. We redid this figure (now Figure 4), and added supplementary figures (Figures S1, S2) to better explain our results. It is now clear that the simulations with a large number of active units were either due to non-convergence, slow timescale of sparsification or simulations featuring label noise in which the fraction of active units is less affected. Regarding the log(tau) calculation, while it could indeed be an informative plot, it could not be calculated in a simple manner for all simulations. This is because learning curves are not always exponential, but sometimes feature initial plateaus (see also Saxe et al 2013, Schuessler et al 2020). We added a more detailed explanation of this limitation in the methods section, and we believe the current figure exemplifies the effect in a satisfactory manner.
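      The difficulty with log(tau) can be illustrated with a short sketch. It assumes, purely for illustration, that a timescale is estimated by a log-linear fit of the excess of a curve over its asymptote; `fit_log_tau` is a hypothetical helper, not the procedure used in the manuscript. For a pure exponential the fit is essentially exact, whereas a curve with an initial plateau is poorly described by a single timescale.

```python
import numpy as np


def fit_log_tau(t, y, y_inf):
    """Estimate log(tau) assuming y(t) = y_inf + A * exp(-t / tau).

    Returns (log_tau, r_squared), where r_squared measures how well the
    straight line describes log(y - y_inf). `y_inf` is an assumed, known
    asymptote (hypothetical input).
    """
    t = np.asarray(t, dtype=float)
    excess = np.asarray(y, dtype=float) - y_inf
    keep = excess > 0  # the logarithm is only defined above the asymptote
    log_excess = np.log(excess[keep])
    slope, intercept = np.polyfit(t[keep], log_excess, 1)
    if slope >= 0:
        raise ValueError("curve does not decay; tau is undefined")
    residual = log_excess - (slope * t[keep] + intercept)
    ss_res = float((residual ** 2).sum())
    ss_tot = float(((log_excess - log_excess.mean()) ** 2).sum())
    return float(np.log(-1.0 / slope)), 1.0 - ss_res / ss_tot


# Synthetic curves (made-up numbers, for illustration only).
t = np.arange(200)
exponential = 0.1 + 0.8 * np.exp(-t / 50.0)
plateau = 0.1 + 0.8 / (1.0 + np.exp((t - 100.0) / 10.0))

log_tau_exp, r2_exp = fit_log_tau(t, exponential, 0.1)
log_tau_plateau, r2_plateau = fit_log_tau(t, plateau, 0.1)
```

      In this toy setting the goodness of fit drops markedly for the plateau curve, which is one way to see why a single log(tau) value would not be comparable across all simulations.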

      Reviewer #2 (Public Review):

      Summary:

      In the manuscript "Representational drift as a result of implicit regularization" the authors study the phenomenon of representational drift (RD) in the context of an artificial network that is trained in a predictive coding framework. When trained on a task for spatial navigation on a linear track, they found that a stochastic gradient descent algorithm led to a fast initial convergence to spatially tuned units, but then to a second very slow, yet directed drift which sparsified the representation while increasing the spatial information. They finally show that this separation of timescales is a robust phenomenon and occurs for a number of distinct learning rules.

      Strengths:

      This is a very clearly written and insightful paper, and I think people in the community will benefit from understanding how RD can emerge in such artificial networks. The mechanism underlying RD in these models is clearly laid out and the explanation given is convincing.

      We thank the reviewer for the support.

      Weaknesses:

      It is unclear how this mechanism may account for the learning of multiple environments.

      There are two facets to the topic of multiple environments. First, are the results of the current paper relevant when there are multiple environments? Second, what is the interaction between brain mechanisms of dealing with multiple environments and the results of the current paper?

      We believe the answer to the first question is positive. The near-orthogonality of representations between environments implies that changes in one can happen without changes in the other. This is evident, for instance, in Khatib et al and Geva et al - in both cases, drift seems to happen independently in two environments, even though they are visited intermittently and are visually similar.

      The second question is a fascinating one, and we are planning to pursue it in future work. While the exact way in which the brain achieves this near-independence is an open question, remapping is one possible window into this process.

      We extended the discussion to make these points clear.

      The process of RD through this mechanism also appears highly non-stationary, in contrast to what is seen in familiar environments in the hippocampus, for example.

      The non-stationarity noted by the reviewer is indeed a major feature of our observations, and is indeed linked to familiarity. We divide learning into three phases (now more clearly stated in Table 1 and Figure 4C). The first, rapid phase, consists of improvement of performance - corresponding to initial familiarity with the environment. The third phase, often reported in the literature of representational drift, is indeed stationary and obtained after prolonged familiarity. Our work focuses on the second phase, which is not as immediate as the first one, and can take several days. We note in the discussion that experiments which include a long familiarization process can miss this phase (see also Table 3). Furthermore, we speculate that real life is less stationary than a lab environment, and this second phase might actually be more relevant there.

      Reviewer #3 (Public Review):

      Summary:

      Single-unit neural activity tuned to environmental or behavioral variables gradually changes over time. This phenomenon, called representational drift, occurs even when all external variables remain constant, and challenges the idea that stable neural activity supports the performance of well-learned behaviors. While a number of studies have described representational drift across multiple brain regions, our understanding of the underlying mechanism driving drift is limited. Ratzon et al. propose that implicit regularization - which occurs when machine learning networks continue to reconfigure after reaching an optimal solution - could provide insights into why and how drift occurs in neurons. To test this theory, Ratzon et al. trained a Feedforward Network to perform the oft-utilized linear track behavioral paradigm and compared the changes in hidden layer units to those observed in hippocampal place cells recorded in awake, behaving animals.

      Ratzon et al. clearly demonstrate that hidden layer units in their model undergo consistent changes even after the task is well-learned, mirroring representational drift observed in real hippocampal neurons. They show that the drift occurs across three separate measures: the active proportion of units (referred to as sparsification), spatial information of units, and correlation of spatial activity. They continue to address the conditions and parameters under which drift occurs in their model to assess the generalizability of their findings.

      However, the generalizability results are presented primarily in written form: additional figures are warranted to aid in reproducibility.

      We added figures, and a GitHub repository with all the code to allow full reproducibility.

      Last, they investigate the mechanism through which sparsification occurs, showing that the flatness of the manifold near the solution can influence how the network reconfigures. The authors suggest that their findings indicate a three-stage learning process: 1) fast initial learning followed by 2) directed motion along a manifold which transitions to 3) undirected motion along a manifold.

      Overall, the authors' results support the main conclusion that implicit regularization in machine learning networks mirrors representational drift observed in hippocampal place cells.

      We thank the reviewer for this summary.

      However, additional figures/analyses are needed to clearly demonstrate how different parameters used in their model qualitatively and quantitatively influence drift.

      We now provide additional figures regarding parameters (Figures S1, S2).

      Finally, the authors need to clearly identify how their data supports the three-stage learning model they suggest.

      Their findings promise to open new fields of inquiry into the connection between machine learning and representational drift and generate testable predictions for neural data.

      Strengths:

      (1) Ratzon et al. make an insightful connection between well-known phenomena in two separate fields: implicit regularization in machine learning and representational drift in the brain. They demonstrate that changes in a recurrent neural network mirror those observed in the brain, which opens a number of interesting questions for future investigation.

      (2) The authors do an admirable job of writing to a large audience and make efforts to provide examples to make machine learning ideas accessible to a neuroscience audience and vice versa. This is no small feat and aids in broadening the impact of their work.

      (3) This paper promises to generate testable hypotheses to examine in real neural data, e.g., that drift rate should plateau over long timescales (now testable with the ability to track single-unit neural activity across long time scales with calcium imaging and flexible silicon probes). Additionally, it provides another set of tools for the neuroscience community at large to use when analyzing the increasingly high-dimensional data sets collected today.

      We thank the reviewer for these comments. Regarding the hypotheses, these are partially confirmed in the new analyses we provide of data from multiple labs (new Figure 3 and Table 3) - indicating that prolonged exposure to the environment leads to more stationarity.

      Weaknesses:

      (1) Neural representational drift and directed/undirected random walks along a manifold in ML are well described. However, outside of the first section of the main text, the analysis focuses primarily on the connection between manifold exploration and sparsification without addressing the other two drift metrics: spatial information and place field correlations. It is therefore unclear if the results from Figures 3 and 4 are specific to sparseness or extend to the other two metrics. For example, are these other metrics of drift also insensitive to most of the Feedforward Network parameters as shown in Figure 3 and the related text? These concerns could be addressed with panels analogous to Figures 3a-c and 4b for the other metrics and will increase the reproducibility of this work.

      We note that the results from figures 3 and 4 (original manuscript) are based on abstract tasks, while in figure 2 there is a contextual notion of spatial position. Spatial position metrics are not applicable to the abstract tasks, as they are simple random mappings of inputs, and there isn’t necessarily an underlying latent variable such as position. This transition between task types is better explained in the text now. In essence, the spatial information and place field correlation changes are simply signatures of the movements in parameter space. In the abstract tasks their change becomes trivial, as the spatial information becomes strongly correlated with sparsity and place fields are simply the activity vectors of units. These are guaranteed to change as long as there are changes in the activity statistics. We present here the calculation of these metrics averaged over simulations for completeness.

      Author response image 1.

      (A) PV correlation between training time points, averaged over 362 simulations. (B) Mean SI of units normalized to first time step, averaged over 362 simulations. Red line shows the average time point of loss convergence, the shaded area represents one standard deviation.

      (2) Many caveats/exceptions to the generality of findings are mentioned only in the main text without any supporting figures, e.g., "For label noise, the dynamics were qualitatively different, the fraction of active units did not reduce, but the activity of the units did sparsify" (lines 116-117). Supporting figures are warranted to illustrate which findings are "qualitatively different" from the main model, which are not different from the main model, and which of the many parameters mentioned are important for reproducing the findings.

      We have now added figures (S1, S2) that show exactly this. We also added a GitHub repository to allow full reproduction.

      (3) Key details of the model used by the authors are not listed in the methods. While they are mentioned in reference 30 (Recanatesi et al., 2021), they need to be explicitly defined in the methods section to ensure future reproducibility.

      The simulation is described in detail in the Methods section. We also added a GitHub repository to allow full reproducibility.

      (4) How different states of drift correspond to the three learning stages outlined by the authors is unclear. Specifically, it is not clear where the second stage ends, and the third stage begins, either in real neural data or in the figures. This is compounded by the fact that the third stage - of undirected, random manifold exploration - is only discussed in relation to the introductory Figure 1 and is never connected to the neural network data or actual brain data presented by the authors. Are both stages meant to represent drift? Or is only the second stage meant to mirror drift, while undirected random motion along a manifold is a prediction that could be tested in real neural data? Identifying where each stage occurs in Figures 2C and E, for example, would clearly illustrate which attributes of drift in hidden layer neurons and real hippocampal neurons correspond to each stage.

      Thanks for this comment, which urged us to better explain these concepts.

      The different processes (reduction in loss, reduction in Hessian) happen in parallel with different timescales. Thus, there are no sharp transitions between the phases. This is now explained in the text in relation to figure 4C, where the approximate boundaries are depicted.

      The term drift is often used to denote a change in representation without a change in behavior. In this sense, both the second and third phases correspond to drift. Only the third stage is stationary. This is now emphasized in the text and in the new Table 1. Regarding experimental data, apart from the new figure 3 with four datasets, we also summarize in Table 3 the relation between duration of familiarity and stationarity of the data.

      Recommendations for the authors:

      The reviewers have raised several concerns. They concur that the authors should address the specific points below to enhance the manuscript.

      (1) The three different phases of learning should be clearly delineated, along with how they are determined. It remains unclear in which exact phase the drift is observed.

      This is now clearly explained in the new Table 1 and Figure 4C. Note that the different processes (reduction in loss, reduction in Hessian) happen in parallel with different timescales. Thus, there are no sharp transitions between the phases. This is now explained in the text in relation to figure 4C, where the approximate boundaries are depicted.

      The term drift is often used to denote a change in representation without a change in behavior. In this sense, both the second and third phases correspond to drift. Only the third stage is stationary. This is now emphasized in the text and in the new Table 1. Regarding experimental data, apart from the new figure 3 with four datasets, we also summarize in Table 3 the relation between duration of familiarity and stationarity of the data.

      (2) The term "sparsification" of unit activity is not fully clear. Its meaning should be more explicitly explained, especially since, in the simulations, a significant number of units appear to remain active (Fig. 3A).

      We now define precisely the two measures we use - Active Fraction, and Fraction Active Units. There is a new section with an accompanying figure in the Methods section. As Figure S2 shows, the noise statistics (label noise vs. update noise) differentially affect these two measures.
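      To make the distinction between the two measures concrete, here is a minimal sketch under assumed definitions; the precise definitions are those given in the Methods of the manuscript, which are not reproduced here. In this sketch, Fraction Active Units is taken to be the share of units that respond to at least one input, and Active Fraction the share of inputs to which a responsive unit responds, averaged over responsive units. The function names, the `threshold` parameter, and the choice to average only over responsive units are all assumptions made for illustration.

```python
import numpy as np


def fraction_active_units(activity, threshold=0.0):
    """Share of units (rows) active for at least one input (column)."""
    active = np.asarray(activity, dtype=float) > threshold
    return float(active.any(axis=1).mean())


def active_fraction(activity, threshold=0.0):
    """Mean share of inputs per responsive unit (assumed definition)."""
    active = np.asarray(activity, dtype=float) > threshold
    responsive = active.any(axis=1)
    if not responsive.any():
        return 0.0
    return float(active[responsive].mean(axis=1).mean())


# Toy activity matrix: 4 units x 4 inputs (made-up values).
activity = np.array([
    [1.0, 0.0, 0.0, 0.0],  # narrowly tuned unit
    [0.0, 2.0, 0.0, 0.0],  # narrowly tuned unit
    [0.0, 0.0, 0.0, 0.0],  # silent unit
    [3.0, 3.0, 3.0, 3.0],  # unit active for every input
])
```

      With definitions of this kind, the two quantities can move independently: units may narrow their tuning (lower Active Fraction) without falling silent (unchanged Fraction Active Units), which matches the qualitative pattern the response attributes to label noise.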

      (3) While the study primarily focuses on one aspect of representational drift-the proportion of active units-it should also explore other features traditionally associated with representational drift, such as spatial information and the correlation between place fields.

      This absence of features is related to the abstract nature of some of the tasks simulated in our paper. In our original submission the transition from a predictive coding task to more abstract tasks was not clearly explained, creating some confusion regarding the measured metrics. We have now clarified the motivation for this transition.

      Both the initial simulation and the new experimental data analysis include spatial information (Figures 2,3). The following simulations (Figure 4) with many parameter choices use more abstract tasks, for which the notion of correlation between place cells and spatial information loses its meaning as there is no spatial ordering of the inputs, and every input is encountered only once. Spatial information becomes strongly correlated with the inverse of the active fraction metric. The correlation between place cells is also directly linked to increase in sparseness for these tasks.

      (4) There should be a clearer illustration of how labeling noise influences learning dynamics and sparsification.

      This was indeed confusing in the original submission. We removed the simulations with label noise from Figure 4, and added a supplementary figure (S2) illustrating the different effects of label noise.

      (5) The representational drift observed in this study's simulations appears to be nonstationary, which differs from in vivo reports. The reasons for this discrepancy should be clarified.

      We added experimental results from three additional labs demonstrating a change in activity statistics (i.e. increase in spatial information and increase in sparseness) over a long period of time. We suggest that such a change long after the environment is already familiar is an indication of the second phase, and stress that this change seems to saturate at some point, and that most drift papers start collecting data after this saturation, hence this effect was missed in previous in vivo reports. Furthermore, these effects are becoming more apparent with the advent of new calcium imaging methods, as the older electrophysiological recording methods did not usually allow recording of large numbers of cells for long periods of time. The new Table 3 surveys several experimental papers, emphasizing the degree of familiarity with the environment.

      (6) A distinctive feature of the hippocampus is its ability to learn different spatial representations for various environments. The study does not test representational drift in this context, a topic of significant interest to the community. Whether the authors choose to delve into this is up to them, but it should at least be discussed more comprehensively, as it's only briefly touched upon in the current manuscript version.

      There are two facets to the topic of multiple environments. First, are the results of the current paper relevant when there are multiple environments? Second, what is the interaction between brain mechanisms of dealing with multiple environments and the results of the current paper?

      We believe the answer to the first question is positive. The near-orthogonality of representations between environments implies that changes in one can happen without changes in the other. This is evident, for instance, in Khatib et al and Geva et al - in both cases, drift seems to happen independently in two environments, even though they are visited intermittently and are visually similar.

      The second question is a fascinating one, and we are planning to pursue it in future work. While the exact way in which the brain achieves this near-independence is an open question, remapping is one possible window into this process.

      We extended the discussion to make these points clear.

      (7) The methods section should offer more details about the neural nets employed in the study. The manuscript should be explicit about the terms "hidden layer", "units", and "neurons", ensuring they are defined clearly and not used interchangeably.

      We changed the usage of these terms to be more coherent and made our code publicly available. Specifically, “units” refer to artificial networks and “neurons” to biological ones.

      In addition, each reviewer has raised both major and minor concerns. These are listed below and should be addressed where possible.

      Reviewer #1 (Recommendations For The Authors):

      I recommend that the authors edit the text to soften their claims. For example:

      In the abstract "To uncover the underlying mechanism, we..." could be changed to "To investigate, we..."

      Agree. Done

      On line 21, "Specifically, recent studies showed that..." could be changed to "Specifically, recent studies suggest that..."

      Agree. Done

      On line 100, "All cases" should probably be softened to "Most cases" or more details should be added to Figure 3 to support the claim that every simulation truly had a phase of directed random motion.

      The text was changed in accordance with the reviewer’s suggestion. In addition, the figure was changed and only includes simulations in which we expected unit sparsity to arise (without label noise). We also added explanations and supplementary figures for label noise.

      Unless I missed something obvious, there is no new experimental data analysis reported in the paper. Thus, line 159 of the discussion, "a phenomenon we also observed in experimental data" should be changed to "a phenomenon that was recently reported in experimental data."

      We thank the reviewer for drawing our attention to this. We have now analyzed data from three other labs, two of which are novel analyses of existing data. All four datasets show the same trends of sparseness with increasing spatial information. The new Figure 3 and text now describe this.

      On line 179 of the Discussion, "a family of network configurations that have identical performance..." could be softened to "nearly identical performance." It would be possible for networks to have minuscule differences in performance that are not detected due to stochastic batch effects or limits on machine precision.

      The text was changed in accordance with the reviewer’s suggestion.

      Other minor comments:

      Citation 44 is missing the conference venue, please check all citations are formatted properly.

      Corrected.

      In the discussion on line 184, the connection to remapping was confusing to me, particularly because the cited reference (Sanders et al. 2020) is more of a conceptual model than an artificial network model that could be adapted to the setting of noisy learning considered in this paper. How would an RNN model of remapping (e.g. Low et al. 2023; Remapping in a recurrent neural network model of navigation and context inference) be expected to behave during the sparsifying portion of drift?

      We now clarified this section. The conceptual model of Sanders et al includes a specific prediction (Figure 7 there) which is very similar to ours - a systematic change in robustness depending on duration of training. Regarding the Low et al model, using such mechanistic models is an exciting avenue for future research.

      Reviewer #2 (Recommendations For The Authors):

      I only have two major questions.

      (1) Learning multiple representations: Memory systems in the brain typically must store many distinct memories. Certainly, the hippocampus, where RD is prominent, is involved in the ongoing storage of episodic memories. But even in the idealized case of just two spatial memories, for example, two distinct linear tracks, how would this learning process look? Would there be any interference between the two learning processes or would they be largely independent? Is the separation of time scales robust to the number of representations stored? I understand that to answer this question fully probably requires a research effort that goes well beyond the current study, but perhaps an example could be shown with two environments. At the very least the authors could express their thoughts on the matter.

      There are two facets to the topic of multiple environments. First, are the results of the current paper relevant when there are multiple environments? Second, what is the interaction between brain mechanisms of dealing with multiple environments and the results of the current paper?

      We believe the answer to the first question is positive. The near-orthogonality of representations between environments implies that changes in one can happen without changes in the other. This is evident, for instance, in Khatib et al and Geva et al - in both cases, drift seems to happen independently in two environments, even though they are visited intermittently and are visually similar.

      The second question is a fascinating one, and we are planning to pursue it in future work. While the exact way in which the brain achieves this near-independence is an open question, remapping is one possible window into this process.

      We extended the discussion to make these points clear.

      (2) Directed drift versus stationarity: I could not help but notice that the RD illustrated in Fig.2D is not stationary in nature, i.e. the upper right and lower left panels are quite different. This appears to contrast with findings in the hippocampus, for example, Fig.3e-g in (Ziv et al, 2013). Perhaps it is obvious that a directed process will not be stationary, but the authors note that there is a third phase of steady-state null drift. Is the RD seen there stationary? Basically, I wonder if the process the authors are studying is relevant only as a novel environment becomes familiar, or if it is also applicable to RD in an already familiar environment. Please discuss the issue of stationarity in this context.

      The non-stationarity noted by the reviewer is indeed a major feature of our observations, and is indeed linked to familiarity. We divide learning into three phases (now more clearly stated in Table 1 and Figure 4C). The first, rapid, phase consists of improvement of performance - corresponding to initial familiarity with the environment. The third phase, often reported in the literature of representational drift, is indeed stationary and obtained after prolonged familiarity. Our work focuses on the second phase, which is not as immediate as the first one, and can take several days. We note in the discussion that experiments which include a long familiarization process can miss this phase (see also Table 3). Furthermore, we speculate that real life is less stationary than a lab environment, and this second phase might actually be more relevant there.

      Reviewer #3 (Recommendations For The Authors):

      Most of my general recommendations are outlined in the public review. A large portion of my comments regards increasing clarity and explicitly defining many of the terms used which may require generating more figures (to better illustrate the generality of findings) or modifying existing figures (e.g., to show how/where the three stages of learning map onto the authors' data).

      Sparsification is not clearly defined in the main text. As I read it, sparsification is meant to refer to the activity of neurons, but this needs to be clearly defined. For example, lines 262-263 in the methods define "sparseness" by the number of active units, but lines 116-117 state: "For label noise, the dynamics were qualitatively different, the fraction of active units did not reduce, but the activity of the units did sparsify." If the fraction of active units (defined as "sparseness") did not change, what does it mean that the activity of the units "sparsified"? If the authors mean that the spatial activity patterns of hidden units became more sharply tuned, this should be clearly stated.

      We now precisely define the two measures we use: Active Fraction and Fraction Active Units. There is a new section with an accompanying figure in the Methods section. As Figure S2 shows, the noise statistics (label noise vs. update noise) differentially affect these two measures.

      Likewise, it is unclear which of the features the authors outlined - spatial information, active proportion of units, and spatial correlation - are meant to represent drift. The authors should clearly delineate which of these three metrics they mean to delineate drift in the main text rather than leave it to the reader to infer. While all three are mentioned early on in the text (Figure 2), the authors focus more on sparseness in the last half of the text, making it unclear if it is just sparseness that the authors mean to represent drift or the other metrics as well.

      The main focus of our paper is on the non-stationarity of drift, namely that features (such as these three) systematically change in a directed manner as part of the drift process. The new analyses of experimental data show sparseness and spatial information.

      The focus on sparseness in the second half of the paper is because we move to more abstract tasks. These are also easy to study in the more abstract tasks in the second part of the paper. In our original submission the transition from a predictive coding task to more abstract tasks was not clearly explained, creating some confusion regarding the measured metrics. We have now clarified the motivation for this transition.

      It is not clear if a change in the number of active units alone constitutes "drift", especially since Geva et al. (2023) recently showed that both changes in firing rate AND place field location drive drift, and that the passage of time drives changes in activity rate (or # cells active).

      Our work did not deal with purely time-dependent drift, but rather focused on experience-dependence. Furthermore, Geva et al study the stationary phase of drift, where we do not expect a systematic change in the total number of cells active. They report changes in the average firing rate of active cells in this phase, as a function of time - which does not contradict our findings.

      "hidden layer", "units", and "neurons" seem to be used interchangeably in the text (e.g., line 81-85). However, this is confusing in several places, in particular in lines 83-85 where "neurons" is used twice. The first usage appears to refer to the rate maps of the hidden layer units simulated by the authors, while the second "neurons" appears to refer to real data from Ziv 2013 (ref 5). The authors should make it explicit whether they are referring to hidden layer units or actual neurons to avoid reader confusion.

      We changed the usage of these terms to be more coherent. Specifically, “units” refer to artificial networks and “neurons” to biological ones.

      The authors should clearly illustrate which parts of their findings support their three-phase learning theory. For example, does 2E illustrate these phases, with the first tenth of training time points illustrating the early phase, time 0.1-0.4 illustrating the intermediate phase, and 0.4-1 illustrating the last phase? Additionally, they should clarify whether the second and third stages are meant to represent drift, or is it only the second stage of directed manifold exploration that is considered to represent drift? This is unclear from the main text.

      The different processes (reduction in loss, reduction in Hessian) happen in parallel with different timescales. Thus, there are no sharp transitions between the phases. This is now explained in the text in relation to figure 4C, where the approximate boundaries are depicted.

      The term drift is often used to denote a change in representation without a change in behavior. In this sense, both the second and third phases correspond to drift. Only the third stage is stationary. This is now emphasized in the text and in the new Table 1. Regarding experimental data, apart from the new figure 3 with four datasets, we also summarize in Table 3 the relation between duration of familiarity and stationarity of the data.

      Line 45 - It appears that the acronym ML is not defined above here anywhere.

      Added.

      Line 71: the ReLU function should be defined in the text, e.g., sigma(x) = x if x > 0 else 0.

      Added.

      106-107: Figures (or supplemental figures) to demonstrate how most parameters do not influence sparsification dynamics are warranted. As written, it is unclear what "most parameters" mean - all but noise scale. What about the learning rule? Are there any interactions between parameters?

      We now removed the label noise from Figure 4, and added two supplementary figures to clearly explain the effect of parameters. Figure 4 itself was also redone to clarify this issue.

      2F middle: should "change" be omitted for SI?

      The panel was replaced by a new one in Figure 3.

      116-119: A figure showing how results differ for label noise is warranted.

      This is now done in Figure S1, S2.

      124: typo, The -> the

      Corrected.

      127-129: This conclusion statement is the first place in the text where the three stages are explicitly outlined. There does not appear to be any support or further explanation of these stages in the text above.

      We now explain this earlier at the end of the Introduction section, along with the new Table 1 and marking on Figure 4C.

      132-133 seems to be more of a statement and less of a prediction or conclusion - do the authors mean "the flatness of the loss landscape in the vicinity of the solution predicts the rate of sparsification?"

      We thank the reviewer for this observation. The sentence was rephrased:

      Old: As illustrated in Fig. 1, different solutions in the zero-loss manifold might vary in some of their properties. The specific property suggested from theory is the flatness of the loss landscape in the vicinity of the solution.

      New: As illustrated in Fig. 1, solutions in the zero-loss manifold have identical loss, but might vary in some of their properties. The authors of [26] suggest that noisy learning will slowly increase the flatness of the loss landscape in the vicinity of the solution.

      135: typo, it's -> its

      Corrected.

      Line 135-136 "Crucially, the loss on the entire manifold is exactly zero..." This appears to contradict the Figure 4A legend - the loss appears to be very high near the top and bottom edges of the manifold in 4A. Do the authors mean that the loss along the horizontal axis of the manifold is zero?

      The reviewer is correct. The manifold mentioned in the sentence is indeed the horizontal axis. We changed the text and the figure to make it clearer.

      Equation 6: This does not appear to agree with equation 2 - should there be an E_t term for an expectation function?

      Corrected.

      Line 262-263: "Sparseness means that a unit has become inactive for all inputs." This should also be stated explicitly as the definition of sparseness/sparsification in the main text.

      We now precisely define the two measures we use: Active Fraction and Fraction Active Units. There is a new section with an accompanying figure in the Methods section. As Figure S2 shows, the noise statistics (label noise vs. update noise) differentially affect these two measures.
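
      As a rough illustration of how two such measures can diverge, here is a minimal sketch. The exact definitions in the manuscript's Methods are not reproduced in this response, so the definitions below are assumptions: `fraction_active_units` counts the units that respond to at least one input, and `active_fraction` is the share of inputs that drive a unit, pooled over the units that are still active. The function names and the toy pre-activations are hypothetical.

```python
import numpy as np


def relu(x):
    """ReLU as described in the review: sigma(x) = x if x > 0 else 0."""
    return np.maximum(x, 0.0)


def fraction_active_units(activity):
    """Assumed definition: share of units active for at least one input.

    activity has shape (n_inputs, n_units) and is non-negative.
    """
    return float(np.mean(np.any(activity > 0, axis=0)))


def active_fraction(activity):
    """Assumed definition: share of (input, unit) pairs with non-zero
    activity, restricted to units that are active for at least one input."""
    active_units = np.any(activity > 0, axis=0)
    if not active_units.any():
        return 0.0
    return float(np.mean(activity[:, active_units] > 0))


# Hypothetical pre-activations: 4 inputs (rows) x 3 hidden units (columns).
pre_activation = np.array([
    [1.0, -1.0, -2.0],
    [2.0, -1.0, 3.0],
    [-1.0, -1.0, 4.0],
    [-3.0, -1.0, 5.0],
])
activity = relu(pre_activation)

print(fraction_active_units(activity))  # unit 1 is silent for every input
print(active_fraction(activity))        # tuning width of the remaining units
```

      Under these assumed definitions, one measure can fall while the other stays flat: a unit can narrow its tuning without ever becoming fully silent.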

    1. Author response:

      The following is the authors’ response to the previous reviews.

      Reviewer #2 (Public Review):

      Weaknesses:

      The comparison of affinity predictions derived from AlphaFold2 and H3-opt models, based on molecular dynamics simulations, should have been discussed in depth. In some cases, there are huge differences between the estimations from H3-opt models and those from experimental structures. It seems that the authors obtained average differences of the real delta, instead of average differences of the absolute value of the delta. This can be misleading, because high negative differences might be compensated by high positive differences when computing the mean value. Moreover, it would have been good for the authors to disclose the trajectories from the MD simulations.

      Thank you for your careful checks. We fully understand your concerns about the large differences in the calculated affinities. To understand the source of these differences, we carefully analyzed the trajectories of the input structures during the MD simulations. We found that the antigen-antibody complex shifted as the simulation transitioned from NVT to NPT during pre-equilibration, even when restraints were applied to the protein structure. To address this issue, we consulted the solution provided on Amber's mailing list (http://archive.ambermd.org/202102/0298.html) and modified the ATOMS_PER_MOLECULE entry of the topology file to merge the antigen-antibody complex into one molecule. As a result, the SOLVENT_POINTERS entry was also adjusted. Finally, we performed all MD simulations and calculated the affinities of all complexes.

      We have changed “Afterwards, a 25000-step NVT simulation with a time step of 1 fs was performed to gradually heat the system from 0 K to 100 K. A 250000-step NPT simulation with a time step of 2 fs was carried out to further heat the system from 100 K to 298 K.” to “Afterwards, a 400-ps NVT simulation with a time step of 2 fs was performed to gradually heat the system from 0 K to 298 K (0–100 K: 100 ps; 100–298 K: 200 ps; hold 298 K: 100 ps), and a 100-ps NPT simulation with a time step of 2 fs was performed to equilibrate the density of the system. During heating and density equilibration, we constrained the antigen-antibody structure with a restraint value of 10 kcal·mol⁻¹·Å⁻².” and added the following sentence to the Methods section of our revised manuscript: “The first 50 ns restrains the non-hydrogen atoms of the antigen-antibody complex, and the last 50 ns restrains the non-hydrogen atoms of the antigen, with a constraint value of 10 kcal·mol⁻¹·Å⁻².”
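
      The arithmetic behind the revised schedule can be checked with a short sketch. It only converts the quoted durations and the 2 fs time step into step counts and evaluates a target temperature; the linear ramp within each segment, and all names used here, are assumptions for illustration and are not taken from the authors' simulation input files.

```python
TIME_STEP_FS = 2  # time step quoted in the revised Methods text

# (label, duration in ps, start temperature in K, end temperature in K)
NVT_SEGMENTS = [
    ("heat 0-100 K", 100, 0.0, 100.0),
    ("heat 100-298 K", 200, 100.0, 298.0),
    ("hold 298 K", 100, 298.0, 298.0),
]
NPT_DURATION_PS = 100  # density equilibration


def n_steps(duration_ps, dt_fs=TIME_STEP_FS):
    """Number of integration steps for a duration in ps (1 ps = 1000 fs)."""
    return duration_ps * 1000 // dt_fs


def target_temperature(t_ps):
    """Target temperature at time t_ps, assuming a linear ramp per segment."""
    elapsed = 0
    for _, duration, t_start, t_end in NVT_SEGMENTS:
        if t_ps <= elapsed + duration:
            fraction = (t_ps - elapsed) / duration
            return t_start + fraction * (t_end - t_start)
        elapsed += duration
    return NVT_SEGMENTS[-1][3]  # past the end: stay at the final temperature


nvt_steps = sum(n_steps(duration) for _, duration, _, _ in NVT_SEGMENTS)
npt_steps = n_steps(NPT_DURATION_PS)
print(nvt_steps, npt_steps)
print(target_temperature(50), target_temperature(200), target_temperature(350))
```

      With these numbers the 400-ps NVT stage corresponds to 200,000 steps and the 100-ps NPT stage to 50,000 steps.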

      In addition, we have corrected the calculation of mean deltas to use absolute values and have demonstrated that the average affinities of structures predicted by H3-OPT were closer to those of experimentally determined structures than the values obtained with AF2. These results have been updated in the revised manuscript. However, significant differences still exist between the estimates from H3-OPT models and those derived from experimental structures in a few cases. We found that antibodies moved away from antigens in both AF2- and H3-OPT-predicted complexes during simulations, resulting in RMSDbackbone (RMSD of the antibody backbone) exceeding 20 Å. These deviations led to significant structural changes in the complexes and consequently to notable differences in the affinity calculations. Thus, we removed three samples (PDB IDs: 4qhu, 6flc, 6plk) from the benchmark because these predicted structures moved away from the antigen structure during MD simulations, resulting in huge energy differences from the native structures.
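
      The reviewer's point about averaging can be seen in a small sketch. The delta values below are hypothetical and chosen only to show the cancellation; they are not taken from the authors' results.

```python
def mean_signed(deltas):
    """Mean of the raw (signed) differences; opposite signs cancel."""
    return sum(deltas) / len(deltas)


def mean_absolute(deltas):
    """Mean of the absolute differences; every deviation counts."""
    return sum(abs(d) for d in deltas) / len(deltas)


# Hypothetical affinity differences (predicted minus experimental), in kcal/mol.
deltas = [-12.0, 12.0, -3.0, 3.0]

print(mean_signed(deltas))    # large errors in opposite directions cancel
print(mean_absolute(deltas))  # the typical error size is preserved
```

      Here the signed mean is zero even though no single prediction is accurate, which is why the absolute value gives the more informative summary.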

      Author response table 1.

      We also appreciate your reminder, and we have calculated all RMSDbackbone during production runs (SI Fig. 5).

      Author response image 1.

      Reviewer #3 (Public Review):

      Weaknesses:

      The proposed method lacks a confidence score or a warning to help guide users in moderate to challenging cases.

      We apologize for our mistakes. We have updated our GitHub code and added the following sentences to the Methods section to clarify how we train this confidence score module: “Confidence score prediction module

      We apply an MSE loss for confidence prediction; the label error was calculated as the Cα deviation of each residue after alignment. The inputs of this module are the same as those used for H3-OPT, and it generates a confidence score ranging from 0 to 100. The dropout rates of H3-OPT were set to 0.25. The learning rate and weight decay of the Adam optimizer are set to 1 × 10⁻⁵ and 1 × 10⁻⁴, respectively.”

      Reviewer #2 (Recommendations For The Authors):

      I would strongly suggest that the authors deepen their discussion of the affinity prediction based on Molecular Dynamics. In particular, why do the authors think that some structures exhibit huge differences between the predictions from the experimental structure and those from the H3-opt model? Also, please compute the mean deltas using the absolute value and not the real value; the latter can be extremely misleading and hide very large differences in opposite directions that cancel out when averaging.

      I would also advise including graphical results of the MD trajectories, at least as Supp. Material.

      We thank you for your feedback and fully understand your concerns. We found the source of these huge differences and solved this problem by changing the MD simulation protocol. Then, we calculated all affinities and corrected the calculation of mean deltas to use the absolute value. The RMSDbackbone values were also measured to enable accurate affinity predictions during production runs (SI Fig. 5). There are still big differences between the estimates from H3-OPT models and those from experimental structures in some cases. We found that antibodies moved away from antigens in both AF2- and H3-OPT-predicted complexes during simulations, resulting in RMSDbackbone exceeding 20 Å. These deviations led to significant structural changes in the complexes and consequently to notable differences in the affinity calculations. Thus, we removed three samples (PDB IDs: 4qhu, 6flc, 6plk) from the benchmark.

      Thanks again for your professional advice.

      Reviewer #3 (Recommendations For The Authors):

      (1) I am pleased with most of the answers provided by the authors to the first review. In my humble opinion, the new manuscript has greatly improved. However, I think some answers to the reviewers are worth including in the main text or supporting information for the benefit of general readers. In particular, the requested statistics (i.e. p-values for Cα-RMSD values across the modeling approaches, p-values and error bars in Fig 5a and 5b, etc.) should be introduced in the manuscript.

      We sincerely appreciate your advice. We have added the statistical values to Fig. 4 and Fig. 5 in our manuscript.

      Author response image 2.

      Author response image 3.

      (2) Similarly, the authors state in their answers that "we have trained a separate module to predict the confidence score of the optimized CDR-H3 loops". That sounds like a great improvement to H3-OPT! However, I couldn't find any reference to that new module in the reviewed version of the manuscript, nor in the available GitHub code. That is the reason I maintain the weakness "The proposed method lacks a confidence score".

      We apologize for our careless mistakes, and thank you for the reminder. We have updated our GitHub code and added the following sentences to the Methods section to clarify how we train this confidence score module:

      “Confidence score prediction module

      We apply an MSE loss for confidence prediction; the label error was calculated as the Cα deviation of each residue after alignment. The inputs of this module are the same as those used for H3-OPT, and it generates a confidence score ranging from 0 to 100. The dropout rates of H3-OPT were set to 0.25. The learning rate and weight decay of the Adam optimizer are set to 1 × 10⁻⁵ and 1 × 10⁻⁴, respectively.”
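
      The training target described above can be sketched as follows. This is an illustration under stated assumptions, not the authors' module: the coordinates are assumed to be already superimposed, the array names and toy values are hypothetical, and the mapping from predicted error to the 0 to 100 confidence score is not specified in the response, so it is left out.

```python
import numpy as np


def ca_deviation(pred_ca, ref_ca):
    """Per-residue Cα deviation (Å) between two aligned (n_residues, 3) arrays."""
    return np.linalg.norm(pred_ca - ref_ca, axis=1)


def mse_loss(predicted_error, label_error):
    """Mean squared error between predicted and observed per-residue errors."""
    return float(np.mean((predicted_error - label_error) ** 2))


# Hypothetical, already-aligned Cα coordinates for a three-residue loop.
ref_ca = np.zeros((3, 3))
pred_ca = np.array([
    [3.0, 4.0, 0.0],
    [0.0, 0.0, 0.0],
    [0.0, 0.0, 1.0],
])

label_error = ca_deviation(pred_ca, ref_ca)   # what the module should learn
predicted_error = np.array([4.0, 0.0, 2.0])   # hypothetical module output

print(label_error)
print(mse_loss(predicted_error, label_error))
```

      In practice the alignment step (for example a least-squares superposition) would come before `ca_deviation`; it is omitted here to keep the sketch short.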

      (3) I acknowledge all the efforts made for solving new mutant/designed nanobody structures. Judging from the solved structures, mutants Y95F and Q118N seem critical to either crystallographic or dimerization contacts stabilizing the CDR-H3 loop, hence preventing the formation of crystals. Clearly, solving a molecular structure is a challenge, hence including the following comment in the manuscript is relevant for readers to correctly assess the magnitude of the validation: "The sequence identities of the VH domain and H3 loop are 0.816 and 0.647, respectively, compared with the best template. The CDR-H3 lengths of these nanobodies are both 17. According to our classification strategy, these nanobodies belong to Sub1. The confidence scores of these AlphaFold2 predicted loops were all higher than 0.8, and these loops were accepted as the outputs of H3-OPT by CBM."

      We appreciate your kind recommendations and have revised “Although Mut1 (E45A) and Mut2 (Q14N) shared the same CDR-H3 sequences as WT, only minor variations were observed in the CDR-H3. H3-OPT generated accurate predictions with Cα-RMSDs of 1.510 Å, 1.541 Å and 1.411 Å for the WT, Mut1, and Mut2, respectively.” to “Although Mut1 (E45A) and Mut2 (Q14N) shared the same CDR-H3 sequences as WT (LengthCDR-H3 = 17), only minor variations were observed in the CDR-H3. H3-OPT generated accurate predictions with Cα-RMSDs of 1.510 Å, 1.541 Å and 1.411 Å for the WT, Mut1, and Mut2, respectively (The confidence scores of these AlphaFold2 predicted loops were all higher than 0.8, and these loops were accepted as the outputs of H3-OPT by CBM).” In addition, we have added the following sentence to the legend of Figure 4 to ensure that readers can appropriately evaluate the significance and reliability of our validations: “The sequence identities of the VH domain and H3 loop are 0.816 and 0.647, respectively, compared with the best template.”

      (4) As pointed out in the first review, I think the work https://doi.org/10.1021/acs.jctc.1c00341 is worth acknowledging in section "2.2 Molecular dynamics (MD) simulations could not provide accurate CDR-H3 loop conformations" of the supplementary material, as it constitutes a clear reference (and probably one of the few) for the MD simulations that the authors intend to perform. Similarly, the work https://doi.org/10.3390/molecules28103991 introduces an earlier benchmark of AI algorithms for predicting antibody and nanobody structures that readers may find interesting to contrast with the present work. Indeed, this latter reference is used by the authors to answer a reviewer comment.

      Thanks a lot for your valuable comments. We have added these references in the proper positions in our manuscript.

    1. Author Response

      The following is the authors’ response to the original reviews.

      Response to Reviewer 1 Comments (Public Review)

      Point 1: First, the authors should provide more convincing data showing that tor and tapA genes are indeed duplicated genes in A. flavus. The authors appeared to use the A. flavus PTS strain as a parental strain for constructing the tor and tapA mutants. If so, the A. flavus CA14 strain (Hua et al., 2007) should be the parental wild-type strain for the A. flavus PTS strain. I did a BLAST search in NCBI for the torA (AFLA_044350) and tapA (AFLA_092770) genes using the most recent CA14 genome assembly sequence (GCA_014784225.2) and only found one allele for each gene: torA on chromosome 7 and tapA on chromosome 3. I could not find any other parts with similar sequences. Even in another popular A. flavus wild-type strain, NRRL3357, both torA and tapA exist as a single allele. Based on the published genome assembly data for A. flavus, there is no evidence to support the idea that tor and tapA exist as copies of each other. Therefore, the authors could perform a Southern blot analysis to further verify their claim. If torA and tapA indeed exist as duplicate copies in different chromosomal locations, Southern blot data could provide supporting results.

      Response 1: We thank the reviewer for their insightful observation. Based on the Southern blot analysis results presented in Figure 1, we have determined that torA and tapA are single-copy genes. Additionally, we repeated the protoplast transformation experiments several times, which revealed that both torA and tapA transformants exhibited ectopic mutations. It is plausible that deletion of the torA and tapA genes is lethal to A. flavus; this is consistent with previous studies conducted on the fungus Fusarium graminearum [1]. To ensure the rigor of the study, we have retracted the previously incorrect conclusion. We once again express our heartfelt appreciation to the experts for their valuable suggestions.

      Author response image 1.

      Fig.1 Southern blot hybridization analyses of WT, torA, and tapA transformants. (A) The structure diagram of the torA gene. (B) The structure diagram of the tapA gene. (C) Southern blot hybridization analyses of torA gene. (D) Southern blot hybridization analyses of tapA gene.

      Point 2: Second, the authors should consider the possibility of aneuploidy for their constructed mutants. When an essential gene is targeted for deletion, aneuploidy often occurs even in a fungal strain without the "ku" mutation, which results in seemingly dual copies of the gene. As the authors appear to use the A. flavus PTS strain having the "ku" mutation, the parental strain has increased genome instability, which may result in enhanced chromosomal rearrangements. So, it will be necessary to Illumina-sequence their tor and tapA mutants to make sure that they are not aneuploidy.

      Response 2: Thank you for your comment. Based on the sequencing results of the torA and tapA mutants, it was determined that the torA and tapA genes were still present in both mutants. In this case, it suggests that the torA and tapA genes may have undergone a genetic rearrangement or insertion at a different site in the mutant strains.

      Point 3: Furthermore, the genetic nomenclature +/- and -/- should be reserved for heterozygous and homozygous mutants in a diploid strain. As A. flavus is not a diploid strain, this type of description could cause confusion for the readers.

      Response 3: Thank you for your suggestion. We acknowledge your concerns about potential confusion caused by using this type of description, and we agree that it is best to avoid any misunderstandings for readers. Therefore, we have decided to remove this part of the content from the manuscript.

      Response to Reviewer 2 Comments (PublicReview)

      Point 1: However, findings have not been deeply explored and conclusions are mostly based on parallel phenotypic observations. In addition, there are some concerns for the conclusions.

      Response 1: We are grateful for the suggestion. We have conducted additional experiments and analyses to provide a more comprehensive understanding and to address the concerns raised.

      Response to Reviewer 3 Comments (PublicReview)

      Point 1: The paper by Li et al. describes the role of the TOR pathway in Aspergillus flavus. The authors tested the effect of rapamycin in WT and different deletion strains. This paper is based on a lot of experiments and work but remains rather descriptive and confirms the results obtained in other fungi. It shows that the TOR pathway is involved in conidiation, aflatoxin production, pathogenicity, and hyphal growth. This is inferred from rapamycin treatment and TOR1/2 deletions. Rapamycin treatment also causes lipid accumulation in hyphae. The phenotypes are not surprising as they have been shown already for several fungi. In addition, one caveat is in my opinion that the strains grow very slowly and this could cause many downstream effects. Several kinases and phosphatases are involved in the TOR pathway. They were known from S. cerevisiae or filamentous fungi. The authors characterized them as well with knock-out approaches.

      Response 1: Thank you for your comment. The role of the target of rapamycin (TOR) signaling pathway is of fundamental importance in the physiological processes of diverse eukaryotic organisms. Nevertheless, its precise involvement in regulating the developmental and virulent characteristics of opportunistic pathogenic fungi, such as A. flavus, has yet to be fully elucidated. Furthermore, the mechanistic underpinnings of TOR pathway activity specifically in A. flavus remain largely unresolved. Consequently, our study represents a significant contribution as the first comprehensive exploration of the conserved TOR signaling pathway encompassing a majority of its constituent genes in A. flavus.

      Response to Reviewer 1 Comments (Recommendations For The Authors)

      Point 1: In Table S3, the authors indicated that the Δku70 ΔniaD ΔpyrG::pyrG strain is A. flavus wild-type strain. However, this strain is not a wild-type strain because it seems like a control strain after introducing the pyrG gene into the A. flavus PTS strain (Δku70 ΔniaD ΔpyrG). So please indicate the real wild-type A. flavus strain name to help readers find out its original genome sequence data. Also, the reference for this Δku70 ΔniaD ΔpyrG::pyrG strain is "saved in our lab". This is not an eligible reference. If you use this control strain for the first time in this study, it should be described as "In this study". Otherwise, please indicate the proper reference for which the strain was first used.

      Response 1: Thank you for your valuable feedback on our manuscript. We appreciate your attention to detail and the opportunity to clarify the information regarding the strain in Table S3. The A. flavus CA14 strain, which produces aflatoxins and large sclerotia, was isolated from a pistachio bud in the Wolfskill Grant Experimental Farm (University of California, Davis, Winters, California, USA)[2]. The A. flavus CA14 strain is the parental wild-type strain for the A. flavus CA14 PTs (Δku70, ΔniaD, ΔpyrG) strain. The recipient strain CA14 PTs has been used satisfactorily in gene knockout and subsequent genetic complementation experiments[3]. In this study, the A. flavus CA14 PTs strain was used as the transformation recipient strain, and the control strain (Δku70, ΔniaD, ΔpyrG::pyrG) was created by introducing the pyrG gene into the A. flavus CA14 PTs strain. In previously published literature[4], this control strain (Δku70, ΔniaD, ΔpyrG::pyrG) was referred to as the wild-type strain; we therefore also refer to it as the wild-type strain in this study. As this control strain is indeed used in this study, we will revise the reference to "In this study". Once again, we appreciate your keen attention to detail and thank you for bringing these issues to our attention.

      Response to Reviewer 2 Comments (Recommendations For The Authors)

      Point 1: As in response: However, the tor gene in A. flavus exhibited varying copy numbers, as was confirmed by absolute quantification PCR at the genome level (Table S1). However, it is hard to understand Table S1: Estimation of copy number of tor gene in A. flavus. tor0 and sumo0 stand for the initial copy number, and the data are given as the mean ± 95% confidence limit. CN is copy number. As indicated in the Methods section, using the sumo gene as reference, the tor and tapA gene copy numbers were calculated by standard curve. In Table S1, for WT, the CN value of the tor gene is 1412537 compared to 1698243 in tor+/-; for the reference gene sumo, it is 794328 compared to 1584893. How could these data support the copy numbers of the tor gene?

      Response 1: Thank you for your suggestion. We understand the confusion with the data presented in Table S1 regarding the copy number estimation of the tor gene in A. flavus. We apologize for not providing a clear explanation for the data in the table. Quantitative real-time PCR (qPCR) is widely used to determine the copy number of a specific gene. It involves amplifying the gene of interest and a reference gene simultaneously using specific primers and probes. By comparing the amplification curves of the gene of interest and the reference gene, you can estimate the relative copy number of the gene.
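To make the reviewer's arithmetic concern concrete, the following minimal sketch normalizes the tor copy-number (CN) values quoted from Table S1 to the sumo reference. It assumes the simplest ratio-based normalization (target CN divided by reference CN); the function name `relative_copy_number` is hypothetical, and this is not necessarily the calculation used in the manuscript.

```python
def relative_copy_number(target_cn: float, reference_cn: float) -> float:
    """Ratio of target-gene copies to reference-gene copies (assumed normalization)."""
    if reference_cn <= 0:
        raise ValueError("reference copy number must be positive")
    return target_cn / reference_cn


# CN values as quoted by the reviewer from Table S1.
table_s1 = {
    "WT": {"tor": 1412537, "sumo": 794328},
    "tor+/-": {"tor": 1698243, "sumo": 1584893},
}

# Normalize the tor CN to the sumo reference for each strain.
ratios = {
    strain: relative_copy_number(values["tor"], values["sumo"])
    for strain, values in table_s1.items()
}

for strain, ratio in ratios.items():
    print(f"{strain}: tor/sumo = {ratio:.2f}")
```

Under this assumed normalization the ratio is about 1.78 for WT and about 1.07 for tor+/-. Neither value maps cleanly onto an integer copy number, which is consistent with the reviewer's difficulty in reading the table.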

      To address your concern and provide more accurate information, we have re-performed the copy number analysis using Southern blot. Southern blot analysis allows for the direct estimation of gene copy number by hybridizing genomic DNA with a specific probe for the gene. This method provides more reliable and accurate results in determining gene copy numbers. The Southern blot analysis results are presented in Figure 1.

      We appreciate your input and apologize for any confusion caused by the earlier presentation of the data.

      Point 2: In response: For the knockout of the FRB domain, we used the homologous recombination method, but because tor genes are double-copy genes, there are also double copies in the FRB domain. Despite our efforts, we encountered challenges in precisely determining the location of the other copy of the tor gene. I could not understand these consistent data, why not for using sequencing?

      Response 2: Thank you for your comment. We observed that the torA gene is a single copy. We removed this part of the results to avoid any ambiguity or potential misinterpretation.

      Point 3: Response in Due to the large number of genes involved, we did not perform a complementation experiment. If there were no complementation data, how to demonstrate data are solid?

      Response 3: Thank you for your important suggestion. We understand that complementation experiments are commonly used to validate gene deletions. Therefore, to ensure the reliability of our data, we have conducted supplementary experiments on specific gene deletions, such as ΔsitA-C and Δppg1-C. Thank you again for your positive comments and valuable suggestions to improve the quality of our manuscript.

      References:

      (1) Yu F, Gu Q, Yun Y, et al. The TOR signaling pathway regulates vegetative development and virulence in Fusarium graminearum. New Phytol. 2014; 203(1): 219-32.

      (2) Hua SS, Tarun AS, Pandey SN, Chang L, Chang PK. Characterization of AFLAV, a Tf1/Sushi retrotransposon from Aspergillus flavus. Mycopathologia. 2007 Feb;163(2):97-104.

      (3) Chang PK, Scharfenstein LL, Mack B, Hua SST. Genome sequence of an Aspergillus flavus CA14 strain that is widely used in gene function studies. Microbiol Resour Announc. 2019 Aug 15;8(33):e00837-19.

      (4) Zhu Z, Yang M, Yang G, Zhang B, Cao X, Yuan J, Ge F, Wang S. PP2C phosphatases Ptc1 and Ptc2 dephosphorylate PGK1 to regulate autophagy and aflatoxin synthesis in the pathogenic fungus Aspergillus flavus. mBio. 2023 Oct 31;14(5):e0097723.

    1. Author Response

      The following is the authors’ response to the original reviews.

      Reviewer #1 (Public Review)

      Summary:

      Huang and colleagues present a method for approximation of linkage disequilibrium (LD) matrices. The problem of computing LD matrices is the problem of computing a correlation matrix. In the cases considered by the authors, the number of rows (n), corresponding to individuals, is small compared to the number of columns (m), corresponding to the number of variants. Computing the correlation matrix has cubic time complexity, O(nm^2), which is prohibitive for large samples. The authors approach this using three main strategies:

      1. they compute a coarsened approximation of the LD matrix by dividing the genome into variant-wise blocks which statistics are effectively averaged over;

      2. they use a trick to get the coarsened LD matrix from a coarsened genomic relatedness matrix (GRM), which, with O(n^2 m) time complexity, is faster when n << m;

      3. they use the Mailman algorithm to improve the speed of basic linear algebra operations by a factor of log(max(m,n)). The authors apply this approach to several datasets.
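The second strategy rests on a standard linear-algebra identity: for a standardized genotype matrix Z (n samples by m variants), the mean of the squared entries of the LD matrix Z'Z/n equals the mean of the squared entries of the GRM ZZ'/m. The sketch below checks this numerically on simulated toy genotypes. It is an illustration of the identity only, not the X-LD implementation: it omits the block-wise coarsening and the Mailman algorithm, and all function names and the simulated data are assumptions made for the example.

```python
import random


def standardize_columns(genotypes):
    """Center and scale every variant (column) to mean 0 and variance 1."""
    n, m = len(genotypes), len(genotypes[0])
    Z = [[0.0] * m for _ in range(n)]
    for j in range(m):
        col = [genotypes[i][j] for i in range(n)]
        mean = sum(col) / n
        sd = (sum((x - mean) ** 2 for x in col) / n) ** 0.5
        for i in range(n):
            Z[i][j] = (col[i] - mean) / sd
    return Z


def mean_squared_ld(Z):
    """Direct route: average r^2 over all m*m variant pairs, O(n * m^2) work."""
    n, m = len(Z), len(Z[0])
    total = 0.0
    for j in range(m):
        for k in range(m):
            r = sum(Z[i][j] * Z[i][k] for i in range(n)) / n
            total += r * r
    return total / (m * m)


def mean_squared_grm(Z):
    """GRM route: average squared relatedness over all n*n sample pairs, O(n^2 * m) work."""
    n, m = len(Z), len(Z[0])
    total = 0.0
    for a in range(n):
        for b in range(n):
            k_ab = sum(Z[a][j] * Z[b][j] for j in range(m)) / m
            total += k_ab * k_ab
    return total / (n * n)


# Simulated toy genotypes (0/1/2 allele counts); n << m as in the paper's setting.
random.seed(1)
n_samples, n_variants = 5, 40
columns = []
while len(columns) < n_variants:
    col = [random.choice((0, 1, 2)) for _ in range(n_samples)]
    if len(set(col)) > 1:  # skip monomorphic variants, which cannot be standardized
        columns.append(col)
genotypes = [[columns[j][i] for j in range(n_variants)] for i in range(n_samples)]

Z = standardize_columns(genotypes)
ld_route = mean_squared_ld(Z)
grm_route = mean_squared_grm(Z)
print(ld_route, grm_route)
```

The two printed values agree up to floating-point error, while the GRM route loops over sample pairs rather than variant pairs, which is where the saving comes from when n << m.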

      Strengths:

      The authors demonstrate that their proposed method performs in line with theoretical explanations.

      The coarsened LD matrix is useful for describing global patterns of LD, which do not necessarily require variant-level resolution.

      They provide an open-source implementation of their software.

      Weaknesses:

      The coarsened LD matrix is of limited utility outside of analyzing macroscale LD characteristics. The method still essentially has cubic complexity--albeit the factors are smaller and Mailman reduces this appreciably. It would be interesting if the authors were able to apply randomized or iterative approaches to achieve more fundamental gains. The algorithm remains slow when n is large and/or the grid resolution is increased.

      Thanks for your positive and accurate evaluation! We acknowledge the weakness and have included some sentences in the Discussion.

      “The weakness of the proposed method is obvious that the algorithm remains slow when the sample size is large or the grid resolution is increased. With the availability of such as UK Biobank data (Bycroft et al., 2018), the proposed method may not be adequate, and much advanced methods, such as randomized implementation for the proposed methods, are needed.”  

      Reviewer #2 (Public Review)

      Summary:

      In this paper, the authors point out that the standard approach of estimating LD is inefficient for datasets with large numbers of SNPs, with a computational cost of O(nm^2), where n is the number of individuals and m is the number of SNPs. Using the known relationship between the LD matrix and the genomic-relatedness matrix, they can calculate the mean level of LD within the genome or across genomic segments with a computational cost of O(n^2 m). Since in most datasets, n<<m, this can lead to major computational improvements. They have produced software written in C++ to implement this algorithm, which they call X-LD. Using the output of their method, they estimate the LD decay and the mean extended LD for various subpopulations from the 1000 Genomes Project data.

      Strengths:

      Generally, for computational papers like this, the proof is in the pudding, and the authors appear to have been successful at their aim of producing an efficient computational tool. The most compelling evidence of this in the paper is Figure 2 and Supplementary Figure S2. In Figure 2, they report how well their X-LD estimates of LD compare to estimates based on the standard approach using PLINK. They appear to have very good agreement. In Figure S2, they report the computational runtime of X-LD vs PLINK, and as expected X-LD is faster than PLINK as long as it is evaluating LD for more than 8000 SNPs.

      Weakness:

      While the X-LD software appears to work well, I had a hard time following the manuscript enough to make a very good assessment of the work. This is partly because many parameters used are not defined clearly or at all in some cases. My best effort to intuit what the parameters meant often led me to find what appeared to be errors in their derivation. As a result, I am left worrying if the performance of X-LD is due to errors cancelling out in the particular setting they consider, making it potentially prone to errors when taken to different contexts.

      Thank you for your critical reading and evaluation. We apologize for the typos, which have now been corrected, and the symbols are now clearly defined (see Eq 1 and Table 1). In addition, we have included more detailed mathematical steps, which explain how the LD decay regression is constructed and how its interpretation follows (see the detailed derivation steps between Eq 3 and Eq 4).

      Impact:

      I feel like there is value in the work that has been done here if there were more clarity in the writing. Currently, LD calculations are a costly step in tools like LD score regression and Bayesian prediction algorithms, so a more efficient way to conduct these calculations would be useful broadly. However, given the difficulty I had following the manuscript, I was not able to assess when the authors’ approach would be appropriate for an extension such as that.

      See our replies below in responding to your more detailed questions.

      Reviewer #1 (Recommendations For The Authors)

      There are numerous linguistic errors throughout, making it challenging to read.

      It is unclear how the intercepts were chosen in Figure S2. Since theory only gives you the slopes, it seems like it would make more sense to choose the intercept such that it aligns with the empirical results in some way.

      Thank you for your critical evaluation. We apologize for the typos; we have read the manuscript through and clarified the text as much as possible. In addition, we have included Table 1, which introduces the mathematical symbols used in the paper.

      In Figure S2, the two algorithms being compared have different software implementations, PLINK vs X-LD. Their real performance depended not only on the time complexity of the algorithms (right-side y-axis) but also on how the software was coded. PLINK is known for its excellent programming. If we could have programmed as well as Chris Chang, the performance of X-LD would have been even better and would approach the ratio m/n. However, even with less skilled programming, X-LD outperformed PLINK.

      Reviewer #2 (Recommendations For The Authors):

      Thank you for the chance to review your manuscript. It looks like compelling work that could be improved by greater detail. Providing the level of detail necessary may require creating a Supplementary Note that does a lot of hand-holding for readers like me who are mathematically literate but who don’t have the background that you do. Then you can refer readers to the Supplement if they can’t follow your work.

      We have fixed the problems and style issues as far as we can.

      Regarding the weakness section in the public review, here are a few examples of where I got confused, though this list is not exhaustive.

      1) Consider Equation 1 (line 100), which I believe must be incorrect. Imagine that g consists of two SNPs on different chromosomes with correlation rho. Then ell_g (which is defined as the average squared elements of the correlation matrix) would be

      ell_g = 1/4 (1 + 1 + rho^2 + rho^2) = (1+rho^2)/2.

      But ell_1=1 and ell_2=1 and ell_12=rho^2 (The average squared elements of the chromosome-specific correlation matrices and the cross-chromosome correlation matrix, respectively). So

      sum(ell_i)+sum(ell_ij) = 1 + 1 + rho^2 + rho^2 = (1+rho^2)*2.

      I believe your formulas would hold if you defined your LD values as the sum of squared correlations instead of the mean, but then I don’t know if the math in the subsequent sections holds. I think this problem also holds for Eq 2 and therefore makes Eqs 3 and 4 difficult to interpret.
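The reviewer's two-SNP counterexample can be checked numerically. The sketch below builds the 2×2 matrix of squared correlations for two SNPs on different chromosomes and compares the mean-based and sum-based definitions of ell_g against the total of the per-block terms. The function name and the value rho = 0.3 are arbitrary choices for illustration.

```python
def compare_definitions(rho):
    """Return (mean-based ell_g, sum-based ell_g, total of per-block terms)."""
    # Squared correlation matrix for two SNPs with correlation rho.
    r2 = [[1.0, rho ** 2],
          [rho ** 2, 1.0]]
    entries = [value for row in r2 for value in row]

    ell_g_mean = sum(entries) / len(entries)  # mean of squared correlations
    ell_g_sum = sum(entries)                  # sum of squared correlations

    # Per-block terms: each chromosome holds one SNP, so ell_1 = ell_2 = 1,
    # and each of the two cross-chromosome blocks contributes ell_12 = rho^2.
    block_total = 1.0 + 1.0 + rho ** 2 + rho ** 2
    return ell_g_mean, ell_g_sum, block_total


mean_def, sum_def, block_total = compare_definitions(0.3)
print(mean_def, sum_def, block_total)
```

With rho = 0.3 the mean-based value is (1 + 0.09)/2 = 0.545, whereas the block total is 2.18 and matches only the sum-based definition, which is the discrepancy the reviewer describes.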

      Thanks for your attentive review and invaluable suggestions. We acknowledge the typo in calculating the mean in Eq 1, resulting in difficulties in understanding the equations. We sincerely apologize for this oversight. To address this issue and ensure clarity in the interpretation of Eq 3 and Eq 4, we have provided more detailed explanations (see the derivation between Eq 3 and Eq 4).

      2) I didn’t know what the parameters are in Equation 3. The vector ell needs to be defined. Is it the vector of ell_i for each chromosomal segment i? I’m also confused by the definition of m_i, which is defined on line 113 as the “SNP number of the i-th chromosome.” Do the authors mean the number of SNPs on the i-th chromosomal segment? If so, it wasn’t clear to me how Eq 2 and Eq 3 imply Eq 4. Further, it wasn’t clear to me why E(b1) quantifies the average LD decay of the genome. I’m used to seeing plots of average LD as a function of distance between SNPs to calculate this, though I’m admittedly not a population geneticist, so maybe this is standard. Standard or not, readers deserve to have their hands held a bit more through this either in the text or in a Supplementary Note.

      Thanks for your insightful feedback. When we were writing this paper, our actual focus was Eq 3: establishing the relationship between chromosomal LD and the reciprocal of the chromosome length (Fig 6A), with length surrogated by the number of SNPs, that is, the correlation between ell_i and 1/m_i.

      We asked friends who are population geneticists, and they anticipated the correlation between chromosomal LD (ell) and 1/m. The rationale is simple if one knows the basics of population genetics. A long chromosome experiences more recombination, which weakens LD between a pair of loci. In particular, for a pair of loci, D_t = D_0 (1-c)^t, where D_t is the LD at generation t, D_0 is the LD at generation 0, and c is the recombination fraction. As recombination hotspots are nearly evenly distributed along the genome, as reported in Science 2019;363:eaau8861, the chromosome will be broken into the shape shown in Author response image 1 (Fig 1C, newly added). Along the diagonal one sees tight LD blocks, which will vanish over time as predicted by the D_t equation, and loci far away from each other will not be in LD unless LD is raised by, for example, population structure. Ideally, we assume diagonal blocks of average size m×m, in which the average LD of a SNP with the other SNPs inside the diagonal block (red) is l_u; in contrast, the off-diagonal average LD (light red) is l_uv. This logic is hidden but employed in, for example, LD score regression and PRS refinement using LD structure.

      Author response image 1.
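The decay relation D_t = D_0 (1-c)^t quoted above can be illustrated with a short sketch. The starting value D_0 = 0.25 and the recombination fractions below are arbitrary illustrative choices, not values from the manuscript, and the function name is hypothetical.

```python
def ld_after_generations(d0, c, t):
    """LD after t generations: D_t = D_0 * (1 - c)**t, with recombination fraction c."""
    if not 0.0 <= c <= 0.5:
        raise ValueError("recombination fraction must lie in [0, 0.5]")
    if t < 0:
        raise ValueError("number of generations must be non-negative")
    return d0 * (1.0 - c) ** t


# Tightly linked loci (small c) keep their LD far longer than loosely linked ones.
for c in (0.001, 0.01, 0.5):
    trajectory = [ld_after_generations(0.25, c, t) for t in (0, 10, 100)]
    print(f"c = {c}: {[round(d, 4) for d in trajectory]}")
```

The larger the recombination fraction, the faster LD vanishes, which is the intuition behind expecting weaker average LD on longer chromosomes.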

      But how does one estimate chromosomal LD (ell)? Our friends described this as overwhelming. So Figure 6A is logically anticipated by a seasoned population geneticist but has never been realized because the computation is a nightmare. Such signature patterns could have been used as showcases when releasing new reference data, such as HapMap. However, to our knowledge, this signature linear relationship has never been illustrated in those reference data.

      If you further ask a population geneticist whether any chromosome will deviate from this line (Fig 6A), the answer will most likely be chromosome 6 because of the tight-LD HLA region. However, it is chromosome 11, because it has the most completely sequenced centromere. Chr 11 is a surprise! We predict that, in a T2T-sequenced population, Chr 11 will not deviate much.

      However, because we doubted whether readers would appreciate this point, we shifted our focus to the efficient computation of LD, which is more likely to be understood. We acknowledge the lack of clarity in notation definitions and the absence of the derivation for the interpretation of b1 and b0 for LD decay regression. So, we have added a table to provide an explanation of the notation (see Table 1) and provided additional derivations, which explain how the LD decay regression was derived (see the derivation between Eq 3 and Eq 4). Figure 1C illustrates the underlying assumption about LD.

      The technique to bridge Eq 2~3 to Eq 4 is called “building interpretation”. It was once one of the core tasks of population genetics and statistical genetics, and a classical example is Haseman-Elston regression (Behavior Genetics, 1972, 2:3-19). As the field moves towards a data-driven style, the culture becomes “shut up, calculate”. Finding the interpretation of a regression is a vanishing craftsmanship, and people often end up with unclear results!

      3) In line 135, it’s not clear to me what is meant by . If it is , then wouldn’t the resulting matrix be a matrix of zeros since is zero everywhere except the lower off-diagonal? So maybe it is ? But then later in that line, you say that the square of this matrix is the sum of several terms of the form . Are these the scalar elements of the G matrix? But then the sum is a scalar, which can’t be true since is a matrix.

      Thanks for your attentive review. We indeed confused the definition of matrices and their elements, and should refer to the stacked off-diagonal elements of matrix . So, is a vector for variable – the relationship between sample i and j. We assume the reviewer use R software, then corresponds to mean .

      See the text between Eq 5 and Eq 6.

      “We extract two vectors , which stacks the off-diagonal elements of , and , which takes the diagonal elements of .”

      In addition, , so the ground truth is that , but not zero.

      To clarify these math symbols, we replace G with K, so as to be consistent with our other works (see Table 1).
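As a rough illustration of the distinction between a matrix and the vectors extracted from it, the sketch below splits a small symmetric relatedness matrix K into its stacked off-diagonal elements and its diagonal elements. It assumes that "stacking" means taking each sample pair once from the lower triangle; the toy values and function names are hypothetical and not taken from the manuscript.

```python
def split_relatedness_matrix(K):
    """Return (off_diagonal, diagonal) element lists of a square matrix K."""
    n = len(K)
    if any(len(row) != n for row in K):
        raise ValueError("K must be square")
    diagonal = [K[i][i] for i in range(n)]
    # For a symmetric K, the lower triangle holds each sample pair (i, j), i > j, once.
    off_diagonal = [K[i][j] for i in range(n) for j in range(i)]
    return off_diagonal, diagonal


def mean(values):
    """Arithmetic mean of a non-empty list."""
    return sum(values) / len(values)


# Toy 3-sample relatedness matrix (illustrative values only).
K_toy = [
    [1.02, 0.10, -0.05],
    [0.10, 0.98, 0.02],
    [-0.05, 0.02, 1.00],
]

off_diag, diag = split_relatedness_matrix(K_toy)
print("off-diagonal:", off_diag, "mean:", mean(off_diag))
print("diagonal:", diag, "mean:", mean(diag))
```

The off-diagonal vector has n(n-1)/2 entries and the diagonal vector has n entries, so means and variances computed from them are scalars, while K itself remains a matrix.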

      To derive the means and the sampling variances for and , the Eq 7 can be established by some modifications on the Delta method as exampled in Appendix I of Lynch and Walsh’s book (Lynch and Walsh, 1998). We added this sentence near Eq 7 in the main text.

    1. Author response:

      The following is the authors’ response to the previous reviews.

      Recommendations for the authors:

      Please make corrections as suggested by reviewer 1 to improve the manuscript. Specifically, reviewer 1 suggests making changes to p values in Figure 5, and the importance of citing original scholarly works related to effects of increase in excitability of sympathetic neurons by M1 receptors, and the terminology for M currents and KCNQ currents. These changes will improve the manuscript and are strongly recommended.

      The section dealing with Aging Reduces KCNQ currents seems to contain a lot of extraneous information especially in the last part of the long paragraph and this section should be rewritten for improved clarity and - the implications or lack thereof - of the correlation of KCNQ with AP firing rates. The apparent lack of correlation between KCNQ current and KCNQ2 protein needs to be better explained. This is a central part of the study and this result undercuts the premise of the paper. Additionally, the poor specificity of linopirdine for KCNQ should be pointed out in the limitations.

      Finally, the editor notes that the author response should not contain ambiguities in what was addressed in the revision. In the original summary of consolidated revisions that were requested, one clearly and separately stated point (point 4) was that experiments in slice cultures should be strongly considered to extend the significance of the work to an intact brain preparation. The author response letter seems to imply that this was done, but this is not the case. The author response seems to have combined this point with another separate point (point 3) about using KCNQ drugs, and imply that all concerns were addressed. Authors should be clear about what revisions were in fact addressed.

      Summary of recommendations from the three reviewers:

      Please make corrections as suggested by reviewer 1 to improve the manuscript.

      Specifically, reviewer 1 suggests making changes to p values in Figure 5,

      As a team, we have decided to keep p values. Here is our rationale:

      Our lab favors reporting p-values for all statistical comparisons to help readers identify what we consider statistically significant. We color-coded the p-values, with red for p-value < 0.05 and black for p-value > 0.05. As a reader, seeing a p-value = 0.7 tells me that the authors performed an analysis comparing these conditions and found the means not to be different. Not presenting the p-value makes me wonder whether the authors even analyzed those groups. We value the ability to assess the data by seeing all p-values more than we worry about being distracted by non-significant ones.

      and the importance of citing original scholarly works related to effects of increase in excitability of sympathetic neurons by M1 receptors, and the terminology for M currents and KCNQ currents. These changes will improve the manuscript and are strongly recommended.

      We cited original papers in that area and changed the terminology for M current. I kept KCNQ when referring to the channel protein or its abundance.

      The section dealing with Aging Reduces KCNQ currents seems to contain a lot of extraneous information especially in the last part of the long paragraph and this section should be rewritten for improved clarity… and - the implications or lack thereof - of the correlation of KCNQ with AP firing rates.

      I separated the long paragraph in two. I also removed extraneous information in that section. It now reads:

      Previous work by our group and others demonstrated that cholinergic stimulation leads to a decrease in M current and increases the excitability of sympathetic motor neurons at young ages.67-71 The molecular determinants of the M current are channels formed by KCNQ2 and KCNQ3 in these neurons.70, 76, 77 Thus, Figure 6A shows a voltage response (measured in current-clamp mode) and a consecutive M current recording (measured in voltage-clamp mode) in the same neuron upon stimulation of cholinergic type 1 muscarinic receptors. It illustrates the temporal correlation between the decrease of M current with the increase in excitability and firing of APs. This strong dependence led us to hypothesize that aging decreases M current, leading to a depolarized RMP and hyperexcitability (Figure 6B). For these experiments, we measured the RMP and evoked activity using perforated patch, followed by the amplitude of M current using a whole-cell voltage clamp in the same cell. We also measured the membrane capacitance as a proxy for cell size. Interestingly, M current density was smaller by 29% in middle age (7.5 ± 0.7 pA/pF) and by 55% in old (4.8 ± 0.7 pA/pF) compared to young (10.6 ± 1.5 pA/pF) neurons (Figure 6C-D). The average capacitance was similar in young (30.8 ± 2.2 pF), middle-aged (27.4 ± 1.2 pF), and old (28.8 ± 2.3 pF) neurons (Figure 6E), suggesting that aging is not associated with changes in cell size of sympathetic motor neurons, and supporting the hypothesis that aging alters the levels of M current. Next, we tested the effect on the abundance of the channels mediating M current. Contrary to our expectation, we observed that KCNQ2 protein levels were 1.5 ± 0.1 -fold higher in old compared to young neurons (Figure 6F-G). Unfortunately, we did not find an antibody to detect consistently KCNQ3 channels. We concluded that the decrease in M current is not caused by a decrease in the abundance of KCNQ2 protein.
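The percentage reductions quoted in this paragraph follow directly from the reported mean M current densities. A minimal arithmetic check, using only the mean values stated above (the variable and function names are illustrative):

```python
def percent_reduction(reference, value):
    """Percentage by which value is smaller than reference."""
    return 100.0 * (reference - value) / reference


# Mean M current densities (pA/pF) as reported in the paragraph above.
density = {"young": 10.6, "middle age": 7.5, "old": 4.8}

reductions = {
    group: percent_reduction(density["young"], value)
    for group, value in density.items()
    if group != "young"
}

for group, reduction in reductions.items():
    print(f"{group}: {reduction:.0f}% smaller than young")
```

Because the mean capacitance is similar across age groups, the drop in current density reflects a drop in M current itself rather than a change in cell size, which is the argument the paragraph makes.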

      B. and - the implications or lack thereof - of the correlation of KCNQ with AP firing rates.

      I am not sure I understand the request concerning the correlation of KCNQ with AP firing rate. I divided the long paragraph.

      The apparent lack of correlation between KCNQ current and KCNQ2 protein needs to be better explained. This is a central part of the study and this result undercuts the premise of the paper.

      Indeed, total KCNQ2 protein abundance increases while M current decreases. We do not claim in our work that changes in excitability are caused by a reduction in the expression or density of KCNQ2 channels. On the contrary, our current working hypothesis is that the reduction in M current is caused by changes in trafficking, degradation, posttranslational modifications, or cofactors for KCNQ2 or KCNQ3 channels. I have modified the description in the results section and discussion to clarify this concept. We also note that the discussion section contains a paragraph discussing this discrepancy.

      Additionally, the poor specificity of linopirdine for KCNQ should be pointed out in the limitations.

      Thank you for the suggestion. I have added the following sentences to the Limitations section. It reads: “We want to point out that linopirdine has been reported to affect other ionic currents besides M current (Neacsu and Babes, 2010; Lamas et al., 1997). Despite this limitation, the application of linopirdine to young sympathetic motor neurons led to depolarization and firing of action potentials.”

      Finally, the editor notes that the author response should not contain ambiguities in what was addressed in the revision. In the original summary of consolidated revisions that were requested, one clearly and separately stated point (point 4) was that experiments in slice cultures should be strongly considered to extend the significance of the work to an intact brain preparation. The author response letter seems to imply that this was done, but this is not the case. The author response seems to have combined this point with another separate point (point 3) about using KCNQ drugs, and imply that all concerns were addressed. Authors should be clear about what revisions were in fact addressed.

      We apologize for this omission. After reviewing this comment, I realized I did not respond to the Major points in the section of the Recommendations for the authors from Reviewer 3. We missed that entire section. Our previous responses addressed the Public review of Reviewer 3. When doing so, we did not separate the sentences, omitting the request to perform the experiment in slices.

      The proposed experiments will require an upright microscope coupled to an electrophysiology rig; unfortunately, we do not have the equipment to do these experiments. We agree that our findings need to be tested in intact preparations to understand how the hyperactivity of sympathetic motor neurons affects systemic responses and the control of organ function. This is a crucial step to move the field forward. Our laboratory is trying to find the appropriate experimental design to address this problem. We believe we must go beyond redoing these experiments in slices.

      Reviewer #1 (Recommendations For The Authors):

      (1) The significance values greater than p < 0.05 do not add anything and distract focus from the results that are meaningful. Fig. 5 is a good example. What does p = 0.7 mean? Or p = 0.6? Does this help the reader with useful information?

      We thank Reviewer 1 for raising this question. We have attempted different versions of how we report p values, as we want to make sure to address rigor and transparency in reporting data.

      Our lab favors reporting p-values for all statistical comparisons to help readers identify what we consider statistically significant. We color-coded the p-values, with red for p-value < 0.05 and black for p-value > 0.05. As a reader, seeing a p-value = 0.7 allows me to know that the authors performed an analysis comparing these conditions and found that the means did not differ. Not presenting the p-value makes me wonder whether the authors even analyzed those groups. We value being able to assess the data by seeing all p-values more than we value avoiding the distraction of non-significant ones.

      (2) Fig. 1 is not informative and should be removed.

      Although we agree with the reviewer that this figure is not informative, it was created to guide the reader in identifying the problem addressed in our manuscript in the physiological context. Our colleagues who read the first drafts of the manuscript recommended this, so we prefer to keep the figure.

      (3) The emphasis on a particular muscarinic agonist favored by many ion channel physiologists, oxotremorine, is not meaningful (lines 192, 198). The important point is stimulation of muscarinic AChRs, which physiologically are stimulated by acetylcholine. The particular muscarinic agonist used is unimportant. Unless mandated by eLife, "cholinergic type 1 muscarinic receptors" are usually referred to as M1 mAChRs, or even better is "Gq-coupled M1 mAChRs." I don't think that Kruse and Whitten, 2021 were the first to demonstrate the increase in excitability of sympathetic neurons from stimulation of M1 mAChRs. Please try and cite in a more scholarly fashion.

      A) We have modified lines 192 and 198, removing the mention of oxotremorine.

      B) We have modified the nomenclature used to refer to cholinergic type 1 muscarinic receptors.

      C) We cited references on the role of M current on sympathetic motor neuron excitability.

      (4) The authors may want to use the term "M current" (after defining it) as the current produced by KCNQ2&3-containing channels in sympathetic neurons, and reserve "KCNQ" or "Kv7" currents as those made by cloned KCNQ/Kv7 channels in heterologous systems. A reason for this is to exclude currents from KCNQ1-containing channels, which most definitely do not contribute to the "KCNQ" current in these cells. I am not mandating this, but rather suggesting it to conform with the literature.

      Thank you for the suggestion. I have modified the text to use the term M current. I maintained the use of KCNQ only when referring to KCNQ channel, such as in the section describing the abundance of KCNQ2.

      (5) The section in the text on "Aging reduces KCNQ current" is confusing. Can the authors describe their results and their interpretation more directly?

      (6) Please explain the meaning of the increase in KCNQ2 abundance with age in Fig. 6G. How is this increase in KCNQ2 expression consistent with an increase in excitability? The explanation of "The decrease in KCNQ current and the increase in the abundance of KCNQ2 protein suggest a potential compensatory mechanism that occurs during aging, which we are actively investigating in an independent study." is rather odd, considering that the entire thesis of this paper is that changes in excitability and firing properties are underlied by changes in KCNQ2/3 channel expression/density. Suddenly, is this not the case?? What about KCNQ3? It would be very enlightening if the authors would just quantify the ratio of KCNQ2:KCNQ3 subunits in M-type channels in young and old mice using simple TEA dose/response curves (see Shapiro et al., JNS, 2000; Selyanko et al., J. Physiol., Hadley et al., Br. J. Pharm., 2001 and a great many more). It is also surprising that the authors did not assess or probe for differences in mAChR-induced suppression of M current between SCG neurons of young and old mice. This would seem to be a fundamental experiment in this line of inquiry.

      We have divided this paragraph into sections.

      A. Please explain the meaning of the increase in KCNQ2 abundance with age in Fig. 6G. How is this increase in KCNQ2 expression consistent with an increase in excitability? The explanation of "The decrease in KCNQ current and the increase in the abundance of KCNQ2 protein suggest a potential compensatory mechanism that occurs during aging, which we are actively investigating in an independent study." is rather odd, considering that the entire thesis of this paper is that changes in excitability and firing properties are underlied by changes in KCNQ2/3 channel expression/density. Suddenly, is this not the case??

      Our interpretation is that the decrease in M current is not caused by a decrease in the abundance of KCNQ2 channels. We do not claim that changes in excitability are caused by a reduction in the expression or density of KCNQ2 channels. On the contrary, our working hypothesis is that the reduction in M current is caused by changes in traffic, degradation, posttranslational modifications, or cofactors for KCNQ2 or KCNQ3 channels. We have modified the description in the results section to clarify this concept: “We concluded that the decrease in M current is not caused by a decrease in the abundance of KCNQ2 protein.”

      B. What about KCNQ3?

      Unfortunately, we did not find an antibody to detect KCNQ3 channels. I have added a sentence to state this.

      C. KCNQ2: KCNQ3 subunits in M-type channels in young and old mice using simple TEA dose/response curves.

      Our laboratory is working toward a deeper understanding of the mechanism behind the changes in M current and its regulation by mAChRs at young and old ages. However, that work is part of a separate study designed to address the complexity of the question. We think pharmacology experiments alone are insufficient to resolve it, as we describe in the next answer.

      D. It is also surprising that the authors did not assess or probe for differences in mAChR-induced suppression of M current between SCG neurons of young and old mice. This would seem to be a fundamental experiment in this line of inquiry.

      As mentioned, our laboratory is working toward a deeper understanding of the mechanism behind M current and its regulation at young and old ages. Our preliminary data show that M currents recorded in old neurons respond to mAChR activation in one of two ways: 1) they do not respond (blue line), or 2) they show a smaller and slower current inhibition than young neurons (red line). These data illustrate the complexity of the mechanism behind the M current in old neurons, where changes in basal levels of PIP2, phospholipid metabolism, KCNQ2/3 traffic and degradation, and M current pharmacology need to be addressed together for a proper interpretation. Showing only one part of this set of experiments in this article may lead to misinterpretation of the results.

      Author response image 1.

      (7) Why do the authors use linopirdine instead of XE-991? Both are dirty drugs hardly specific to KCNQ channels at 25 uM concentrations, but linopirdine less so. The Methods section lists the source of XE991 used in the study, not linopirdine. Is there an error?

      A. Why do the authors use linopirdine instead of XE-991?

      We used linopirdine with the experimental goal of observing the recovery phase during the washout. The main difference between the effects of XE991 and linopirdine on Kv7.2/3 lies in the recovery phase. Currents under XE991 treatment recover 30% after 10 min, compared with 93.4% with linopirdine, in expression systems at -30 mV (Greene DL et al., 2017, J Pharmacol Exp Ther). After validating KCNQ2/3 inhibition by linopirdine (IC50 value of 2.4 µM), we found linopirdine the most appropriate drug for our experiments.

      Unfortunately, we were not able to observe a recovery in our experiments. The limited recovery after washout may be associated with the membrane potential of our conditions (-60 to -50 mV).

      B. Both are dirty drugs hardly specific to KCNQ channels at 25 uM concentrations, but linopirdine less so.

      We understand the concern of the reviewer. The specificity of XE-991 and linopirdine is not absolute. Linopirdine has been reported to activate TRPV1 channels (EC50 = 115 µM, Neacsu and Babes, 2010, J Pharmacol Sci) or nicotinic acetylcholine receptors and GABA-induced Cl- currents (EC50 = 7.6 µM and 8.1 µM, respectively; Lamas et al., 1997, Eur J Neurosci).

      To clarify this limitation in the article, we have added the following sentence in the section Limitations and Conclusions. “We want to point out that linopirdine has been reported to affect other ionic currents besides M current (Neacsu and Babes, 2010; Lamas et al., 1997). Despite this limitation, the application of linopirdine to young sympathetic motor neurons led to depolarization and firing of action potentials.”

      C. The Methods section lists the source of XE991 used in the study, not linopirdine. Is there an error?

      Thank you for pointing this out. We have added information for both retigabine and linopirdine in the Methods section; both were missing.

      (8) Can the authors use a more scientific explanation of RTG action than "activating KCNQ channels?" For instance, RTG induces both a negative shift in the voltage dependence of activation and a voltage-independent increase in the open probability, both of which differ in detail between KCNQ2 and KCNQ3 subunits. The authors are free to use these exact words. Thus, the degree of "activation" is very dependent upon voltage at any voltages negative to the saturating voltages for channel activation.

      We have modified the text to reflect your suggestion. Thank you.

      (9) Methods: did the authors really use "poly-l-lysine-coated coverslips?" Almost all investigators use poly-D-lysine as a coating for mammalian tissue-culture cells and more substantial coatings such as poly-D-lysine + laminin or rat-tail collagen for peripheral neurons, to allow firm attachment to the coverslip.

      That is correct. We used poly-L-lysine-coated coverslips. Sympathetic motor neurons do not adhere to poly-D-Lysine.

      (10) As a suggestion, sampling M-type/KCNQ/Kv7 current at 2 kHz is not advised, as this is far faster than the gating kinetics of the channels. Were the signals filtered?

      Signals were not filtered. Currents were sampled at 2 kHz. Our conditions are not far from those reported by others: some sample at 10 kHz or even 50 kHz, and others do not report the sampling frequency.

      Reviewer #2:

      Weaknesses:

      None, the revised version of the manuscript has addressed all my concerns.

      We are very appreciative and glad that our responses satisfied your previous concerns.

      Reviewer #3:

      The main weakness is that this study is a descriptive tabulation of changes in the electrophysiology of neurons in culture, and the effects shown are correlative rather than establishing causality.

      In the previous revision, Reviewer 3 wrote: “It is difficult to know from the data presented whether the changes in KCNQ channels are in fact directly responsible for the observed changes in membrane excitability.” And suggested the “use of blockers and activators to provide greater relevance.”

      In response to this recommendation, we performed the experiments shown in Fig. 8. In young neurons, linopirdine depolarized the membrane potential and promoted action potential firing. In contrast, in old neurons, retigabine repolarized the membrane potential and stopped action potential firing. This new set of experiments suggests that the age-related electrophysiological changes in old neurons are associated with changes in M current, which is the main finding of our article.

      If Reviewer 3 refers to establishing causality between aging and a reduction in M current, I would like to emphasize that our laboratory is working toward a better understanding of the molecular mechanism of how M current is affected by aging; however, it will be part of a different article. One of our attempts was to reverse aging with rapamycin, but the previous recommendation was to remove those experiments.

      … but the specifics of the effects and relevance to intact preparations are unclear.

      Additional experiments in slice cultures would provide greater significance on the potential relevance of the findings for intact preparations.

      I apologize for missing this point in the previous revision. The proposed experiments will require an upright microscope coupled to an electrophysiology rig. Unfortunately, I do not have the equipment to do these experiments.

    1. Author response:

      The following is the authors’ response to the previous reviews.

      Reviewer #1:

      Section 4.3 ("expert baseline model"): the authors need to explain exactly how the probabilities defined as baselines were used to predict individual patient susceptibility profiles.

      We have added a more detailed and mathematically formal explanation of the “simulated expert’s best guess” in Section 4.3.

      This section now reads:

      “More formally, considering all training spectra as S_train, all training labels corresponding to one drug j and species t are gathered:

      The "simulated expert's best guess" predicted probability for any spectrum s_i and drug d_j, then, corresponds to the fraction of positive labels in their corresponding training label set:
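
      As a rough illustration of the baseline described in this passage, the following minimal sketch computes, for every (species, drug) pair, the fraction of positive training labels and uses it as the predicted probability. The function name, the record layout, and the toy species/drug values are hypothetical and are not taken from the manuscript.

```python
from collections import defaultdict


def fit_expert_baseline(train_records):
    """Return {(species, drug): fraction of positive training labels}.

    train_records is an iterable of (species, drug, label) tuples,
    where label is 1 (positive) or 0 (negative). This layout is an
    assumption made for illustration only.
    """
    counts = defaultdict(lambda: [0, 0])  # key -> [n_positive, n_total]
    for species, drug, label in train_records:
        counts[(species, drug)][0] += label
        counts[(species, drug)][1] += 1
    return {key: pos / total for key, (pos, total) in counts.items()}


# Toy training set (hypothetical values).
train = [
    ("E. coli", "ciprofloxacin", 1),
    ("E. coli", "ciprofloxacin", 0),
    ("E. coli", "ciprofloxacin", 0),
    ("E. coli", "ciprofloxacin", 1),
    ("S. aureus", "ciprofloxacin", 1),
]
baseline = fit_expert_baseline(train)

# Every test spectrum of a given species receives the same probability
# for a given drug, regardless of the spectrum itself.
print(baseline[("E. coli", "ciprofloxacin")])
```

      Under this reading, the baseline ignores the spectrum entirely and depends only on the species label and the drug.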

      Authors should explain in more detail how a ROC curve is generated from a single spectrum (i.e., per patient) and then average across spectra. I have an idea of how it's done but I am not completely sure.

      We have added a more detailed explanation in Section 3.2. It reads:

      To compute the (per-patient average) ROC-AUC, for any spectrum/patient, all observed drug resistance labels and their corresponding predictions are gathered. Then, the patient-specific ROC-AUC is computed on that subset of labels and predictions. Finally, all ROC-AUCs per patient are averaged to a "spectrum-macro" ROC-AUC.
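
      The per-patient averaging described here can be sketched as follows. This is a minimal pure-Python illustration, not the authors' code: the data layout and names are hypothetical, and skipping patients whose labels are all of one class (for whom the ROC-AUC is undefined) is an assumption.

```python
def roc_auc(labels, scores):
    """ROC-AUC as the probability that a positive outranks a negative.

    Ties count as half a win. Returns None when only one class is
    present, because the AUC is undefined in that case.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        return None
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def spectrum_macro_auc(per_patient):
    """Average the patient-specific ROC-AUCs.

    per_patient maps a patient id to (labels, scores) over the drugs
    observed for that patient. Patients with an undefined AUC are
    skipped (an assumption of this sketch).
    """
    aucs = [roc_auc(labels, scores) for labels, scores in per_patient.values()]
    aucs = [a for a in aucs if a is not None]
    return sum(aucs) / len(aucs)


# Toy data (hypothetical): two patients with four and three observed drugs.
per_patient = {
    "patient_a": ([1, 0, 0, 1], [0.9, 0.2, 0.6, 0.7]),
    "patient_b": ([0, 1, 0], [0.3, 0.2, 0.1]),
}
macro_auc = spectrum_macro_auc(per_patient)
print(macro_auc)
```

      In the toy data, patient_a is ranked perfectly and patient_b is ranked at chance level, so the macro average sits halfway between the two.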

      In addition, our description under Supplementary Figure 8 (showing the ROC curve) provides additional clarification:

      Note that this ROC curve is not a traditional ROC curve constructed from one single label set and one corresponding prediction set. Rather, it is constructed from spectrum-macro metrics as follows: for any possible threshold value, binarize all predictions. Then, for every spectrum/patient independently, compute the sensitivity and specificity for the subset of labels corresponding to that spectrum/patient. Finally, those sensitivities and specificities are averaged across patients to obtain one point on the above ROC curve.
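
      One point of such a spectrum-macro ROC curve could be computed as in the sketch below. The names and data layout are hypothetical, predictions at or above the threshold are treated as positive, and patients lacking either class are skipped; these are assumptions of the illustration rather than details given in the response.

```python
def macro_roc_point(per_patient, threshold):
    """Return (mean sensitivity, mean specificity) across patients.

    per_patient maps a patient id to (labels, scores). Scores at or
    above the threshold are binarized to 1. Patients without at least
    one positive and one negative label are skipped (an assumption).
    """
    sensitivities, specificities = [], []
    for labels, scores in per_patient.values():
        preds = [1 if s >= threshold else 0 for s in scores]
        on_pos = [p for y, p in zip(labels, preds) if y == 1]
        on_neg = [p for y, p in zip(labels, preds) if y == 0]
        if not on_pos or not on_neg:
            continue
        sensitivities.append(sum(on_pos) / len(on_pos))      # TP / (TP + FN)
        specificities.append(1 - sum(on_neg) / len(on_neg))  # TN / (TN + FP)
    return (sum(sensitivities) / len(sensitivities),
            sum(specificities) / len(specificities))


# Toy data (hypothetical).
per_patient = {
    "patient_a": ([1, 0, 0, 1], [0.9, 0.2, 0.6, 0.7]),
    "patient_b": ([0, 1, 0], [0.3, 0.2, 0.1]),
}
point = macro_roc_point(per_patient, threshold=0.5)
print(point)
```

      Sweeping the threshold over all observed score values and collecting the resulting points would trace out the full curve.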

      Section 3.2 & reply # 1: can the authors compute and apply the Youden cutoff that gives max precision-sensitivity for each ROC curve? In that way the authors could report those values.

      We have computed this cut-off on the curve shown in Supplementary Figure 8. The Figure now shows the sensitivity and specificity at the Youden cutoff in addition to the ROC. We have chosen only to report these values for this model as we did not want to inflate our manuscript with additional metrics (especially since the ROC-AUC already captures sensitivities and specificities). We do, however, see the value of adding this once, so that biologists have an indication of what kind of values to expect for these metrics.
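
      For readers unfamiliar with the Youden cutoff: Youden's J statistic is sensitivity + specificity - 1, and the cutoff is the threshold that maximizes it. A minimal sketch follows, using made-up curve points rather than values from Supplementary Figure 8.

```python
def youden_cutoff(points):
    """Return the (threshold, sensitivity, specificity) with maximal J.

    points is a list of (threshold, sensitivity, specificity) tuples,
    for example one tuple per threshold on a macro-averaged ROC curve.
    J = sensitivity + specificity - 1.
    """
    return max(points, key=lambda p: p[1] + p[2] - 1)


# Hypothetical curve points, for illustration only.
points = [
    (0.2, 0.95, 0.40),  # J = 0.35
    (0.5, 0.80, 0.75),  # J = 0.55
    (0.8, 0.45, 0.95),  # J = 0.40
]
best_threshold, best_sens, best_spec = youden_cutoff(points)
print(best_threshold, best_sens, best_spec)
```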

      Related to reply #5: assuming that different classifiers are trained on the same data, with the same number of replicates, could the authors use the DeLong test to compare ROC curves? If not, please explain why.

      We thank the reviewer for bringing DeLong’s test to our attention. It does indeed seem that this test is appropriate for comparing two ROC-AUCs computed against the same ground truth values.

      We have chosen not to use this test for one conceptual and one practical reason:

      (1) Our point still stands that in machine learning one chooses the test set, and hence one can artificially increase statistical power by simply allocating a larger fraction of the data to test.

      (2) DeLong’s test is defined for single AUCs (i.e. to compare two lists of predictions against one list of ground truths), but here we report the spectrum/patient-macro ROC-AUC. It is not clear how to adjust the test to macro-evaluated AUCs. One option may be to apply the test per patient ROC curve, and perform multiple testing correction, but then we are not comparing models, but models per patient. In addition, the number of labels/predictions per patient is prohibitively small for statistical power.

      Reviewer #2 (Recommendations For The Authors):

      After revision, all issues have been resolved.

    1. Author response:

      The following is the authors’ response to the previous reviews.

      Response to reviewer’s comments

      Reviewer #2 (Public Review):

      Summary: 

      The manuscript focuses on comparison of two PLP-dependent enzyme classes that perform amino acyl decarboxylations. The goal of the work is to understand the substrate specificity and factors that influence catalytic rate in an enzyme linked to theanine production in tea plants.

      Strengths: 

      The work includes x-ray crystal structures of modest resolution of the enzymes of interest. These structures provide the basis for design of mutagenesis experiments to test hypotheses about substrate specificity and the factors that control catalytic rate. These ideas are tested via mutagenesis and activity assays, in some cases both in vitro and in plants. 

      Weaknesses:

      Although improved in a revision, the manuscript could be more clear in explaining the contents of the x-ray structures and how the complexes studied relate to the reactant and product complexes. The manuscript could also be more concise, with a discussion section that is largely redundant with the results and lacking in providing scholarly context from the literature to help the reader understand how the current findings fit in with work to characterize other PLP-dependent enzymes or protein engineering efforts. Some of the figures lack sufficient clarity and description. Some of the claims about the health benefits of tea are not well supported by literature citations.

      Thank you for your insightful comments on our manuscript and your recognition of the strengths of our study. We understand your concerns about the weaknesses mentioned, and we have addressed them in the revised manuscript. We acknowledge that the discussion section needed to be improved for conciseness and context, and we have revised it by removing the redundant content. We also acknowledge your comments concerning the clarity and description of some figures. We have revisited and revised these figures, ensuring they are clear and adequately described. Lastly, concerning the claims about the health benefits of tea, we understand your concern about the lack of supporting citations. We have ensured that such claims are backed by valid literature or, where necessary, have omitted these statements.

      Reviewer #2 (Recommendations For The Authors):

      (1) Line 21: Alanine Decarboxylase should not be capitalized.

      Thank you very much for your careful reading of the manuscript. We have corrected it in the revised manuscript.

      (2) Line 31: Grammatical error. Also not clear what "evolution analysis" means here. Revise to "Structural comparisons led us to..."

      Thank you very much for your careful reading of the manuscript. We have corrected it in the revised manuscript.

      (3) Line 34: Revise to "Combining a double mutant of CsAlaDC"

      Thank you very much for your careful reading of the manuscript. We have corrected it in the revised manuscript.

      (4) Line 35: Change word order to "increased theanine production 672%"

      Thank you very much for your careful reading of the manuscript. We have corrected it in the revised manuscript.

      (5) Line 37: meaning unclear. Revise to "provides a route to more efficient biosynthesis of theanine."

      Thank you very much for your careful reading of the manuscript. We have corrected it in the revised manuscript.

      (6) Line 44: I'm not sure that the "health effects" of tea have been proven in placebo controlled studies. And the references provided (2-4 and 5) do not describe original research articles supporting these claims. I would suggest removing these statements from the introduction and at later points in the manuscript.

      Thank you for your thoughtful feedback and suggestions. Based on your suggestion, we have removed these statements: "The popularity of tea is determined by its favorable flavor and numerous health benefits (2-4). The flavor and health-beneficial effects of tea are conferred by the abundant secondary metabolites, including catechins, caffeine, theanine, volatiles, etc (5)." As for the subsequent statement, "It has also many health-promoting functions, including neuroprotective effects, enhancement of immune functions, and potential anti-obesity capabilities, among others.", the cited literature substantiates this conclusion.

      (7) Line 58: insert "the" between provided and basis

      Thank you very much for your careful reading of the manuscript. We have corrected it in the revised manuscript.

      (8) Line 100: Not clear what this phrase means, "As expected, CsSerDC was closer to AtSerDC" Please clarify - closer to what?

      We apologize for any confusion caused by the unclear phrasing. When referring to "CsSerDC was closer to AtSerDC," we intended to convey that CsSerDC exhibits a higher degree of sequence homology with AtSerDC than it does with the other enzymes evaluated in our investigation. However, because a 1.29% difference between 86.21% and 84.92% in amino acid similarity is not statistically significant (Figure 1B and Supplementary table 1 in the original manuscript), we have deleted the relevant descriptions in the revised manuscript.

      (9) Line 112: "were constructed into" makes no sense. It would be better to say the genes for the proteins of interest were inserted into the overexpression plasmid.

      Thank you very much for your careful reading of the manuscript. We have corrected it in the revised manuscript.

      (10) Line 115: missing the word "the" between generated and recombinant

      Thank you very much for your careful reading of the manuscript. We have corrected it in the revised manuscript.

      (11) Line 121: catalyze not catalyzed

      Thank you very much for your careful reading of the manuscript. We have corrected it in the revised manuscript.

      (12) Lines 129 and 130: The reported Km values are really large - in the mM range. Do these values make sense in terms of the available concentrations of the substrates inside the cell?

      The content of alanine in tea plant roots ranges from 0.28 to 4.18 mg/g DW (Yu et al., 2021; Cheng et al., 2017). Correspondingly, the physiological concentration of alanine in tea plant roots is 3.14 mM to 46.92 mM. The content of serine in plants ranges from 0.014 to 17.6 mg/g DW (Kumar et al., 2017). Correspondingly, the physiological concentration of serine in plants is 0.13 mM to 167.48 mM. Therefore, the Km values in this study are within the range of available substrate concentrations inside the cell.

      Yu, Y. et al. (2021) Glutamine synthetases play a vital role in high accumulation of theanine in tender shoots of albino tea germplasm "Huabai 1". J. Agric. Food Chem. 69 (46),13904-13915.

      Cheng, S. et al. (2017) Studies on the biochemical formation pathway of the amino acid L-theanine in tea (Camellia sinensis) and other plants. J. Agric. Food Chem. 65 (33), 7210-7216.

      Kumar, V. et al. (2017) Differential distribution of amino acids in plants. Amino Acids. 49(5), 821-869.
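
      The concentrations quoted in the response to point (12) can be reproduced by dividing the content in mg/g DW by the molar mass of the amino acid, as in the sketch below. Reading the resulting mmol per kg of dry weight as mM is an assumption inferred from the reported numbers (it implies roughly 1 L of volume per kg); the response does not state the conversion explicitly. The function name is hypothetical.

```python
ALANINE_MOLAR_MASS = 89.09   # g/mol
SERINE_MOLAR_MASS = 105.09   # g/mol


def content_to_mmol_per_kg(mg_per_g_dw, molar_mass):
    """Convert a content in mg/g dry weight to mmol per kg dry weight.

    mg/g equals g/kg; dividing by g/mol gives mol/kg, and multiplying
    by 1000 gives mmol/kg. Treating this value as mM is an assumption
    of this sketch, not a statement from the source.
    """
    return mg_per_g_dw / molar_mass * 1000


alanine_range = [round(content_to_mmol_per_kg(x, ALANINE_MOLAR_MASS), 2)
                 for x in (0.28, 4.18)]
serine_range = [round(content_to_mmol_per_kg(x, SERINE_MOLAR_MASS), 2)
                for x in (0.014, 17.6)]
print(alanine_range, serine_range)
```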

      (13) Line 211: it is unclear what the phrase "as opposed to wild-type" means. Please clarify.

      Thank you very much for your careful reading of the manuscript and valuable suggestions. We intended to communicate that the wild-type CsAlaDC and AtSerDC demonstrate decarboxylase activity, while the mutated proteins have lost decarboxylation activity. We have addressed this concern in the revised version of the manuscript.

      (14) Line 222: residues not residue

      Thank you very much for your careful reading of the manuscript. We have corrected it in the revised manuscript.

      (15) Line 227 and Figure 4B: It is not clear what the different sequence logos mean in this part of the figure. The caption is too brief and not helpful. And the sentences describing this figure panel are also not sufficiently clear.

      Thank you very much for your careful reading of the manuscript and valuable suggestions. We have provided a more detailed explanation of this section in the revised manuscript and added additional annotations in the figure caption to provide further clarity.

      (16) Lines 233 and 234: "in the substrate specificity" is awkwardly worded. I would revise to "in selective binding of the appropriate substrate."

      Thank you very much for your careful reading of the manuscript and valuable suggestions. We have meticulously revised the description of this section.

      (17) Line 243: a word is missing in this sentence - but I can't figure out the intended meaning or what the missing word is. Rephrase to improve clarity.

      Thank you very much for your careful reading of the manuscript and valuable suggestions. We have revised this sentence to: "These findings indicate the essential role of Phe106 in the selective binding of alanine for CsAlaDC."

      (18) Line 255: The "expression system...was carried out" is not correct. I would say the expression system was used - but you probably also want to rearrange the sentences to more directly say what it was used for. Later, the word "the" is also missing.

      Thank you very much for your careful reading of the manuscript and valuable suggestions. We have revised this sentence to: "To further verify that Phe106 of CsAlaDC and Tyr111 of AtSerDC were key amino acid residues determining its substrate recognition in planta, we employed the Nicotiana benthamiana transient expression system. "

      (19) Line 273: use "understand" instead of "elucidate" and instead of "we proposed a prediction test:" say "we designed a test of the prediction that..."

      Thank you very much for your careful reading of the manuscript. We have revised this sentence to: “In light of this observation, we postulated a hypothesis:”

      (20) Line 301: I don't think "effectuate" is a word. Replace with something else.

      Thank you very much for your careful reading of the manuscript. We have revised the sentence as: "The biosynthetic pathway of theanine in tea plants comprises two consecutive enzymatic steps: alanine decarboxylase facilitates the decarboxylation of alanine to generate EA, while theanine synthetase catalyzes the condensation reaction between EA and Glu to synthesize theanine."

      (21) Line 307: replace "activity" with "ability"

      Thank you very much for your careful reading of the manuscript. We have corrected it in the revised manuscript.

      (22) Line 322: I didn't find the discussion very useful. Much of it is simply a recap of the results - which is not necessary. The structural comparisons are overly descriptive without providing appropriate rationale or topic sentence structure so that the reader understands why certain details are emphasized. I think the manuscript would be much stronger if this section were not included or integrated more concisely into the results section where appropriate.

      Thank you for your constructive comments. We understand your concerns about the discussion section of our manuscript. We acknowledge that the discussion section was redundant with the results. In response, we have revised this section to eliminate unnecessary repetition of the results.

      (23) Line 369: "an amino acid devoid of the hydroxyl moiety present in Lys" - what does this mean? Lys does not have a hydroxyl functional group. Please correct so that the sentence makes sense.

      Thank you very much for your careful reading of the manuscript. This sentence states that the amino acid occupying the corresponding position in CsAlaDC is Phe, which lacks one hydroxyl functional group as compared to Lys. We have made modifications to the sentence as follows: "In contrast, the equivalent position in CsAlaDC is occupied by Phe, an amino acid lacking the hydroxyl group. This substitution enhances the hydrophobic nature of the substrate-binding pocket. "

      (24) Line 370: "This structural nuance portends a predisposition for CsAlaDC to select the comparatively hydrophobic amino acid alanine as its suitable substrate." This sentence also makes no sense - please revise to use simpler language so the meaning is more clear.

      Thank you very much for your careful reading of the manuscript and valuable suggestions. We have revised the sentence as follows: "Consequently, CsAlaDC demonstrates a unique predilection, selectively binding Ala (an amino acid with comparatively hydrophobic properties) as its preferred substrate."

      (25) Lines 376-384: This section makes several references to "catalytic rings." I have no idea what this term means? If the authors mean a loop structure in the enzyme - please use the term "loop"

      Thank you very much for your careful reading of the manuscript and valuable suggestions. We have corrected it in the revised manuscript.

      (26) Line 396-397: The authors reference data that is not shown in the manuscript. Either show the data in the results section or do not mention.

      Thank you for your insightful comment regarding the unshown data referenced in the manuscript. We have included Supplementary figure 9 in the revised manuscript to display this data.

      (27) Line 445-446: what is "mutation technology" - if the authors mean site-directed mutagenesis - please use the simpler and more recognizable terminology.

      Thank you very much for your careful reading of the manuscript and valuable suggestions. We have revised the sentence as follows: "Based on the findings of this study, site-directed mutagenesis can be employed to modify enzymes involved in theanine synthesis. This modification enhances the capacity of bacteria, yeast, model plants, and other organisms to synthesize theanine, thereby facilitating its application in industrial theanine production."

      Reviewer #3 (Public Review):

      In the manuscript titled "Structure and Evolution of Alanine/Serine Decarboxylases and the Engineering of Theanine Production," Wang et al. solved and compared the crystal structures of Alanine Decarboxylase (AlaDC) from Camellia sinensis and Serine Decarboxylase (SerDC) from Arabidopsis thaliana. Based on this structural information, the authors conducted both in vitro and in vivo functional studies to compare enzyme activities using site-directed mutagenesis and subsequent evolutionary analyses. This research has the potential to enhance our understanding of amino acid decarboxylase evolution and the biosynthetic pathway of the plant specialized metabolite theanine, as well as to further its potential applications in the tea industry.

      Thank you very much for taking the time to review this manuscript. We appreciate all your insightful comments.

      Reviewer #3 (Recommendations For The Authors):

      The additional material added by the authors addresses some of the previously raised questions and enhances the manuscript's quality. However, certain critical issues we pointed out earlier remain unaddressed. Some of the new data also raises new questions. To provide readers with more comprehensive data, the authors should include additional quantitative data and convert the data presented in the reviewer's comments into supplemental figure format.

      Thank you for acknowledging the improvements in the revised manuscript and providing further valuable feedback. We understand your concern about the critical issues that have not been fully addressed and the new questions raised by some of the newly added data. We have strived to address these issues with additional analysis and clarification in our subsequent revision. Regarding your suggestion for more quantitative data and converting the data mentioned in the reviewer's comments into a supplemental figure format, we agree that this would provide a more comprehensive view of the results. We have reformatted the relevant data into supplemental figures to enhance the clarity and accessibility of information. We are grateful for the time and effort you have dedicated to improving our manuscript.

      * Page 5 & Figure 1B

      "As expected, CsSerDC was most closed to AtSerDC, which implies that they shared similar functions. However, CsAlaDC is relatively distant from CsSerDC."

      : In Figure 1B, CsSerDC and AtSerDC are in different clades, and this figure does not show that the two enzymes are closest. To provide another quantitative comparison, please provide a matrix table showing amino acid sequence similarities as a supplemental table. 

      Comment: I don't believe that a 1.29% difference between 86.21% and 84.92% in amino acid similarity is statistically significant. Although the authors have rephrased the original sentence, it's improbable that this small 1.29% difference can explain the observed distinction.

      Many thanks. We have carefully considered your comments. Indeed, the 1.29% difference in amino acid similarity cannot reflect the functional difference between the AlaDC and SerDC proteins. We have deleted the relevant descriptions in the revised manuscript.

      * Page 6, Figure 2, Page 23 (Methods)

      "The supernatants were purified with a Ni-Agarose resin column followed by size-exclusion chromatography."

      : What kind of SEC column did the authors use? Can the authors provide the SEC elution profile comparison results and size standard curve?

      Comment: The authors should include the SEC elution profiles as a supplemental figure or incorporate them as a panel in Figure 2. Furthermore, they should provide a description of the oligomeric state of each protein in this experiment. Additionally, there is a significant difference between CsSerDC (65.38 mL) and CsAlaDC (74.37 mL) elution volumes. Can this difference be explained structurally? In comparison to the standard curve of molecular weight provided by the authors, it appears that these proteins are at least homo-tetramers, which contradicts the description in the text. This should be re-evaluated and clarified.  

      Thank you very much for your careful reading of the manuscript and valuable suggestions. We have included the SEC elution profile in Supplemental figure 1A and added descriptions of the oligomeric states of the proteins in the revised manuscript. CsSerDC eluted at 65.38 mL, corresponding to a molecular weight of 292 kDa, which is roughly five times the monomeric mass (54.7 kDa). However, in the absence of a CsSerDC crystal structure, it remains uncertain whether the protein forms a pentamer. AtSerDC eluted at 72.25 mL, corresponding to a molecular weight of 155 kDa, which is 3.3 times the monomer (47.3 kDa). CsAlaDC eluted at 74.37 mL, corresponding to a molecular weight of 127 kDa, which is 2.7 times the monomer (47.3 kDa). The elution profiles suggest that AtSerDC and CsAlaDC potentially exist as homotrimers. This observation contradicts our subsequent structural findings, in which the proteins are dimeric. A plausible explanation is that the proteins are not ideally spherical. In that case, the hydrodynamic radius of the protein could exceed what its actual mass would predict, leading to an overestimation of the molecular weight by size-exclusion chromatography [ref].

      References:

      Burgess, R. R. (2018) A brief practical review of size exclusion chromatography: Rules of thumb, limitations, and troubleshooting. Protein Expression and Purification. 150, 81-85.

      Erdner J. M., et al. (2006) Size-Exclusion Chromatography Using Deuterated Mobile Phases. Journal of Chromatography A. 1129(1):41–46.
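      The oligomeric-state ratios quoted in the response above can be rechecked with a few lines of arithmetic. The following is a minimal sketch, assuming only the apparent molecular weights and monomer masses stated in the response; the function and variable names are illustrative and are not taken from the manuscript.

```python
# Apparent molecular weights (kDa) read from the SEC standard curve and
# monomer masses (kDa), both as quoted in the authors' response.
sec_estimates = {
    "CsSerDC": (292.0, 54.7),
    "AtSerDC": (155.0, 47.3),
    "CsAlaDC": (127.0, 47.3),
}


def apparent_subunits(apparent_kda, monomer_kda):
    """Return the apparent number of subunits implied by the SEC estimate."""
    return apparent_kda / monomer_kda


# Ratio of apparent mass to monomer mass for each protein.
ratios = {
    name: apparent_subunits(apparent, monomer)
    for name, (apparent, monomer) in sec_estimates.items()
}

for name, ratio in ratios.items():
    print(f"{name}: {ratio:.1f} x monomer")
```

      The ratios are non-integer, which is consistent with the authors' point that SEC reports hydrodynamic size rather than true mass, so the values indicate an apparent rather than an actual oligomeric state.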

      * Page 6 & Page 24 (Methods)

      "The 100 μL reaction mixture, containing 20 mM substrate (Ala or Ser), 100 mM potassium phosphate, 0.1 mM PLP, and 0.025 mM purified enzyme, was prepared and incubated at standard conditions (45 °C and pH 8.0 for CsAlaDC, 40 °C and pH 8.0 for AtSerDC for 30 min)."

      (1) The enzymatic activities of CsAlaDC and AtSerDC were measured at two different temperatures (45 and 40 °C), but their activities were directly compared. Is there a reason for experimenting at different temperatures?

      (2) Enzyme activities were measured at temperatures above 40 °C, which is not a physiologically relevant temperature and may affect the stability or activity of the proteins. At the very least, the authors should provide temperature-dependent protein stability data (e.g., CD spectra analysis) or, if possible, temperature-dependent enzyme activities, to show that their experimental conditions are suitable for studying the activities of these enzymes.

      Comment: I appreciate the authors for including temperature-dependent enzyme activity data in their study. However, it remains puzzling that plant enzymes were tested at a physiologically irrelevant temperature of 40 and 45 degrees Celsius. Additionally, it may not be appropriate to directly compare enzyme activity measurements at different temperatures. Furthermore, the data at 45 degrees in panel A appears to be an outlier, which contrasts with the overall trend observed in the graph.

      We appreciate your point regarding the testing temperatures for plant enzymes, and we recognize the importance of conducting experiments under physiologically relevant conditions. However, the intent behind operating at these elevated temperatures was to assess the thermal stability of the enzymes, which can be a valuable characteristic in certain applications, such as industrial production processes, and does not necessarily reflect their physiological conditions. Our findings indicate that CsAlaDC exhibits its peak activity at 45 °C. This result aligns with previously reported data in the literature [Bai, P. et al. (2021) figure 4e], which supports the reliability of our experimental outcomes.

      Author response image 1.

      Relative activity of CsAlaDC at different temperatures.

      * Pages 6-7 & Table 1

      (1) Use the correct notation for Km and Vmax. Also, the authors show kinetic parameters and use multiple units (e.g., mmol/L or mM for Km).

      (2) When comparing the catalytic efficiency of enzymes, kcat/Km (or Vmax/Km) is generally used. The authors present a comparison of catalytic activity from results to conclusion. A clarification of what results are being compared is needed.

      Comment: The authors are still comparing catalytic efficiency solely based on the Vmax values. As previously suggested, it would be advisable to calculate kcat/Km and employ it for comparing catalytic efficiencies. Furthermore, based on the data provided by the authors, I conducted a rough calculation of these catalytic efficiencies and did not observe a significant difference, which contrasts with the authors' statement, "These findings indicated that the catalytic efficiency of CsAlaDC is considerably lower than that of both CsSerDC and AtSerDC." This discrepancy requires clarification.  

      We want to express our sincere appreciation for your meticulous review and constructive suggestions. We understand the importance of comparing catalytic efficiencies using kcat/Km values rather than relying solely on Vmax values. Following your suggestion, we recalculated kcat/Km to reanalyze our results. The computed kcat/Km values for CsSerDC and AtSerDC are 152.7 s⁻¹ M⁻¹ and 184.6 s⁻¹ M⁻¹, respectively. For CsAlaDC, the calculated kcat/Km is 55.7 s⁻¹ M⁻¹. Therefore, the catalytic efficiency of CsSerDC and AtSerDC is approximately three times that of CsAlaDC. What we intended to convey was that the Vmax of CsAlaDC is lower than that of CsSerDC and AtSerDC. Our description in the manuscript was not accurate, and we have addressed this in the revised version.
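      The "approximately three times" statement can be checked directly from the kcat/Km values reported above. This is a minimal sketch that assumes only those three values; the dictionary and variable names are illustrative.

```python
# kcat/Km values (s^-1 M^-1) as reported in the authors' response.
catalytic_efficiency = {
    "CsSerDC": 152.7,
    "AtSerDC": 184.6,
    "CsAlaDC": 55.7,
}

# Express each enzyme's efficiency as a fold change relative to CsAlaDC.
reference = catalytic_efficiency["CsAlaDC"]
fold_over_csaladc = {
    name: value / reference for name, value in catalytic_efficiency.items()
}

for name, fold in fold_over_csaladc.items():
    print(f"{name}: {fold:.2f}-fold relative to CsAlaDC")
```

      Comparing kcat/Km rather than Vmax normalizes for enzyme concentration and substrate affinity, which is why the reviewer requested it.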

      * Pages 9 & 10

      "This result suggested this Tyr is required for the catalytic activity of CsAlaDC and AtSerDC."

      : The author's results are interesting, but it is recommended to perform the experiments in a specific order. First, experiments should determine whether mutagenesis affects the protein's stability (e.g., CD, as discussed earlier), and second, whether mutagenesis affects ligand binding (e.g., ITC, SPR, etc.), before describing how site-directed mutagenesis alters enzyme activity. In particular, the authors' hypothesis would be much more convincing if they could show that the ligand binding affinity is similar between WT and mutants.

      Comments: While it is appreciated that you have included CD and UV-vis absorption spectra data, it would be more beneficial to provide quantitative data to address the previously proposed binding affinity. I also recommend presenting the data mentioned in the reviewer's comments as a supplementary figure for better clarity and reference.  

      Thank you for your valuable feedback and suggestions. We agree that providing quantitative data would lend more support to our findings and better address the proposed binding affinity.

      It is generally acknowledged that proteins complexed with PLP exhibit a yellow hue, and that the ligand PLP forms a Schiff base with the ε-amino group of a lysine residue in the protein, with maximum absorbance around 420 nm. However, during protein purification, we observed that the purified protein retained its yellow coloration even when PLP was not added to the purification buffer. Subsequent absorbance measurements revealed that the protein absorbed at this wavelength (420 nm) (the experimental results are shown in the following figures), implying that the PLP ligand was already bound to the protein. This could have resulted from PLP binding during the protein's expression in E. coli. Because the protein and the ligand could not be separated, obtaining quantitative binding data experimentally was not feasible.

      Author response image 2.

      (A) Absorption Spectra of CsAlaDC (WT) and CsAlaDC (Y336F). (B) Absorption Spectra of AtSerDC (WT) and AtSerDC (Y341F).

      Regarding your suggestion about presenting the data mentioned in the reviewer's comments as a supplementary figure, we agree that it is an excellent idea. We have prepared supplementary figure 7 and supplementary figure 8 accordingly, ensuring that they present the required data.

      * Page 10

      "The results showed that 5 mM L-DTT reduced the relative activity of CsAlaDC and AtSerDC to 22.0% and 35.2%, respectively"

      : The authors primarily use relative activity to compare WT and mutants. Can the authors specify the exact experiments, units, and experimental conditions? Is it Vmax or catalytic efficiency? If so, under what specific experimental conditions?

      Response: "However, due to the unknown mechanism of DTT inhibition on protein activity, we have removed this part of the content in the revised manuscript."

      Comment: I believe this requires a more comprehensive explanation rather than simply removing it from the text.  

      Although we have observed that DTT is capable of inhibiting enzyme activity, at present, we are unable to offer a comprehensive explanation for the inhibitory effect of DTT on enzyme activity in terms of its structural and catalytic mechanisms. Further research is required to elucidate the mechanism of action of DTT. It is worth noting, however, that our study does not emphasize investigating the specific inhibitory mechanisms of DTT on enzyme activity. Furthermore, the existing findings do not provide an adequate explanation for the observed phenomenon, leading us to exclude this particular aspect from the content.

      * Pages 10-12

      : The identification of 'Phe106 in CsAlaDC' and 'Tyr111 in AtSerDC,' along with the subsequent mutagenesis and enzymatic activity assays, is intriguing. However, the current manuscript lacks an explanation and discussion of the underlying reasons for these results. As previously mentioned, it would be helpful to gain insights and analysis from WT-ligand and mutant-ligand binding studies (e.g., ITC, SPR, etc.). Furthermore, the authors' analysis would be more convincing with accompanying structural analysis, such as steric hindrance analysis.

      Comment: While it is appreciated that you have included UV-vis absorption spectra data, it would be more beneficial to provide quantitative data to address the previously proposed binding affinity. I also recommend presenting the data mentioned in the reviewer's comments as a supplementary figure for better clarity and reference.  

      Response: Thank you for your valuable feedback and suggestions. Given that the protein forms a complex with PLP during its expression in E. coli and cannot be dissociated from it, obtaining quantitative data via experimental protocols is rendered impracticable.

      Author response image 3.

      (A) Absorption Spectra of CsAlaDC (WT) and CsAlaDC (F106Y). (B) Absorption Spectra of AtSerDC (WT) and AtSerDC (Y111F).

      Mutant proteins and wild-type proteins exhibited absorption bands at 420 nm, suggesting the formation of a Schiff base between PLP and the active-site lysine residue.

      Regarding your suggestion about presenting the data mentioned in the reviewer's comments as a supplementary figure, we have prepared supplementary figure 7 and supplementary figure 8 accordingly, ensuring that they present the required data.

    1. Author response:

      The following is the authors’ response to the previous reviews

      Public Reviews:

      Reviewer #1 (Public Review):

      (1) It is a nice study but lacks some functional data required to determine how useful these alleles will be in practice, especially in comparison with the figure line that stimulated their creation.

      We are grateful for this comment. Regarding the usefulness of these alleles, Figure 3 shows that specific and efficient genetic manipulation of one cell subpopulation can be achieved by crossing the DreER mouse strain with the rox-Cre mouse strain. In addition, Figure 7 shows that R26-loxCre-tdT can effectively ensure Cre-loxP recombination on some gene alleles for genetic manipulation. The expression of the tdT protein is aligned with the expression of the Cre protein (Alb roxCre-tdT and R26-loxCre-tdT, Figure 2 and Figure 5), which ensures the accuracy of the tracing experiments. We believe more functional data can be shown in future articles that use the mouse lines described in this manuscript.

      (2) The data in Figure 5 show strong activity at the Confetti locus, but the design of the newly reported R26-loxCre line lacks a WPRE sequence that was included in the iSure-Cre line to drive very robust protein expression.

      Thank you for bringing up this point in the manuscript. In the R26-loxCre-tdT mice knock-in strategy, the WPRE sequence is added behind the loxCre-P2A-tdT sequence, as shown in Supplementary Figure 9.

      (3) The most valuable experiment for such a new tool would be a head-to-head comparison with iSure (or the latest iSure version from the Benedito lab) using the same CreER and target floxed allele. At the very least a comparison of Cre protein expression between the two lines using identical CreER activators is needed.

      Thank you for your valuable and insightful comment. The comparison results of R26-loxCre-tdT with iSuRe-Cre using Alb-CreER and targeting R26-Confetti can be found in Figure 6, according to the reviewer’s suggestion.

      (4) Why did the authors not use the same driver to compare mCre 1, 4, 7, and 10? The study in Figure 2 uses Alb-roxCre for 1 and 7 and Cdh5-roxCre for 4 and 10, with clearly different levels of activity driven by the two alleles in vivo. Thus whether mCre1 is really better than mCre4 or 10 is not clear.

      Response: or two mCre versions that work efficiently. For example, if Alb-mCre1 was competitive with Cdh5-mCre10, we can use them for targeting genes in different cell types, broadening the potential utility of these mice.

      (5) Technical details are lacking. The authors provide little specific information regarding the precise way that the new alleles were generated, i.e. exactly what nucleotide sites were used and what the sequence of the introduced transgenes is. Such valuable information must be gleaned from schematic diagrams that are insufficient to fully explain the approach.

      Response: We appreciate your thoughtful suggestions. The schematic figures, along with the nucleotide sequences for the generation of mice, can be found in the revised Supplementary Figure 9.

      Reviewer #2 (Public Review):

      (1) The scenario where the lines would demonstrate their full potential compared to existing models has not been tested.

      Thank you for your thoughtful and constructive comment. The comparative analysis of R26-loxCre-tdT with iSuRe-Cre, employing Alb-CreER to target R26-Confetti, is provided in Figure 6.

      (2) The challenge lies in performing such experiments, as low doses of tamoxifen needed for inducing mosaic gene deletion may not be sufficient to efficiently recombine multiple alleles in individual cells while at the same time accurately reporting gene deletion. Therefore, a demonstration of the efficient deletion of multiple floxed alleles in a mosaic fashion would be a valuable addition.

      Thank you for your constructive comments. Mosaic analysis using sparse labeling and efficient gene deletion would be our future direction using roxCre and loxCre strategies.

      (3) When combined with the confetti line, the reporter cassette will continue flipping, potentially leading to misleading lineage tracing results.

      Thank you for your professional comments. Indeed, the confetti reporter used in this study can continue flipping, which could lead to misleading lineage tracing results. We used R26-Confetti to demonstrate the robustness of mCre for recombination. Some multicolor mouse lines that do not flip have been published, for example, R26-Confetti2 (PMID: 30778223) and Rainbow (PMID: 32794408). These reporters could be used for tracing Cre-expressing cells without concerns about flipping of reporter cassettes.

      (4) Constitutive expression of Cre is also associated with toxicity, as discussed by the authors in the introduction.

      Thank you for your professional comments. The toxicity of constitutive expression of Cre and the toxicity associated with tamoxifen treatment in CreER mouse lines (PMID: 37692772) are known to the field. This study cannot resolve the toxicity of constitutive Cre expression. Many mouse lines with constitutive Cre driven by different promoters are used across various fields and present similar toxicity. To solve this issue, it would be possible to construct a new strategy that enables the removal of Cre after its expression.

      Reviewer #3 (Public Review):

      (1) Although leakiness is rather minor according to the original publication and the senior author of the study wrote in a review a few years ago that there is no leakiness (https://doi.org/10.1016/j.jbc.2021.100509).

      Thank you so much for your careful check. In this review (PMID: 33676891), the writer’s comments on iSuRe-Cre are on the reader's side, and all summary words are based on the original published paper (PMID: 31118412). Currently, we have tested iSuRe-Cre in our hands. We did detect some leakiness in the heart and muscle, but hardly in other tissues as shown in Author response image 1.

      Author response image 1.

      Leakiness in Alb CreER;iSuRe-Cre mouse line. Pictures are representative results for 5 mice. Scale bars, white 100 µm.

      (2) I would have preferred to see a study, which uses the wonderful new tools to address a major biological question, rather than a primarily technical report, which describes the ongoing efforts to further improve Cre and Dre recombinase-mediated recombination.

      Response: We gratefully appreciate your valuable comment. The roxCre and loxCre mice mentioned in this study provide more effective methods for inducible genetic manipulation in studying gene function. We hope that the application of our new genetic tools could help address some major biological questions in different biomedical fields in the future.

      (3) Very high levels of Cre expression may cause toxic effects as previously reported for the hearts of Myh6-Cre mice. Thus, it seems sensible to test for unspecific toxic effects, which may be done by bulk RNA-seq analysis, cell viability, and cell proliferation assays. It should also be analyzed whether the combination of R26-roxCre-tdT with the Tnni3-Dre allele causes cardiac dysfunction, although such dysfunctions should be apparent from potential changes in gene expression.

      We are sorry that we mistakenly wrote R26-roxCre-tdT instead of R26-loxCre-tdT in our manuscript. We have not generated an R26-roxCre-tdT mouse line. We also thank the reviewer for raising concerns about the toxicity of high Cre expression. The toxicity of constitutive expression of Cre and the toxicity of tamoxifen treatment in CreER mouse lines (PMID: 37692772) are known to the field. This study cannot resolve the toxicity of constitutive Cre expression. Many mouse lines with constitutive Cre driven by different promoters are used across various fields and present similar toxicity. To solve this issue, it would be possible to construct a new strategy that enables the removal of Cre after its expression.

      (4) Is there any leakiness when the inducible DreER allele is introduced but no tamoxifen treatment is applied? This should be documented. The same also applies to loxCre mice.

      In this study, we developed new mouse tool lines, including Alb roxCre1-tdT, Cdh5 roxCre4-tdT, Alb roxCre7-GFP, Cdh5 roxCre10-GFP and R26-loxCre-tdT. As shown in Supplementary Figure 1, Supplementary Figure 2, and Figure 4D, Alb roxCre1-tdT, Cdh5 roxCre4-tdT, Alb roxCre7-GFP, Cdh5 roxCre10-GFP and R26-loxCre-tdT are not leaky. Therefore, any leakiness observed in the presence of an inducible DreER or CreER allele derives from the DreER or CreER itself. Additional pertinent experimental data can be found in Figure S4C, Figure S7A-B, and Figure S8A.

      (5) It would be very helpful to include a dose-response curve for determining the minimum dosage required in Alb-CreER; R26-loxCre-tdT; Ctnnb1flox/flox mice for efficient recombination.

      Thank you for your suggestion. We value your feedback and have incorporated your suggestion to strengthen our study. Relevant experimental data can be referenced in Figure S8E-G.

      (6) In the liver panel of Figure 4F, tdT signals do not seem to colocalize with the VE-cad signals, which is odd. Is there any compelling explanation?

      The staining in Figure 4F in the revision is intended to deliver optimized and high-resolution images.

      (7) The authors claim that "virtually all tdT+ endothelial cells simultaneously expressed YFP/mCFP" (right panel of Figure 5D). Well, it seems that the abundance of tdT is much lower compared to YFP/mCFP. If the recombination of R26-Confetti was mainly triggered by R26-loxCre-tdT, the expression of tdT and YFP/mCFP should be comparable. This should be clarified.

      Thank you so much for your careful check. We checked these signals carefully and didn't find the “much lower” tdT signal. Because the file-upload website has a file size limitation, image compression rendered some signals unclear. We have attached clear high-resolution images here. Author response image 2 shows how we split the tdT signal and compared it with YFP/mCFP.

      Author response image 2.

      (8) In several cases, the authors seem to have mixed up "R26-roxCre-tdT" with "R26-loxCre-tdT". There are errors in #251 and #256. Furthermore, in the passage from line #278 to #301. In the lines #297 and #300 it should probably read "Alb-CreER; R26-loxCre-tdT; Ctnnb1flox/flox" rather than "Alb-CreER;R26-tdT2;Ctnnb1flox/flox".

      We are grateful for these careful observations. We have corrected these typos accordingly.

      Recommendations for the authors:

      Reviewer #1:

      (1) However, for it to be useful to investigators a more direct comparison with the Benedito iSure line (or the latest version) is required as that is the crux of the study.

      Thank you for emphasizing this point, which we have now addressed in the revised manuscript and Figure 6.

      (2) I would like to know how the authors will make these new lines available to outside investigators.

      Please contact the lead author by email to consult about the availability of new mouse lines developed in this study.

      (3) The discussion is overly long and fails to address potential weaknesses. Much of it reiterates what was already said in the results section.

      We are thankful for your critical evaluation, which has helped us improve our discussion.

      Reviewer #2:

      (1) Assessing the efficiency and accuracy of the lines in mosaic deletions of multiple alleles and reporting them in single cells after low-dose tamoxifen exposure would be highly beneficial to demonstrate the full potential of the models.

      We appreciate your careful consideration of this issue. Our future endeavors will focus on mosaic analysis utilizing sparse labeling and efficient gene deletion, employing both roxCre and loxCre strategies.

      (2) Performing FACS analysis to confirm that all targeted (Cre reporter-positive) cells are also tdT-positive would provide more precise data and avoid vague statements like 'virtually all' or 'almost complete' in the results section:

      Line 166: Although mCre efficiently labeled virtually all targeted cells (Figure S3A-E)...

      Line 293: ... and not a single tdT+ hepatocyte expressed Cyp2e1 (Figure 6D)... However, the authors do not provide any quantification. FACS would be ideal here.

      Line 244: ...expression of beta-catenin and GS almost disappeared in the 4W mutant sample... The resolution in the provided PDF is not adequate for assessment.

      Line 296: ... revealed almost complete deletion of Ctnnb1 in the Alb-CreER;R26-tdT2;Ctnnb1flox/flox mice...

      Thank you for suggesting these improvements, which have strengthened the robustness of our conclusions. In the revised version, we have incorporated FACS results that correspond to related sections. Additionally, a quantification statement has been included in the statistical analysis section. We appreciate your meticulous review and comments, which have significantly improved the clarity of our manuscript.

      (3) In the beginning of the results section, it is not clear which results are from this study and which are known background information (like Figure 1A). For example, it is not clear if Figure 1C presents data from R26-iSuRe-Cre. Please revise the text to more clearly present the experimental details and new findings.

      Thank you for this observation. Figure 1C belongs to this study, and the revised version has been modified to the related statement for improved clarity.

      (4) Experimental details regarding the genetic constructs and genotyping of the new knock-in lines are missing. Are R26 constructs driven by the endogenous R26 promoter or were additional enhancers used?

      Thank you for emphasizing this point. The schematic figures and nucleotide sequences for the generation of mice can be found in the revised Supplementary Figure 9, which can help to address this issue.

      (5) The method used to quantify mCre activity in terms of reporter+ target cells is not specified. From images or by FACS?

      Additionally, if images were used for quantification, it would be important to provide details on the number of images analyzed, the number of cells counted per image, and how individual cells were identified.

      Thank you for your comment. We have included the quantification statement in the statistical analysis section. Analyzing R26-Confetti+ target cells using FACS is challenging due to the limitations of the sorting instrument. Consequently, we quantified the related data by images. Each dot on the chart represents one sample, and the quantification for each mouse was conducted by averaging the data from five 10x fields taken from different sections.
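      The per-mouse quantification described above (one value per mouse, obtained by averaging five 10x fields) can be sketched as follows. This is only an illustration under stated assumptions: the function name is hypothetical and the field values are invented placeholders, not data from the study.

```python
from statistics import mean

FIELDS_PER_MOUSE = 5  # five 10x fields per mouse, as stated in the response


def per_mouse_percentage(field_percentages):
    """Average the reporter+ percentage over the fields imaged for one mouse."""
    if len(field_percentages) != FIELDS_PER_MOUSE:
        raise ValueError(f"expected {FIELDS_PER_MOUSE} fields per mouse")
    return mean(field_percentages)


# Hypothetical placeholder values (percent reporter+ target cells per field).
example_fields = {
    "mouse_1": [90.0, 92.0, 88.0, 91.0, 89.0],
    "mouse_2": [85.0, 87.0, 86.0, 84.0, 88.0],
}

# One averaged value per mouse; each would be a single dot on the chart.
per_mouse = {
    mouse: per_mouse_percentage(fields)
    for mouse, fields in example_fields.items()
}
print(per_mouse)
```

      Averaging fields within each mouse before plotting keeps the mouse, rather than the image field, as the unit of replication.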

      (6) Line 160: These data demonstrate that roxCre was functionally efficient yet non-leaky. Functional efficiency in vivo was not shown in the preceding experiments.

      Functional efficiency in vivo can be referred to in Figures S1-S2 and S4C.

      (7) It would be useful to provide a reference for easy vs low-efficiency recombination of different reporter alleles (lines 56-58).

      We are grateful for this comment, as it has allowed us to improve the clarity of our explanation. Consequently, we have made the necessary modifications.

      (8) Discussion on the potential drawbacks and limitations of the lines would be useful.

      We are thankful for your evaluation, which has significantly contributed to the enhancement of our discourse.

    1. Author Response

      The following is the authors’ response to the original reviews.

      eLife assessment

      This paper investigates host and viral factors influencing transmission of alpha and delta SARS-CoV-2 variants in the Syrian hamster model and fundamentally increases knowledge regarding transmission of the virus via the aerosol route. The strength of evidence is solid and could be improved with a clearer presentation of the data.

      We thank the editors for their assessment. We are excited to present a revised version of the manuscript with improved data presentation and an improved discussion addressing the reviewer’s concerns.

      Public Reviews:

      Reviewer #1 (Public Review):

      In the submitted manuscript, Port et al. investigated the host and viral factors influencing the airborne transmission of SARS-CoV-2 Alpha and Delta variants of concern (VOC) using a Syrian hamster model. The authors analyzed the viral load profiles of the animal respiratory tracts and air samples from cages by quantifying gRNA, sgRNA, and infectious virus titers. They also assessed the breathing patterns, exhaled aerosol aerodynamic profile, and size distribution of airborne particles after SARS-CoV-2 Alpha and Delta infections. The data showed that male sex was associated with increased viral replication and virus shedding in the air. The relationship between co-infection with VOCs and the exposure pattern/timeframe was also tested. This study appears to be an expansion of a previous report (Port et al., 2022, Nature Microbiology). The experimental designs were rigorous, and the data were solid. These results will contribute to the understanding of the roles of host and virus factors in the airborne transmission of SARS-CoV-2 VOCs.

      Reviewer #2 (Public Review):

      This manuscript by Port and colleagues describes rigorous experiments that provide a wealth of virologic, respiratory physiology, and particle aerodynamic data pertaining to aerosol transmission of SARS-CoV-2 between infected Syrian hamsters. The data is particularly significant because infection is compared between alpha and delta variants, and because viral load is assessed via numerous assays (gRNA, sgRNA, TCID) and in tissues as well as the ambient environment of the cage. The paper will be of interest to a broad range of scientists including infectious diseases physicians, virologists, immunologists and potentially epidemiologists. The strength of evidence is relatively high but limited by unclear presentation in certain parts of the paper.

      Important conclusions are that infectious virus is only detectable in air samples during a narrow window of time relative to tissue samples, that airway constriction increases dynamically over time during infection limiting production of fine aerosol droplets, that variants do not appear to exclude one another during simultaneous exposures and that exposures to virus via the aerosol route lead to lower viral loads relative to direct inoculation suggesting an exposure dose response relationship.

      While the paper is valuable, I found certain elements of the data presentation to be unclear and overly complex.

      Reviewer #1 (Recommendations For The Authors):

      We thank the reviewer for their comments and their attention to detail. We have taken the following steps to address their suggestions and concerns.

      However, the following concerns need to be issued.

      1. Summary seems to be too simple, and some results are not clearly described in the summary.

      We have edited the summary and hope to have addressed the concerns raised by providing more information. We think that the summary includes all relevant findings.

      “It remains poorly understood how SARS-CoV-2 infection influences the physiological host factors important for aerosol transmission. We assessed breathing pattern, exhaled droplets, and infectious virus after infection with Alpha and Delta variants of concern (VOC) in the Syrian hamster. Both VOCs displayed a confined window of detectable airborne virus (24-48 h), shorter than compared to oropharyngeal swabs. The loss of airborne shedding was linked to airway constriction resulting in a decrease of fine aerosols (1-10µm) produced, which are suspected to be the major driver of airborne transmission. Male sex was associated with increased viral replication and virus shedding in the air. Next, we compared the transmission efficiency of both variants and found no significant differences. Transmission efficiency varied mostly among donors, 0-100% (including a superspreading event), and aerosol transmission over multiple chain links was representative of natural heterogeneity of exposure dose and downstream viral kinetics. Co-infection with VOCs only occurred when both viruses were shed by the same donor during an increased exposure timeframe (24-48 h). This highlights that assessment of host and virus factors resulting in a differential exhaled particle profile is critical for understanding airborne transmission.”

      2. Aerosol transmission experiment should be described in Materials and Methods although it is cited as Reference 21#;

      We have modified Line 433:

      “Aerosol caging

      Aerosol cages as described by Port et al. [2] were used for transmission experiments and air sampling as indicated. The aerosol transmission system consisted of plastic hamster boxes (Lab Products) connected by a plastic tube. The boxes were modified to accept a 7.62 cm (3") plastic sanitary fitting (McMaster-Carr), which enabled the length between the boxes to be changed. Airflow was generated with a vacuum pump (Vacuubrand) attached to the box housing the naïve animals and was controlled with a float-type meter/valve (McMaster-Carr).”

      And Line 458: “During the first 5 days, hamsters were housed in modified aerosol cages (only one hamster box) hooked up to an air pump.”.

      Especially, one superspreading event of Alpha VOC (donor animal) was observed in iteration A (Figure 4). What causes that event, experiment system?

      Based on the observed variation in airborne shedding (of the cages from which this was directly measured), we believe that one plausible explanation for the super-spreading event was that the Alpha-infected donor shed considerably more virus during the exposure than other donors, and thus more readily infected the sentinels. That said, it is also conceivable that other factors such as hamster behavior (e.g., closeness to the cage outlet, sleeping) or variable sentinel susceptibility could affect the distribution of transmissions.

      3. Same reference is repeatedly listed as Refs 2 and 21#.

      Addressed. We thank the reviewer for their attention to detail. We have also removed reference 53, which was the same as 54.

      4. Two forms of described time (hour and h) are used in the manuscript. Single form should be chosen.

      This has been addressed.

      5) Virus designation located in line 371 and line 583 is inconsistent, and it needs to be revised.

      For consistency we have chosen this nomenclature for the viruses used: SARS-CoV-2 variant Alpha (B.1.1.7) (hCoV-19/England/204820464/2020, EPI_ISL_683466) and variant Delta (B.1.617.2) (hCoV-19/USA/KY-CDC-2-4242084/2021, EPI_ISL_1823618).

      6. In Figure 5F, what time were lung and nasal turbinate tissues collected after virus infection?

      This has been added to the legend. Day 5. Line 904.

      7. Line 562-563, what is the coating antigen (spike protein, generated in-house)? purified or recombinant protein?

      It is in-house purified recombinant protein. This has been added to the methods.

      8. Line 575 and line 578: 10,000x is not standard description, and it should be revised.

      Done.

      Reviewer #2 (Recommendations For The Authors):

      We thank the reviewer for their comments and suggestions to improve the manuscript, and hope we have addressed all concerns adequately.

      • Direct interpretation of the linear regression slope in Figure 3 is challenging. Is the most relevant parameter for transmission known? Intuitively, it would be the absolute number of small droplets at a given timepoint rather than the slope and it would be easier to interpret if the data were reported in this fashion.

      We decided to show a percentage of counts to normalize the data among animals, as we observed large inter-individual variation in counts. The reviewer is correct that it is most likely the number of particles that would be most relevant to transmission, though much (including the role of particle size) remains to be determined. We have added a sentence to the results which explains this in L157.

      Therefore, we decided in this first analysis to utilize the slope measurement and not raw counts. The focus was on the slopes and how particle profiles were changing post inoculation. Because the analysis, model, and results are all based on percentages of particles, it does not seem appropriate to present particle counts within each diameter range.

      Use of regression to compute slope is a useful measure because it uses data from all timepoints to estimate the regression line and, therefore, the % of particles on each day. We decided on these methods because efficiency is especially important in a study with a relatively small number of animals and slopes are also a good surrogate for how animal particle profiles are changing post-inoculation.

      To assist with the interpretation: 1) We removed Figure 3C and D and replaced Figure 3B with individual line plots for all conditions to visualize the slopes. The figure legend was corrected to reflect these changes.

      2) We replaced L169 onwards to read: (Figure 3B). Females had a steeper decline at an average rate of 2.2 per day after inoculation in the percent of 1-10 μm particles (and a steeper incline for <0.53 μm) when compared to males, while holding variant group constant. When we compared variant group while holding sex constant, we found that the Delta group had a steeper decline at an average rate of 5.6 per day in the percent of 1-10 μm particles (and a steeper incline for <0.53 μm); a similar trend, but not as steep, was observed for the Alpha group.

      The estimated difference in slopes for Delta vs. controls and Alpha vs. controls in the percent of <0.53 μm particles was 5.4 (two-sided adjusted p= 0.0001) and 2.4 (two-sided adjusted p = 0.0874), respectively. The estimated difference in slopes for percent of 1-10 μm particles was not as pronounced, but similar trends were observed for Delta and Alpha. Additionally, a linear mixed model was considered and produced virtually the same results as the simpler analysis described above; the corresponding linear mixed model estimates were the same and standard errors were similar.
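
      The slope-based summary described above can be illustrated with a minimal sketch. It assumes a plain ordinary least-squares fit of the percentage of particles in one size range against day post inoculation; the function names and the example values are hypothetical and are not taken from the study data or the authors' analysis code.

```python
def percent_in_range(counts_in_range, total_counts):
    """Percentage of all counted particles that fall in one size range."""
    if total_counts <= 0:
        raise ValueError("total_counts must be positive")
    return 100.0 * counts_in_range / total_counts


def ols_slope(days, percents):
    """Ordinary least-squares slope of percent-of-particles against day.

    Uses every timepoint, so the result is the average change in
    percentage points per day over the whole observation window.
    """
    n = len(days)
    if n < 2 or n != len(percents):
        raise ValueError("need at least two paired observations")
    mean_x = sum(days) / n
    mean_y = sum(percents) / n
    sxx = sum((x - mean_x) ** 2 for x in days)
    if sxx == 0:
        raise ValueError("all days are identical; slope is undefined")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(days, percents))
    return sxy / sxx


# Hypothetical animal: percent of 1-10 um particles falling by 2.2 points per day.
example_days = [0, 1, 2, 3]
example_percents = [40.0, 37.8, 35.6, 33.4]
example_slope = ols_slope(example_days, example_percents)
```

      Comparing two groups then reduces to the difference between their slopes, which is the quantity reported in the response (for example, Delta vs. controls).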

      • Fig 4: what is "limit of quality" mentioned in the legend? Are these samples undetectable?

      We have clarified this in the legend: “3.3 = limit of detection for RNA (<10 copies/rxn)”. The limit of detection is 10 copies/rxn. Samples below 10 copies/rxn are considered negative and are set to 10 copies/rxn, which equals 3.3 log10 copies/mL of oral swab.
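
      As an illustration of this censoring rule, the sketch below sets every value under the limit of detection to the limit itself before log-transforming. The two constants come from the legend; the conversion factor between copies/rxn and copies/mL is an assumption inferred from those two numbers (the actual factor depends on extraction and reaction volumes that are not stated here), and the function names are hypothetical.

```python
import math

LOD_COPIES_PER_RXN = 10.0       # from the legend
LOD_LOG10_COPIES_PER_ML = 3.3   # from the legend

# Assumed scaling implied by the two legend values (not stated in the text).
COPIES_PER_ML_PER_COPY_PER_RXN = 10 ** LOD_LOG10_COPIES_PER_ML / LOD_COPIES_PER_RXN


def is_detected(copies_per_rxn):
    """A sample counts as positive only at or above the limit of detection."""
    return copies_per_rxn >= LOD_COPIES_PER_RXN


def log10_copies_per_ml(copies_per_rxn):
    """Censor values below the limit of detection, then convert to log10 copies/mL."""
    censored = max(copies_per_rxn, LOD_COPIES_PER_RXN)
    return math.log10(censored * COPIES_PER_ML_PER_COPY_PER_RXN)
```

      With this rule, every negative sample is plotted at 3.3 rather than at an arbitrary or undefined value, which is why the plotted points cluster on that line.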

      • Fig 4C would be easier to process in graphical rather than tabular form. The meaning of the colors is unclear.

      We agree with the reviewer that this is difficult to interpret, but we are uncertain if the same data in a tabular format would be easier to digest. We realized that the legend was misplaced and have added this back into the figure, which we hope clarifies the colors and the limit of detection.

      • Figure 4D & E are uninterpretable. What do the pie charts represent?

      We have remodeled this part of the figure to a schematic representation of the majority variant which transmitted for each individual sentinel, and have added a table (Table S1) which summarizes the exact sequencing results for the oral swabs. The reviewer is correct that it was difficult to interpret the pie charts, considering most values are either 0 or close to 100%. We hope this addresses the question. The legend states:

      Author response image 1.

      Airborne attack rate of Alpha and Delta SARS-CoV-2 variants. Donor animals (N = 7) were inoculated with either the Alpha or Delta variant with 10³ TCID50 via the intranasal route and paired together randomly (1:1 ratio) in 7 attack rate scenarios (A-G). To each pair of donors, one day after inoculation, 4-5 sentinels were exposed for a duration of 4 h (i.e., h 24-28 post inoculation) in an aerosol transmission set-up at 200 cm distance. A. Schematic figure of the transmission set-up. B. Day 1 sgRNA detected in oral swabs taken from each donor after exposure ended. Individuals are depicted. Wilcoxon test, N = 7. Grey = Alpha, teal = Delta inoculated donors. C. Respiratory shedding measured by viral load in oropharyngeal swabs; measured by sgRNA on day 2, 3, and 5 for each sentinel. Animals are grouped by scenario. Colors refer to legend below. 3.3 = limit of detection of RNA (<10 copies/rxn). D. Schematic representation of majority variant for each sentinel as assessed by percentage of Alpha and Delta detected in oropharyngeal swabs taken at day 2 and day 5 post exposure by deep sequencing. Grey = Alpha, teal = Delta, white = no transmission.

      • Fig S2G is uninterpretable. Please label and explain.

      We have now included an explanation of Figure S2G. The figure is a graphic representation of the neutralization data depicted in Figure S2F. The spacing between grid lines is 1 unit of antigenic distance, corresponding to a twofold dilution of serum in the neutralization assay. The resulting antigenic distance depicted between Alpha and Delta is roughly a 4-fold difference in neutralization between homologous (e.g., Alpha sera with the Alpha virus) and heterologous (e.g., Alpha sera with the Delta virus) pairings.
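
      The unit convention can be made concrete with a short sketch: because one grid unit corresponds to a twofold serum dilution, a fold difference in neutralization titer maps to grid units through a base-2 logarithm. The function name and the titers below are hypothetical, and the sketch covers only this unit conversion, not the fitting procedure that positions sera and viruses on the map.

```python
import math


def antigenic_units(homologous_titer, heterologous_titer):
    """Grid units separating two titers, with one unit per twofold dilution."""
    if homologous_titer <= 0 or heterologous_titer <= 0:
        raise ValueError("titers must be positive")
    return math.log2(homologous_titer / heterologous_titer)


# Hypothetical titers with a 4-fold drop against the heterologous virus.
example_units = antigenic_units(640, 160)
```

      Under this convention, the roughly 4-fold difference described above corresponds to about two grid units.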

      • I would consider emphasizing lines 220-225 in the summary and abstract. The important implication is that aerosol transmission is more representative of natural heterogeneity of exposure dose and downstream viral kinetics. This is an often-overlooked point.

      We agree with the reviewer and have added this in Line 43.

      • Fig 5: A cartoon similar to Fig 4A showing timing of sentinel exposure with number of animals would be helpful.

      We have added this as a new panel A for Figure 5. See the redrafted Figure 5 below.

      • For Fig 5E & F It would be helpful to use a statistical test to more formally assess whether proportion at exposure predicts proportion of variants in downstream sentinel infection.

      This has been added as a new Figure 5 panel H and I, which we hope addresses the reviewer’s comment.

      Author response image 2.

      Airborne competitiveness of Alpha and Delta SARS-CoV-2 variants. A. Schematic. Donor animals (N = 8) were inoculated with Alpha and Delta variant with 5 × 10² TCID50, respectively, via the intranasal route (1:1 ratio), and three groups of sentinels (Sentinels 1, 2, and 3) were exposed subsequently at a 16.5 cm distance. Animals were exposed at a 1:1 ratio; exposure occurred on day 1 (Donors → Sentinels 1) and day 2 (Sentinels → Sentinels). B. Respiratory shedding measured by viral load in oropharyngeal swabs; measured by gRNA, sgRNA, and infectious titers on days 2 and 5 post exposure. Bar-chart depicting median, 96% CI and individuals, N = 8, ordinary two-way ANOVA followed by Šídák's multiple comparisons test. C/D/E. Corresponding gRNA, sgRNA, and infectious virus in lungs and nasal turbinates sampled five days post exposure. Bar-chart depicting median, 96% CI and individuals, N = 8, ordinary two-way ANOVA, followed by Šídák's multiple comparisons test. Dark orange = Donors, light orange = Sentinels 1, grey = Sentinels 2, dark grey = Sentinels 3, p-values indicated where significant. Dotted line = limit of quality. F. Percentage of Alpha and Delta detected in oropharyngeal swabs taken at days 2 and 5 post exposure for each individual donor and sentinel, determined by deep sequencing. Pie-charts depict individual animals. Grey = Alpha, teal = Delta. G. Lung and nasal turbinate samples collected on day 5 post inoculation/exposure. H. Summary of data of variant composition, violin plots depicting median and quantiles for each chain link (left) and for each set of samples collected (right). Shading indicates majority of variant (grey = Alpha, teal = Delta). I. Correlation plot depicting Spearman r for each chain link (right, day 2 swab) and for each set of samples collected across all animals (left). Colors refer to legend on right. Abbreviations: TCID, Tissue Culture Infectious Dose.

      We have additionally added to the results section: L284: “Combined, a trend, while not significant, was observed for increased replication of Delta after the first transmission event, but not after the second, and in the oropharyngeal cavity (swabs) as opposed to lungs (Figure 5H) (Donors compared to Sentinels 1: p = 0.0559; Donors compared to Sentinels 2: p > 0.9999; Kruskal Wallis test, followed by Dunn’s test). Swabs taken at 2 DPI/DPE did significantly predict variant patterns in swabs on 5 DPI/DPE (Spearman’s r = 0.623, p = 0.00436) and virus competition in the lower respiratory tract (Spearman’s r = 0.60, p = 0.00848). Oral swab samples taken on day 5 strongly correlate with both upper (Spearman’s r = 0.816, p = 0.00001) and lower respiratory tract tissue samples (Spearman’s r = 0.832, p = 0.00002) taken on the same day (Figure 5I).”
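
      For readers less familiar with the statistic, the sketch below shows one way a Spearman rank correlation between variant proportions at two sampling times could be computed: both series are ranked (ties receive the average rank) and a Pearson correlation is taken on the ranks. The function names are hypothetical, p-values are not computed, and this is an illustration rather than the software used for the reported values.

```python
def average_ranks(values):
    """1-based ranks, with tied values sharing the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    if n < 2 or n != len(y):
        raise ValueError("need at least two paired observations")
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for a constant series")
    return sxy / (sxx * syy) ** 0.5


def spearman_r(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson_r(average_ranks(x), average_ranks(y))
```

      A call such as `spearman_r(percent_delta_day2, percent_delta_day5)`, with one value per animal in each list, would give the kind of coefficient reported above; both list names are placeholders.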

      • Fig 1A: how are pfu/hour inferred? This is somewhat explained in the supplement, but I found the inclusion of model output as the first panel confusing and am still not 100% clear how this was done. Consider, explaining this in the body of the paper.

      We have added a more detailed explanation of the PFU/h inference to the main text: The motivation for the model was to link more readily measurable quantities such as RNA measured in oral swabs to the quantity of greatest interest for transmission (infectious virus per unit time in the air). To do this, we jointly infer the kinetics of shed airborne virus and parameters relating observable quantities (infected sentinels, plaques from purified air sample filters) to the actual longitudinal shedding. The inferential model uses mechanistic descriptions of deposition of infectious virus into the air, uptake from the air, and loss of infectious virus in the environment to extract estimates of the key kinetic parameters, as well as the resultant airborne shedding, for each animal.

      We have added this information to L106 in the results and hope this clarifies the rationale and execution of the model.

      More minor points:

      • Line 292: "poor proxy" seems too strong as peak levels of viral RNA correlate with positive airway cultures. It might be more accurate to say that high levels of viral RNA during early infection only somewhat correlate with positive airway cultures.

      We have rephrased this to clarify that while peak RNA viral loads are predictive of positive cultures, measuring RNA, especially early during infection and only once, may not be sufficient to infer the magnitude or time-dependence of infectious virus shedding into the air. See Line 308: “We found that swab viral load measurements are a valuable but imperfect proxy for the magnitude and timing of airborne shedding. Crucially, there is a period early in infection (around 24 h post-infection in inoculated hamsters) when oral swabs show high infectious virus titers, but air samples show low or undetectable levels of virus. Viral shedding should not be treated as a single quantity that rises and falls synchronously throughout the host; spatial models of infection may be required to identify the best correlates of airborne infectiousness [32]. Attempts to quantify an individual’s airborne infectiousness from swab measurements should thus be interpreted with caution, and these spatiotemporal factors should be considered carefully.”

      • Line 352: Re is dependent on time of an outbreak (population immunity) and cannot be specified for a given variant as it depends on multiple other variables

      We agree that the current phrasing here could be interpreted to suggest, incorrectly, that Re is an intrinsic property of a variant. We have deleted that language and reworded the section to emphasize that the critical question is heterogeneity in transmission, not mean reproduction number. Line 348: “Moreover, at the time of emergence of Delta, a large part of the human population was either previously exposed to and/or vaccinated against SARS-CoV-2; that underlying host immune landscape also affects the relative fitness of variants. Our naïve animal model does not capture the high prevalence of pre-existing immunity present in the human population and may therefore be less relevant for studying overall variant fitness in the current epidemiological context. Analyses of the cross-neutralization between Alpha and Delta suggest subtly different antigenic profiles [35], and Delta’s faster kinetics in humans may have also helped it cause more reinfections and “breakthrough” infections [36].

      Our two transmission experiments yielded different outcomes. When sentinel hamsters were sequentially exposed, first to Alpha and then to Delta, generally no dual infections—both variants detectable—were observed. In contrast, when we exposed hamsters simultaneously to one donor infected with Alpha and another infected with Delta, we were able to detect mixed-variant virus populations in sentinels in one of the cages (Cage F, see Appendix figures S1, S2). The fact that we saw both single-lineage and multi-lineage transmission events suggests that virus population bottlenecks at the point of transmission do indeed depend on exposure mode and duration, as well as donor host shedding. Notably, our analysis suggests that the Alpha-Delta co-infections observed in the Cage F sentinels could be due to that being the one cage in which both the Alpha and the Delta donor shed substantially over the course of the exposure (Appendix figures S2, S3). Mixed variant infections were not retained equally, and the relative variant frequencies differed between investigated compartments of the respiratory tract, suggesting roles for randomness or host-and-tissue specific differences in virus fitness.

      A combination of host, environmental and virus parameters, many of which vary through time, play a role in virus transmission. These include virus phenotype, shedding in air, individual variability and sex differences, changes in breathing patterns, and droplet size distributions. Alongside recognized social and environmental factors, these host and viral parameters might help explain why the epidemiology of SARS-CoV-2 exhibits classic features of over-dispersed transmission [37]. Namely, SARS-CoV-2 circulates continuously in the human population, but many transmission chains are self-limiting, while rarer superspreading events account for a substantial fraction of the virus’s total transmission. Heterogeneity in the respiratory viral loads is high and some infected humans release tens to thousands of SARS-CoV-2 virions/min [38, 39]. Our findings recapitulate this in an animal model and provide further insights into mechanisms underlying successful transmission events. Quantitative assessment of virus and host parameters responsible for the size, duration and infectivity of exhaled aerosols may be critical to advance our understanding of factors governing the efficiency and heterogeneity of transmission for SARS-CoV-2, and potentially other respiratory viruses. In turn, these insights may lay the foundation for interventions targeting individuals and settings with high risk of superspreading, to achieve efficient control of virus transmission [40].”

      • The limitation section should mention that this animal model does not capture the large prevalence of pre-existing immunity at present in the population and may therefore be less relevant in the current epidemiologic context.

      We agree and have added this more clearly, see response above.

      • Limitation: it is unclear if airway and droplet dynamics in the hamster model are representative of humans.

      We have added the following sentence: Line 331: “It remains to be determined how well airway and particle size distribution dynamics in Syrian hamsters model those in humans.”

      • The mathematical model is termed semi-mechanistic but I think this is not accurate as the model appears to have no mechanistic assumptions.

      We describe the model as semi-mechanistic because it uses mechanistic descriptions of the shedding and uptake process (as described above), incorporating factors including respiration rate and environmental loss, and makes the mechanistic assumption that measurable swab and airborne shedding all stem from a shared within-host infection process that produces exponential growth of virus up to a peak, followed by exponential decay. The model is only semi-mechanistic, however, as we do not attempt a full model of within-host viral replication and shedding (e.g. a target-cell limited virus kinetics model).
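
      The shared kinetic assumption (exponential growth up to a peak, followed by exponential decay) can be sketched as a piecewise-linear curve on the log10 scale. This is a minimal illustration only: the function name, parameter names, and example values are hypothetical, and the sketch omits the parts of the inferential model that link this curve to swabs, air samples, respiration rate, and environmental loss.

```python
def log10_viral_load(t_hours, peak_log10, t_peak_hours, growth_rate, decay_rate):
    """log10 viral load at time t for exponential growth, then exponential decay.

    Exponential change on the natural scale is linear on the log10 scale,
    so both rates are expressed in log10 units per hour.
    """
    if growth_rate < 0 or decay_rate < 0:
        raise ValueError("rates must be non-negative")
    if t_hours < t_peak_hours:
        return peak_log10 - growth_rate * (t_peak_hours - t_hours)
    return peak_log10 - decay_rate * (t_hours - t_peak_hours)


# Hypothetical parameters: peak of 6 log10 units at 24 h, fast rise, slower decline.
example_curve = [
    log10_viral_load(t, peak_log10=6.0, t_peak_hours=24.0,
                     growth_rate=0.2, decay_rate=0.05)
    for t in (14.0, 24.0, 44.0)
]
```

      A fully mechanistic alternative, such as a target-cell limited model, would replace this curve with differential equations for target cells, infected cells, and free virus, which the response explicitly says was not attempted.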

    1. Author Response

      The following is the authors’ response to the previous reviews.

      We thank the reviewers for their reading of the manuscript, and their suggestions. We have extensively addressed all these concerns in the text, and also included several new data and figures in the revised version of the manuscript. We hope that our response and the new experimental data fully address the concerns raised by the reviewers. We include a detailed, point-by-point response to each of the reviewer concerns, pointing to new data and specific changes made in the main manuscript.

      Note: These new data have resulted in a new figure (Figure 6), a new supplementary figure (Figure 2-figure supplement 2), and an increase in the number of panels in each figure, as well as in the supplementary figures.

      General response comments, highlighting a few aspects missed by the reviewers

      This manuscript has an enormous amount of data in it. This is understandable, since in part we are proposing an entirely new hypothesis, and way to think about mitochondrial repression, built around substantial circumstantial evidence from diverse literature sources. But to keep the narrative readable and the main idea understandable, a lot of information had to be only very briefly mentioned in the text, and is therefore included as supplemental information. Due to that, it may not always be apparent that this study has set several technical benchmarks. These experiments are extremely challenging to perform, took many iterations to standardize, and in themselves are a first in the field. Yeast cells have the highest known rate of glycolytic flux for any organism. Measuring this glycolytic rate using the formation of intermediates is hard, and all current estimates have been in vitro, and using a stop-flow type set up. In this study, we optimized and directly measured the glycolytic flux using isotope labelled glucose (13C-glucose), which has never been reported before in highly glycolytic cells such as yeast. This is due to the very rapid label saturation (within seconds) after a 13C-glucose pulse (as is now shown in Figure 2-figure supplement 1). For brevity, this is summarized in this study with sufficient information to reproduce the method, but we will put out a more detailed, associated methodology paper describing several challenges, infrastructure requirements, and resources to be able to carry out these types of experiments using yeast. An added highlight of these experiments with WT and Ubp3 deletion strains is the most direct experimental demonstration to date that glycolytic flux in yeast in high glucose follows zero-order kinetics, and depends entirely on the amounts of the glycolytic enzymes (presumably operating at maximal activity). This nicely complements the recent study by Grigaitis 2022 (cited in the discussion), that suggests this possibility.

      Separately, this study required the estimation of total inorganic phosphates, as well as mitochondrial pools of phosphates. To date, there are no studies that have estimated mitochondrial pools of phosphate (for a variety of reasons). In this study, we also experimentally determined the changes in mitochondrial phosphate pools. For this, we had to establish and standardize a rapid mitochondrial isolation method in yeast. Thus, this study provides the first quantitative estimates of mitochondrial Pi amounts (in the context of measured mitochondrial outputs), as shown now in Figure 4. This component on mitochondrial isolation in yeast to assess metabolites may also be explored in future as a methods paper.

      Specific responses to the Reviews:

      Reviewer #1 (Public Review):

      The study by Vengayil et al. presented a role for Ubp3 for mediating inorganic phosphate (Pi) compartmentalization in cytosol and mitochondria, which regulates metabolic flux between cytosolic glycolysis and mitochondrial processes. Although the exact function of increased Pi in mitochondria is not investigated, findings have valuable implications for understanding the metabolic interplay between glycolysis and respiration under glucose-rich conditions. They showed that UBP3 KO cells regulated decreased glycolytic flux by reducing the key Pi-dependent glycolytic enzyme abundances, consequently increasing Pi compartmentalization to mitochondria. Increased mitochondria Pi increases oxygen consumption and mitochondrial membrane potential, indicative of increased oxidative phosphorylation. In conclusion, the authors reported that the Pi utilization by cytosolic glycolytic enzymes is a key process for mitochondrial repression under glucose conditions.

      (1) However, the main claims are only partially supported by the low number of repeats and utilizing only one strain background, which decreased the overall rigor of the study. The full-power yeast model could be utilized with testing findings in different backgrounds with increased biological repeats in many assays described in this study. In the yeast model, it has been well established that many phenotypes are genotype/strain dependent (Liti 2019, Gallone 2016, Boekout 2021, etc.), with some strains utilizing mitochondrial respiration even under high glucose conditions (Kaya 2021). It would be conclusive to test whether wild strains with increased respiration under high glucose conditions would also be characterized by increased mitochondrial Pi.

      “However, the main claims are only partially supported by the low number of repeats and utilizing only one strain background, which decreased the overall rigor of the study. The full-power yeast model could be utilized with testing findings in different backgrounds with increased biological repeats in many assays described in this study.”

      Thank you for the suggestion. We agree that a larger, universal statement cannot be made with data from a single strain, since yeasts do have substantial diversity. In this study, we had originally used a robust, prototrophic industrial strain (CEN.PK background). We have now utilized multiple, diverse strains of S. cerevisiae to test our findings. This includes strains from the common laboratory backgrounds – W303 and BY4742 – which have different auxotrophies, as well as another robust, highly flocculent strain from a prototrophic Σ1278 background. Using all these strains, we now find that the role of altered Pi budgeting as a constraint for mitochondrial respiration, and the role of Ubp3 as a regulator of mitochondrial repression, are very well conserved. In all tested strains of S. cerevisiae, the loss of Ubp3 increases mitochondrial activity (as shown by increased mitochondrial membrane potential and increased Cox2 levels in Figure 6A, B). These data now expand the generality of our findings, and strengthen the manuscript. These results are included in the revised manuscript as a new figure (Figure 6) and the associated text.

      Some of the included data in the revised manuscript are shown below:

      Author response image 1.

      Mitochondrial activity and Cox2 levels in ubp3Δ in different genetic backgrounds

      We also used the W303 strain to assess Pi levels, and its role in increasing mitochondrial respiration. We find that the loss of Ubp3 in this genetic background also increases Pi levels and that the increased Pi is necessary for increasing mitochondrial respiration (Figure 6C, D).

      Author response image 2.

      Basal OCR in WT vs ubp3Δ (W303 strain background) in normal vs low Pi

      These experiments collectively have strengthened our findings on the critical role of intracellular Pi budgeting as a general constraint for mitochondrial respiration in high glucose.

      “It would be conclusive to test whether wild strains with increased respiration under high glucose conditions would also be characterized by increased mitochondrial Pi.”

      Addressed partially above. Right now the relative basal respiration in glucose across different strains is not well known. We measured mitotracker activity in high glucose in multiple WT strains of S. cerevisiae (W303, Σ1278, S288C and BY4742, compared to the CEN.PK strain). These strains all largely had similar mitotracker potential, except for a slight increase in mitochondrial membrane potential in the Σ1278 strain. We further characterized this using Cox2 protein levels as well as basal OCR, and found that these do not increase. These data are shown below, and are not included in the main text since they do not add any new component to the study.

      Author response image 3.

      Mitochondrial respiration in different WT strains

      We did find this suggestion very interesting though, and are exploring directions for future research based on this suggestion. Since we have now identified a role for intracellular Pi allocation in regulating the Crabtree effect, an interesting direction can be to understand the glucose dependent mitochondrial Pi transport in Crabtree negative yeast strains. We will have to bring in a range of new tools and strains for this, so these experiments are beyond the focus of this current study.

      We hope that these new experiments in different genetic backgrounds increase the breadth and generality of our findings, and stimulate new lines of thinking to address how important the role of Pi budgeting as a constraint for mitochondrial repression in high glucose might be.

      (2) It is not described whether the drop in glycolytic flux also affects TCA cycle flux. Are there any changes in the pyruvate level? If the TCA cycle is also impaired, what drives increased mitochondrial respiration?

      Thank you for pointing this out, and we agree this should be included. We have addressed these concerns in the revised version of the manuscript.

      Since glucose-derived pyruvate must enter the mitochondrial TCA cycle, one possibility is that a decrease in glycolytic rate could decrease the TCA flux. An alternate possibility is that the cells concurrently increase pyruvate transport to mitochondria, thereby maintaining TCA cycle flux at a level comparable to that of WT cells. To test both these possibilities, we first measured the steady state levels of pyruvate and TCA cycle intermediates in WT vs ubp3Δ cells. We do not observe any significant change in the levels of pyruvate or TCA cycle intermediates (except malate, which showed a significant decrease in ubp3Δ cells). These data are now included in the revised manuscript as Figure 2 – figure supplement 1D and figure supplement 2 A, along with associated text.

      Author response image 4.

      Pyruvate levels in WT vs ubp3Δ

      Author response image 5.

      Steady state TCA cycle intermediate levels

      Next, in order to address if the TCA cycle flux is impaired in ubp3Δ cells, we also measured the TCA cycle flux in WT vs ubp3Δ cells by pulsing the cells with 13C glucose and tracking 13C label incorporation from glucose into TCA cycle intermediates. This experiment first required substantial standardization, for the time of cell collection and quenching post 13C glucose addition, by measuring the kinetics of 13C incorporation into TCA cycle intermediates at different time points after 13C glucose addition. The standardization of this method is now included in the revised manuscript as Figure 2 – figure supplement 2 C, along with associated text, and is shown below for reference.

      Author response image 6.

      Kinetics of 13C labelling in TCA cycle intermediates

      Actual TCA cycle flux results: For measuring the TCA cycle flux, cells were treated with 1% 13C glucose and quenched, and samples were collected at 7 min post glucose addition, which is in the linear range of 13C label incorporation (Figure 2 – figure supplement 2 C).
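      As an illustration of the kind of calculation behind a relative 13C label incorporation value, the following is a minimal sketch and not the analysis pipeline used in the manuscript, which is not described in this response. It assumes that labelling is summarised as the fraction of a metabolite pool carrying at least one 13C atom; the function names and all peak intensities are hypothetical.

```python
def fractional_labelling(isotopologues):
    """Fraction of a metabolite pool carrying at least one 13C atom.

    `isotopologues` holds peak intensities for M+0, M+1, ..., M+n
    (hypothetical input format, assumed for this sketch).
    """
    total = sum(isotopologues)
    if total <= 0:
        raise ValueError("total intensity must be positive")
    # Everything that is not the unlabelled M+0 species counts as labelled.
    return 1.0 - isotopologues[0] / total


def relative_incorporation(mutant, wild_type):
    """Labelled fraction in the mutant relative to the wild type."""
    return fractional_labelling(mutant) / fractional_labelling(wild_type)


# Hypothetical intensities at a single time point in the linear range.
wt_citrate = [800.0, 50.0, 150.0]
ko_citrate = [780.0, 60.0, 160.0]

if __name__ == "__main__":
    print(fractional_labelling(wt_citrate))
    print(relative_incorporation(ko_citrate, wt_citrate))
```

      A ratio close to 1 for each intermediate would correspond to unchanged relative label incorporation between the two strains.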

      Result: We did not observe any significant changes in the relative 13C label incorporation in TCA cycle intermediates. These data are included in the revised manuscript as Figure 2 – figure supplement 2 D, along with associated text, and are shown below for your reference.

      Author response image 7.

      TCA cycle flux

      What these data show is that the TCA cycle flux itself is not altered in ubp3Δ. A likely interpretation is that this is due to increased pyruvate transport to mitochondria in ubp3Δ cells, as indicated by the ~10-fold increase in Mpc3 (mitochondrial pyruvate transporter) protein levels (shown in Figure 5-figure supplement 5H), allowing the same net amount of pyruvate into the mitochondria. This increased mitochondrial pyruvate transport could help maintain the TCA flux in ubp3Δ cells, and support the increased respiration. Putting a hierarchy together, the increased respiration in ubp3Δ cells could therefore be primarily due to increased Pi transport, followed by a consequent increase in ETC proteins. We leave it to the readers of this study to make this conclusion.

      We hope that we have addressed all concerns that the reviewer has with respect to TCA cycle flux in ubp3Δ cells.

      (3) In addition, some of the important literature was also missed in citation and discussion. For example, in a recent study (Ouyang et al., 2022), it was reported that phosphate starvation increases mitochondrial membrane potential independent of respiration in yeast and mammalian cells, and some of the conflicting results were presented in this study.

      We are very aware of the recent study by Ouyang et al, which reports that Pi starvation increases mitochondrial membrane potential independent of respiration. However, this study is distinct from the context of our case due to the reasons listed below.

      (a) The reviewer may have misinterpreted our low Pi condition as Pi starvation. There is no Pi ‘starvation’ in this study. Here, we cultured ubp3Δ and tdh2Δtdh3Δ cells in a low Pi medium with 1 mM Pi concentration in order to bring down the intracellular free Pi to that of WT levels. These cells are therefore not Pi-starved, but have been manipulated to have the same intracellular Pi levels as that of WT cells, as shown in Figure 4-figure supplement 1D. The Pi concentration in the medium is still in the millimolar range, and the cells are grown in this medium for a short time (~4 hrs) till they reach OD600 ~ 0.8. This is entirely different from the conditions used in Ouyang et al., 2022, where the cells were grown in a Pi-starvation condition with 1-100 micromolar Pi in the medium for a time duration of 6-8 hrs. Since cells respond differentially to changes in Pi concentrations over time (Vardi et al., 2014), the response to low Pi vs Pi starvation will be completely different.

      (b) In our study, mitochondrial membrane potential is used as only one of the readouts for mitochondrial activity. Our estimations of mitochondrial respiration are established by including other measurements such as Cox2 protein levels (as an indicator of the ETC) and basal OCR measurements (measuring respiration), all of which provide distinct information. The mitochondrial membrane potential can be regulated independent of mitochondrial respiration state (Liu et al., 2021), so using membrane potential alone as a readout to estimate mitochondrial respiration can be limiting in the information it provides. As indicated earlier, mitochondrial membrane potential can change independent of mitochondrial respiration (Ouyang et al., 2022) and ATP synthesis (Liu et al., 2021). Since the focus of our study is mitochondrial respiration, and not just the change in membrane potential, drawing conclusions based on potential alone is ambiguous. Most studies in the field have in fact not used the comprehensive array of distinct estimates that we use in this study, and we believe the standards set in this study should become a norm for the field.

      (c) The only mutant that is similar to the Ouyang et al study is the Mir1 deletion mutant, which results in acute Pi starvation in mitochondria. In this strain, we find an increase in mitochondrial membrane potential. The data is not included in the manuscript but is shown below.

      Author response image 8.

      Mitochondrial potential in WT vs mir1Δ

      As is clear from these data, mitochondrial membrane potential is significantly high in mir1Δ cells. However, the basal OCR and Cox2 protein levels clearly show decreased mitochondrial respiration, which is expected in this mutant (Figure 5 A,B). This in fact highlights the limitations of solely relying on mitochondrial membrane potential measurements to draw conclusions, as doing so will lead to a misinterpretation of the actual mitochondrial activity in these cells. We do not wish to highlight limitations in other studies, but hope we make our point clear.

      (4) An additional experiment with strains lacking mitochondrial DNA under phosphate-rich and restricted conditions would further strengthen the result.

      Strains lacking mitochondrial DNA (Rho0 cells) cannot express the mitochondrially encoded ETC subunit proteins. These strains are therefore incapable of performing mitochondrial respiration. Since Rho0 cells are known to utilize alternate mechanisms to maintain their mitochondrial membrane potential (Liu et al., 2021), using mitotracker fluorescence as a readout of mitochondrial respiration in these strains under different Pi conditions is inconclusive and misleading due to the reasons mentioned in point number 3(b and c). However, since this was a concern raised by the reviewer, we now measured basal OCR in WT and Rho0 strains with Ubp3 deletion under normal vs low Pi medium. As expected, Rho0 cells show extremely low basal OCR values, an entire order of magnitude lower than WT cells. At these very low (barely detectable) levels the deletion of Ubp3 or change in Pi concentration in the medium does not change basal OCR, since these strains are not capable of respiration. We have included this data as Figure 4-figure supplement 1G.

      Author response image 9.

      Basal OCR in Rho0 cells

      (5) Western blot control panels should include entire membrane exposure, and non-cut western blots should be submitted as supplementary.

      The non-cut western blot images and the loading controls are now included in the revised manuscript as supplementary file 2.

      (6) In Figure 4, it is shown that Pi addition decreases basal OCR to the WT level. However, the Cox2 level remains significantly higher. This data is confusing as to whether mitochondrial Pi directly regulates respiration or not.

      As described in the previous point, the Cox2 levels and the OCR provide distinct pieces of information. In figure 4, we show that culturing ubp3Δ in low Pi significantly decreases both Cox2 protein levels and basal OCR. Since Cox2 protein levels and basal OCR are different readouts for mitochondrial activity, there could be differences in the extent by which Pi availability controls each of these factors. Basal OCR is a direct readout for mitochondrial respiration, and is regulated by multiple factors including ETC protein levels, rate of ATP synthesis, rate of Pi transport etc. In figure 4, we find that culturing ubp3Δ in low Pi decreases basal OCR to WT level. This strongly suggests that high Pi levels are necessary to increase basal OCR in ubp3Δ.

      (7) Representative images of Ubp3 KO and wild-type strains stained with CMXRos are missing.

      Thank you for noticing this. This data is now included in the revised manuscript as Figure 1-figure supplement 1C.

      Author response image 10.

      (8) Overall, mitochondrial copy number and mtDNA copy number should be analyzed in WT and Ubp3 KO cells as well as Pi-treated and non-treated cells, and basal OCR data should be normalized accordingly. The reported normalization against OD is not appropriate.

      This is a valid concern raised by the reviewer, and something we had extensively considered during the study. To normalize the total mitochondrial amounts in each strain, we always measure the protein levels of the mitochondrial outer membrane protein Tom70. While we had described this in the methods, it may not have been obvious in the text; this information is included in Figure 1-figure supplement 1G. We did not observe any significant change in Tom70 levels, suggesting that the total mitochondrial amount does not change in ubp3Δ, and we have noted this in the manuscript (results section relevant to Figure 1). As an additional control, to directly measure the mitochondrial amount in these conditions, we have now measured the mitochondrial volume in ubp3Δ cells and WT cells treated with Pi. For this, we used a strain that expresses a mitochondria-targeted mNeon green protein (described in Dua et al., JCB, 2023), and which can therefore be used to independently assess total mitochondrial amount. We do not observe any changes in mitochondrial volume or amounts in ubp3Δ cells or WT+Pi, compared to WT cells. Therefore, the changes in mitochondrial respiration upon Ubp3 deletion and Pi addition are not due to changes in total amounts of mitochondria in these conditions. Given all this, normalizing basal OCR to total cell number is the most appropriate approach. This is also conventionally used for basal OCR normalization in multiple studies.
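      The normalization argument can be made concrete with a small sketch. This is an illustration only, not the analysis code used in the study: the function name, the units, and all numbers are hypothetical. It assumes that a relative mitochondrial amount (for example from Tom70 levels or mitochondrial volume, expressed relative to WT) is available as a single ratio.

```python
def normalise_ocr(raw_ocr, cell_count, mito_amount_ratio=1.0):
    """Normalise a raw oxygen consumption rate.

    raw_ocr           -- measured OCR for the whole sample (arbitrary units)
    cell_count        -- number of cells in the sample
    mito_amount_ratio -- mitochondrial amount relative to WT (1.0 = unchanged)
    """
    if cell_count <= 0 or mito_amount_ratio <= 0:
        raise ValueError("cell_count and mito_amount_ratio must be positive")
    per_cell = raw_ocr / cell_count
    # Dividing by the relative mitochondrial amount gives OCR per unit of
    # mitochondria; with a ratio of 1.0 this equals the per-cell value.
    return per_cell / mito_amount_ratio


if __name__ == "__main__":
    # Hypothetical sample: unchanged mitochondrial amount (ratio 1.0).
    print(normalise_ocr(200.0, 1e7))
    # Hypothetical sample with twice the mitochondrial amount of WT.
    print(normalise_ocr(200.0, 1e7, mito_amount_ratio=2.0))
```

      When the mitochondrial amount is unchanged between conditions, as the Tom70 and mitochondrial volume measurements indicate, per-cell normalization and per-mitochondria normalization give the same relative differences.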

      We have now included these additional data on mitochondrial volumes and amounts in the revised manuscript as Figure 1-figure supplement 1F and Figure 5-figure supplement 1D, along with associated text, and they are shown below.

      Author response image 11.

      Mitochondrial volume in WT vs ubp3Δ cells

      Author response image 12.

      Mitochondrial volume in WT and WT+Pi

      These data collectively address the reviewer’s concerns regarding changes in mitochondrial amounts in all the conditions and strains used in this study.

      Reviewer #2 (Public Review):

      Summary:

      Cells cultured in high glucose tend to repress mitochondrial biogenesis and activity, a prevailing phenotype called the Crabtree effect that is observed in different cell types and cancer. Many signaling pathways have been put forward to explain this effect. Vengayil et al proposed a new mechanism involving Ubp3/Ubp10 and phosphate that controls the glucose repression of mitochondria. The central hypothesis is that ∆ubp3 shifts the glycolysis to trehalose synthesis, therefore leading to the increase of Pi availability in the cytosol, then mitochondria receive more Pi, and therefore the glucose repression is reduced.

      Strengths:

      The strength is that the authors used an array of different assays to test their hypothesis. Most assays were well-designed and controlled.

      Weaknesses:

      I think the main conclusions are not strongly supported by the current dataset.

      (1) Although the authors discovered ∆ubp3 cells have higher Pi and mitochondrial activity than WT in high glucose, it is not known if WT cultured in different glucose concentration also change Pi that correlate with the mitochondrial activity. The focus of the research on ∆ubp3 is somewhat artificial because ∆ubp3 not only affects glycolysis and mitochondria, but many other cellular pathways are also changed. There is no idea whether culturing cells in low glucose, which derepress the mitochondrial activity, involves Ubp3 or not. Similarly, the shift of glycolysis to trehalose synthesis is also not relevant to the WT cells cultured in a low-glucose situation.

      “The focus of the research on ∆ubp3 is somewhat artificial because ∆ubp3 not only affects glycolysis and mitochondria, but many other cellular pathways are also changed. There is no idea whether culturing cells in low glucose, which de-repress the mitochondrial activity, involves Ubp3 or not.”

      We would like to clarify that the focus of this research is not on Ubp3, or to address mechanistic aspects of how Ubp3 regulates mitochondrial activity, or to identify the targets of Ubp3. That would be an entirely distinct study, with a very different approach.

      In this study, while carrying out a screen, we serendipitously found that ubp3Δ cells showed an increase in mitochondrial activity in high glucose. Subsequently, we used this observation, bolstered by diverse orthogonal approaches, to identify a general, systems-level principle that governs mitochondrial repression in high glucose. Through this, we identify a role of phosphate budgeting as a controller of mitochondrial repression in high glucose. In this study, our entire focus has been to use orthogonal approaches, as well as parsimonious interpretations, to establish this new hypothesis as a possibility. We hope this idea, supported by these data, will now enable researchers to pursue other experiments to establish the generality of this phenomenon.

      We have not focused our effort in identifying the role of Ubp3, or its regulation upon changes in glucose concentration in this context. That is a very specific, and separate effort, and misses the general point we address here. It is entirely possible that Ubp3 might also regulate mitochondrial activity by additional mechanisms other than mitochondrial Pi availability (such as via the reduction of key glycolytic enzymes at nodes of glycolysis, resulting in reduced glycolytic flux and rerouted glucose metabolism). Had the goal been to identify Ubp3 substrates, it is very likely that we would not have found the role of Pi homeostasis in controlling mitochondrial respiration. This is particularly because the loss of Ubp3 does not result in an acute disruption of glycolysis, unlike say a glycolytic enzyme mutant, which would have resulted in severe effects on growth and overall metabolic state. This would have made it difficult to dissect out finer details of metabolic principles that regulate mitochondrial respiration.

      In order to further corroborate our findings, we used the glycolysis-defective mutant tdh2Δtdh3Δ cells, where we find a similar change in Pi balance. This complements the key observations made using ubp3Δ cells. Separately, we utilized the glycolytic inhibitor 2DG to independently assess the role of mitochondrial Pi transport in regulating respiration. Together, in this study we do not rely on genetic mutants alone, but combine the Ubp3 deletion strain with a reduced GAPDH activity strain and pharmacologic inhibition of glycolysis. In addition, we find that mitochondrial Pi transporter levels are repressed under high glucose (Figure 5C, Figure 5-figure supplement 1B). Further, we find that mitochondrial Pi transport is important in increasing mitochondrial respiration upon shift to low glucose and glycolytic inhibition by 2DG. Therefore, we collectively unravel a more systems-level principle that regulates glucose-mediated mitochondrial repression, as opposed to a mechanistic study of Ubp3 targets.

      Of course, given the conservation of Ubp3, we are very excited to pursue a mechanistic study of Ubp3 targets in future. This is a general challenge for deubiquitinase enzymes, and to date there are very few bona fide substrates known for any deubiquitinase enzyme, from any cellular system (due to challenges in the field that we discuss separately, and have included in the discussion section of this text).

      “Similarly, the shift of glycolysis to trehalose synthesis is also not relevant to the WT cells cultured in a low-glucose situation”

      The reviewer is correct in pointing out that in low glucose, the shift to trehalose synthesis might not be as relevant. We observe that the glycolysis-defective mutant tdh2Δtdh3Δ cells do not show an increase in trehalose synthesis (Figure 3-figure supplement 1E). However, in this context, the decrease in the rate of the GAPDH-catalysed reaction alone appears to be sufficient to increase the Pi levels (Figure 3F) even without an increase in trehalose. Therefore, there might be differences in the relative contributions of these two arms towards Pi balance, based on whether it is low glucose in the environment, or a mutant such as ubp3Δ that modulates glycolytic flux. In ubp3Δ cells, the combination of a low rate of the GAPDH-catalysed reaction and high trehalose will occur (based on how glycolytic flux is modulated), vs only the low rate of the GAPDH-catalysed reaction in tdh2Δtdh3Δ cells. As an end point the increase in Pi happens in both cases, but with slightly differing outcomes. It is also to be noted that in terms of free Pi sources, a low-glucose condition (with low glycolytic rate) is very different from a no-glucose, respiratory condition (where cells perform very high gluconeogenesis). In high respiration conditions such as ethanol, cells switch to high gluconeogenesis, where there is a huge increase in trehalose synthesis as a default (e.g. see Varahan et al 2019). In this condition, trehalose synthesis could be a major source of Pi (e.g. see Gupta 2021), and could support the increased mitochondrial respiration. In an ethanol medium, the directionality of the GAPDH reaction is reversed. Therefore, this reaction will also now become an added source of Pi, instead of a consumer of Pi (see illustration in Figure 3G). Therefore, a reasonable interpretation is that a combination of increased trehalose and increased 1,3 BPG to G3P conversion can be a major Pi source for increasing mitochondrial respiration in a non-glucose, respiratory medium.

      “it is not known if WT cultured in different glucose concentration also change Pi that correlate with the mitochondrial activity”

      This is a valid point raised by the reviewer. We have already found that the protein levels of the mitochondrial Pi transporter are increased in a non-glucose respiratory (ethanol) medium and a low (0.1%) glucose medium (see Figure 5C, Figure 5-figure supplement 1B). In addition, we have tried measuring mitochondrial Pi levels in cells grown in a high glucose medium vs a respiratory, ethanol medium. The results are shown below for the reviewer’s reference.

      Author response image 13.

      Mitochondrial Pi levels in ethanol vs glucose

      We observe a clear trend where mitochondrial Pi levels are high in cells grown in ethanol medium compared to those of cells grown in high glucose. However, the estimation of Pi, and normalising the Pi levels in isolated mitochondria, is extremely difficult in this condition (note that this has never been done before). This is likely due to a rapid rate of conversion of ADP and Pi to ATP (in ethanol), which increases the variation in the estimation of steady state Pi levels, and the high amounts of mitochondria in ethanol-grown cells. Since the data show high variation, we have not included them in the manuscript, but we are happy to include them here in the response.

      Indeed, this study opens up the exciting question of addressing how intracellular Pi allocation is regulated in different conditions of glucose. This can be further extended to Crabtree negative strains such as K. lactis which do not show mitochondrial repression in high glucose. All of these are rich future research programs.

      (2) The central hypothesis that Pi is the key constraint behind the glucose repression of mitochondrial biogenesis/activity is supported by the data that limiting Pi will suppress mitochondrial activity increase in these conditions (e.g., ∆ubp3). However, increasing the Pi supply failed to increase mitochondrial activity. The explanation put forward by the authors is that increased Pi supply will increase glycolysis activity, and somehow even reduce the mitochondrial Pi. I cannot understand why only the increased Pi supply in ∆ubp3, but not the increased Pi by medium supplement, can increase mitochondrial activity. The authors said "...that ubp3Δ do not increase mitochondrial Pi by merely increasing the Pi transporters, but rather by increasing available Pi pools". They showed that ∆ubp3 mitochondria had higher Pi but WT cells with medium Pi supplement showed lower Pi, it is hard to understand why the same Pi increase in the cytosol had a different outcome in mitochondrial Pi. Later on, they showed that the isolated mito exposed to higher Pi showed increased activity, so why can't increased Pi in intact cells increase mito activity? Moreover, they first showed that ∆ubp3 had a Mir1 increase in Fig3A, then showed no changes in FigS4G. It is very confusing.

      “I cannot understand why only the increased Pi supply in ∆ubp3, but not the increased Pi by medium supplement, can increase mitochondrial activity.”

      This is an interesting point, that requires a nuanced explanation, which we try to provide below.

      For mitochondrial respiration to increase in the presence of high Pi, the cytosolic Pi has to be transported to the mitochondria sufficiently. In ubp3Δ the increased free Pi (as a consequence of rewired glycolysis) is transported to the mitochondria (Figure 4). This increased mitochondrial Pi can therefore increase mitochondrial respiration in ubp3Δ.

      In case of WT+Pi, the externally supplemented Pi cannot further enter mitochondria (as shown in Figure 5-Figure supplement 1C) and is most likely restricted to the cytosol. Because of this inability of the Pi to access mitochondria, the mitochondrial respiration does not increase in WT+Pi (Figure 5-Figure supplement 1E).

      The likely reason for this difference in mitochondrial Pi transport in ubp3Δ vs WT+Pi is the relative difference in their glycolytic rate. The glycolytic rate is inherently decreased in ubp3Δ, but not in WT+Pi. To dissect this possibility of glycolytic rate itself contributing to the Pi availability in the mitochondria, we inhibited glycolysis in WT cells (using 2DG), and then supplemented Pi. Compared to cells in the same glucose condition (with 2DG, but without supplementing excess Pi), now the WT+Pi (+2DG) has higher mitochondrial respiration (Figure 5-Figure supplement 1F). This suggests that a combination of low glycolysis and high Pi is required for increasing mitochondrial respiration (as elaborated in the discussion section of the manuscript).

      An obvious question that arises out of this observation is how does the change in glycolytic rate regulate mitochondrial Pi transport. One consequence of altering the glycolytic rate is a change in cytosolic pH. This itself will bear on the extent of Pi transport into mitochondria, as discussed in detail below.

      In mitochondria, Pi is co-transported along with protons. Therefore, changes in cytosolic pH (which change the proton gradient) can control mitochondrial Pi transport (Hamel et al., 2004). Glycolytic rate is a major factor that controls cytosolic pH. The cytosolic pH in highly glycolytic cells is ~7, and decreasing glycolysis results in cytosolic acidification (Orij et al., 2011). Therefore, under conditions of decreased glycolysis (such as loss of Ubp3), cytosolic pH becomes acidic. Since mitochondrial Pi transport is dependent on the proton gradient, a low cytosolic pH would favour mitochondrial Pi transport. Therefore, under conditions of decreased glycolysis (2DG treatment, or loss of Ubp3), where cytosolic pH would be acidic, increasing cytosolic Pi might indirectly increase mitochondrial Pi transport, thereby leading to increased respiration.
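      To give a sense of scale for the pH argument, the short sketch below converts a pH shift into a fold change in free proton concentration using the definition pH = -log10[H+]. The starting value of ~7 is taken from the text above; the acidified value of 6.5 is purely an assumption chosen for illustration and is not a measurement from this study, and the function names are hypothetical.

```python
def proton_concentration(ph):
    """Free proton concentration in mol/L for a given pH."""
    return 10.0 ** (-ph)


def proton_fold_change(ph_from, ph_to):
    """Fold change in [H+] when the pH moves from ph_from to ph_to."""
    return proton_concentration(ph_to) / proton_concentration(ph_from)


if __name__ == "__main__":
    # pH ~7 in highly glycolytic cells (from the text); 6.5 is an assumed,
    # illustrative value for an acidified cytosol.
    print(proton_fold_change(7.0, 6.5))
```

      Because the pH scale is logarithmic, even a shift of half a pH unit corresponds to a severalfold change in the cytosolic proton concentration available for co-transport.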

      To explain this and integrate all these points, we have extended a discussion section in this manuscript. We include this section below:

      “Supplementing Pi under conditions of low glycolysis (where mitochondrial Pi transport is enhanced), as well as directly supplementing Pi to isolated mitochondria, increases respiration (Figure 5, Figure 5-figure supplement 1). Therefore, in order to derepress mitochondria, a combination of increased Pi along with decreased glycolysis is required. An additional systems-level phenomenon that might regulate Pi transport to the mitochondria is the decrease in cytosolic pH upon decreased glycolysis (60, 61). The cytosolic pH in highly glycolytic cells is ~7, and decreasing glycolysis results in cytosolic acidification (60, 61). Therefore, under conditions of decreased glycolysis (2DG treatment, deletion of Ubp3, and decreased GAPDH activity), cytosolic pH becomes acidic. Since mitochondrial Pi transport itself is dependent on the proton gradient, a low cytosolic pH would favour mitochondrial Pi transport (62). Therefore, under conditions of decreased glycolysis (2DG treatment, or loss of Ubp3, or decreased GAPDH activity), where cytosolic pH would be acidic, increasing cytosolic Pi might indirectly increase mitochondria Pi transport, thereby leading to increased respiration. Alternately, increasing mitochondrial Pi transporter amounts can achieve the same result, as seen by overexpressing Mir1 (Figure 5).”

      This possibility of changes in cytosolic pH regulating mitochondrial Pi transport, and thereby respiration, is a really interesting future research question, and an idea that has not been explored to date. This can stimulate new lines of thinking towards finding conserved biochemical principles that control mitochondrial repression in high glucose.

      “Moreover, they first showed that ∆ubp3 had a Mir1 increase in Fig3A, then showed no changes in FigS4G. It is very confusing”

      The increase in Mir1 in ubp3Δ shown in Figure 3A comes from the analysis of the proteomics dataset from a previous study (Isasa et al., 2015). Subsequently, we assessed Mir1 levels directly and more systematically, and did not observe an increase in Mir1 (Figure 4-figure supplement 1H in revised manuscript). It is entirely possible that in a large-scale study (as in Isasa 2015), some specific proteomic targets might not fully reproduce when tested very specifically (as is described in Handler et al., 2018 and Mehta et al., 2022). We do clearly indicate this in the text, but given the density of information in this study, it is understandable that this point was missed by the reviewer.

      (3) Given that there is no degradation difference for these glycolytic enzymes in ∆ubp3, and the authors found transcriptional level changes, suggests an alternative possibility where ∆ubp3 may signal through unknown mechanisms to parallelly regulate both mitochondrial biogenesis and glycolytic enzyme expression. The increase of trehalose synthesis usually happens in cells under proteostasis stress, so it is important to rule out whether ∆ubp3 signals these metabolic changes via proteostasis dysregulation. This echoes my first point that it is unknown whether wild-type cells use a similar mechanism as ∆ubp3 cells to regulate the glucose repression of mitochondria.

      We appreciate this point raised by the reviewer, but this again requires some clarification (as made earlier). The goal of this study was to identify systems-level principles that explain mitochondrial repression in high glucose. Although we started by performing a screen to identify proteostatic regulators of mitochondrial activity in high glucose, and identified Ubp3 as a mediator of mitochondrial activity, our approach was to use ubp3Δ cells as a model to understand the metabolic principles that regulate mitochondrial repression. This has been reiterated repeatedly in the manuscript – for example lines 123-124 “We therefore decided to use ubp3Δ cells to start delineating requirements for glucose-mediated mitochondrial repression.” and again in the discussion section – lines 442-460, where we discuss some unique advantages of using ubp3Δ cells to understand a general basis of mitochondrial regulation. To test this hypothesis, we also used orthogonal approaches, as well as other mutants and conditions with defective glycolysis, such as tdh2Δtdh3Δ cells and 2DG treatments. Only with these multiple converging lines of evidence do we infer that there might be a role of the change in Pi balance (due to changes in glycolytic rate) in regulating mitochondrial activity.

      We certainly agree that there is great value in identifying the mechanistic details of how Ubp3 regulates mitochondria. But this requires very distinct approaches not pursued in this study. This is not the question that we are addressing in this story. Separately, identifying targets of DUBs is one of the exceptional challenges in biology, since there are currently no straightforward chemical-biology approaches to do so for this class of proteins. Unlike kinase/phosphatase systems, or even ubiquitin ligases, substrate trapping mutants etc have proven to be abject failures in identifying direct targets of DUBs. A quantitative proteomics study might suggest some proteins/cellular processes regulated by Ubp3. This has been attempted for several DUBs, but rarely have any direct substrates of DUBs ever been identified, in any system. A high quality quantitative, descriptive proteome dataset of ubp3Δ cells is already available from a previous study (Isasa et al., 2015), which we cite extensively in this manuscript, and which indeed was invaluable for this study. We cannot improve on the outstanding quality dataset already available. Interestingly, the findings of that study actually help substantiate our idea of an increased mitochondrial activity and change in Pi homeostasis in ubp3Δ cells. The Isasa et al dataset finds that proteins involved in mitochondrial respiration are high in ubp3Δ cells, and that the glycolytic enzymes and PHO regulon proteins are reduced. In our study, using these data references, we were able to conceptually piece together how changes in glycolytic flux can alter Pi balance.

      Apart from identifying changes in protein levels, a separate challenge in making sense of this quantitative proteomics data is the difficulty in pinpointing any target of Ubp3 that specifically regulates these processes. A single DUB can have multiple substrates, and this could regulate the cellular metabolic state in a combinatorial manner. This is the essence of all signaling regulators in how they function, and it is therefore important to understand what their systems-level regulation of cell states are (separate from their specific individual substrates). Therefore, identifying the specific target of Ubp3 responsible for this metabolic rewiring can be very challenging. These experiments are well beyond the scope or interest of the current manuscript.

      If we had pursued that road in this study, we would not have made any general findings related to Pi balance, nor would this more general hypothesis have emerged.

      (4) Other major concerns:

      (a) The authors selectively showed a few proteins in their manuscript to support their conclusion. For example, only Cox2 and Tom70 were used to illustrate mitochondrial biogenesis difference in line 97. Later on, they re-analyzed the previous MS dataset from Isasa et al 2015 and showed a few proteins in Fig3A to support their conclusion that ∆ubp3 increases mitochondrial OXPHOS proteins. However, I checked that MS dataset myself and saw that many key OXPHOS proteins do not change, for example, both ATP1 and ATP2 do not change, which encode the alpha and beta subunits of F1 ATPase. They selectively reported the proteins' change in the direction along with their hypothesis.

      To clarify, we observe an increase in Cox2 protein levels but not in Tom70 levels which suggests that there is no increase in mitochondrial biogenesis. The increase is specific to some respiration related mitochondrial proteins such as Cox2 (Figure 1E, Figure 3A). We have clearly pointed out this in the manuscript. We used Cox2 protein levels as an additional readout for ETC activity, to validate our observations coming from the potentiometric mitotracker readouts, and basal oxygen consumption rate (OCR) measurements. This was for 3 reasons: Cox2 is a mitochondrial genome encoded subunit of the complex IV (cytochrome c oxidase) in the ETC, and has a redox centre critical for the cytochrome c oxidase activity. The biogenesis and assembly of complex IV subunits have been studied with respect to multiple conditions such as glucose availability and hypoxia and the expression and stability of the mitochondrial encoded complex IV subunits are exceptionally well correlated to changes in mitochondrial respiration (Fontanesi et al., 2006). Cox2 is very well characterised in S. cerevisiae, and the commercially available Cox2 antibodies are outstanding, which makes estimating Cox2 levels by western blotting unambiguous and reproducible.

      We re-analyzed the proteomic dataset from Isasa et al to find out additional information regarding the key nodes that are differentially regulated in ubp3Δ. We have not claimed at any point in the manuscript that all OXPHOS related proteins are upregulated in ubp3Δ, nor is there any need for that to be so. We identified Ubp3 from our screen, observed an increase in mitochondrial potential, basal OCR, and Cox2 levels. We later found out that the proteomic data set for ubp3Δ also supports our observations that mitochondrial respiration is upregulated in ubp3Δ. The reviewer points out that we “showed a few proteins in Fig3A to support their conclusion that ∆ubp3 increases mitochondrial OXPHOS proteins”. Our conclusion is that the deletion of Ubp3 increases mitochondrial respiration. The combined readouts which we used to reach this conclusion (OCR, mitochondrial potential, mitochondrial ATP production, Cox2 levels) are far more direct, comprehensive and conclusive than showing an increase in a few proteins related to OXPHOS, as also explained earlier toward a distinct reviewer query. Since different mitochondrial proteins are regulated by different mechanisms, we need not see an increase in all the OXPHOS proteins in a mutant like ubp3Δ where mitochondrial respiration is high. An increase in some key proteins would be sufficient to increase the respiration as seen in our case.

      To summarise, the proteomic dataset supports our observation, but our conclusions are not dependent on the increase in OXPHOS proteins observed in the dataset.

      (b) The authors said they deleted ETC component Cox2 in line 111. I checked their method and table S1, I cannot figure out how they selectively deleted COX2 from mtDNA. This must be a mistake.

      Yes, we understand that for mitochondrially encoded proteins, a simple knock-out strategy has limitations. However, we first tried to generate the Cox2 deletion mutant by a standard PCR mediated gene deletion strategy (Longtine 1998), with the optimistic assumption that even if all Cox2 is not lost, a substantial fraction of the Cox2 genes would be lost via recombination. We selected the transformants after strong antibiotic selection, and then we measured the Cox2 protein levels. Gratifyingly, we found that the mutant strain had substantially decreased Cox2 protein levels (but not a complete loss), and this was retained across generations. The data is shown below.

      Author response image 14.

      Cox2 levels in WT vs Cox2 mutants

      Since the mutants have decreased Cox2 levels, we went ahead and performed growth assays using this strain, in a WT or Ubp3 deletion background. Deletion of Ubp3 in the Cox2 mutant resulted in a more severe growth defect.

      However, we fully agree that this strain is not a complete Cox2 knockout, and it is possible that the decrease in Cox2 is due to modifications in some other unrelated gene. In the text, we should also not have named this cox2Δ. Since we are not sure of the exact genetic modification in this mutant, we have removed this data from the revised manuscript.

      Instead, we have now repeated all experiments, utilizing a fully characterised Cox2 mutant, cox262, described in (5), which has defective respiration. In this revised version, we find that deletion of Ubp3 in this strain retains the originally observed severe growth defect in glucose. This is consistent with our conclusion that functional mitochondria are required for proper growth in the ubp3Δ mutant. To separately validate this conclusion, we also utilized a Rho0 strain which does not have mitochondrial DNA and thereby cannot perform mitochondrial respiration. We show that deletion of Ubp3 results in a more severe growth defect in a Rho0 strain. These results are included in the revised manuscript as figure 1-figure supplement 1 I.

      Author response image 15.

      Also, we further confirmed that the Rho0 strain and the Rho0 ubp3 strain are incapable of respiration, using the seahorse assay. This data is included in the revised manuscript as Figure 4-figure supplement 1G.

      Author response image 16.

      Basal OCR in Rho0 cells

      We hope that these new data address the reviewer’s concerns about the Cox2 mutant.

      (c) They used sodium azide in a lot of assays to inhibit complex IV. However, this chemical is nonspecific and broadly affects many ATPases as well. Not sure why they do not use more specific inhibitors that are commonly used to assay OCR in seahorse.

      We have now performed growth assays for WT and ubp3Δ cells in the presence of specific mitochondrial OXPHOS inhibitors - oligomycin and FCCP. We observe a more severe growth defect in ubp3Δ cells compared to WT cells in the presence of oligomycin and FCCP, similar to the results observed with sodium azide. All these data are now included in the revised manuscript as Figure 1I, Figure1-figure supplement 1H, along with associated text.

      Author response image 17.

      Growth rate in the presence of FCCP

      Author response image 18.

      Figure1-figure supplement 1H- Growth rate in the presence of oligomycin

      We hope that these new data address the reviewer’s concerns.

      (d) The authors measured cellular Pi level by grinding the entire cells to release Pi. However, this will lead to a mix of cytosolic and vacuolar Pi. Related to this caveat, the cytosol has ~50mM Pi, while only 1-2mM of these glycolysis metabolites, I am not sure why the reduction of several glycolysis enzymes will cause significant changes in cytosolic Pi levels and make Pi the limiting factor for mitochondrial respiration. One possibility is that the observed cytosolic Pi level changes were caused by the measurement issue mentioned above.

      The Pi estimation shown in figure 3 C, E, F and G is a measure of total Pi in the cells. The vacuole is a major storehouse of phosphate in cells. However, unlike plant cells where free phosphate is stored in vacuoles, yeast vacuoles store phosphate only in the form of polyphosphates (Yang et al., 2017, Hürlimann et al., 2007). The free Pi formed from the hydrolysis of polyphosphate is subsequently transported to cytosol via the exporter Pho91 (Hürlimann et al., 2007). This therefore makes cytosol and mitochondria the major storage of usable free Pi in yeast. Since the malachite green assay that we use for phosphate estimation is specific to free Pi, and not polyphosphate, the Pi estimates that we show in figure 3 come from a combination of cytosolic and mitochondrial Pi. As explained earlier, in order to specifically measure mitochondrial Pi, we have established methods to rapidly isolate mitochondria, and then followed this by estimating Pi in these isolated mitochondria (Figure 4B). Here we clearly see a large increase in mitochondrial Pi in the Ubp3 deletion cells. This allows us to estimate the changes in Pi levels that are specific to mitochondria, without relying only on total Pi changes.

      “the cytosol has ~50mM Pi, while only 1-2mM of these glycolysis metabolites, I am not sure why the reduction of several glycolysis enzymes will cause significant changes in cytosolic Pi levels and make Pi the limiting factor for mitochondrial respiration”

      The reviewer has completely missed the fact that the glycolytic rate in yeast is the highest known for any cell. While the steady state levels of glycolytic metabolites might be ~2 mM, the process of glycolysis is not static but is rapid and continuous. Glucose is continuously broken down and converted to pyruvate, along with the consumption of Pi and generation of ATP. This is the reason for the rapid 13C label saturation (within seconds of 13C glucose addition) in yeast cells (Figure 2-figure supplement 1F). This instantaneous label saturation makes accurate flux measurements arduous because of which we had to optimize a method for measuring glycolytic flux in yeast cells (Figure 2-D, Figure 2-figure supplement 1F). Indeed, for that reason, our measurements of glycolytic flux in yeast are the first time this is being reported in the field. This in itself is an enormously challenging experiment, and establishes a new benchmark.

      In highly glycolytic cells, most of the ATP is synthesized via glycolysis and the rate of glycolysis and ATP synthesis is very high. In the reaction catalysed by GAPDH, Pi and ADP are converted to ATP. This ATP formed acts as a Pi donor to most of the Pi consuming reactions in the cells. Some of these processes, such as protein translation, utilize ATP but release Pi and ADP, and this Pi enters the cellular Pi pool. Several other reactions such as nucleotide biosynthesis, polyphosphate biosynthesis and protein phosphorylation use ATP as a Pi donor and the Pi is fixed in biomolecules. Increasing the rates of these ‘Pi sinks’ therefore can result in a decrease in Pi pools. This is a concept we have earlier tried to clarify more elaborately in (Gupta and Laxman, 2021). In fact, increasing nucleotide biosynthesis and polyphosphate synthesis has earlier been suggested to decrease available free Pi (Austin and Mayer 2020, Desfougères et al., 2016). When glycolytic flux is high, this is coupled/tuned to the consumption of Pi which will be correspondingly high due to increased ATP, nucleotide and polyphosphate synthesis. Pi levels rapidly decrease upon glucose addition, due to the continuous Pi consumption during glycolysis (Hohmann et al., 1996, Van Heerden et al., 2014, Koobs et al., 1972). Therefore, changes in glycolytic rate due to change in glycolytic enzyme levels can result in significant changes in Pi levels due to changes in Pi consumption rate.

      Our results also show that, apart from Pi levels, the glycolytic state can regulate mitochondrial Pi transport as well. This is the reason for mitochondrial Pi levels and basal OCR not increasing merely by adding Pi to cells. We show that basal OCR can be increased by adding Pi in the presence of 2DG. This regulation of mitochondrial Pi transport is a major limiting factor for mitochondrial respiration and could be mediated partly by the regulation of Mir1 levels and also by the changes in the cytosolic pH, which regulates the rate of mitochondrial Pi transport. We have discussed these points in the discussion section in our manuscript.

      We hope that this clarifies the reviewer’s concerns regarding how changes in glycolytic rate can regulate changes in cytosolic Pi levels.

      (e) The authors used ∆mir1 and MIR1 OE to show that Pi viability in the mitochondrial matrix is important for mitochondrial activity and biogenesis. This is not surprising as Pi is a key substrate required for OXPHOS activity. I doubt the approach of adding a control to determine whether Pi has a specific regulatory function, while other OXPHOS substrates, like ADP, O2 etc do not have the same effect.

      To clarify, we only used the mir1Δ cells to understand the requirement for Pi transport from cytosol to mitochondria in controlling respiration. The reviewer is correct in stating that deletion of Mir1 would reduce Pi import to mitochondria and thereby inhibit respiration. This is exactly the conclusion we suggest from this experiment as stated in the manuscript – “These data suggest that mitochondrial Pi transport (via Mir1) is critical for maintaining basal mitochondrial activity even in high glucose”. We have only used these experiments to support the idea that even though glycolysis and mitochondria are in different compartments, a change in Pi balance in one compartment (cytosol) can affect Pi levels in the other (mitochondria) since there is Pi transport between these two compartments. Since mitochondria have their own polyphosphate reserves, in the absence of these experiments with mir1Δ cells it can be imagined that mitochondrial PolyP can be an additional source of Pi to support respiration, and therefore changes in cytosolic Pi may have only a minor effect on mitochondrial respiration. Our experiments with mir1Δ and Mir1-OE cells indubitably suggest that Pi transport to mitochondria from cytosol is important for respiration, and therefore changes in cytosolic Pi levels (or maintaining cytosolic Pi at a lower level due to the rate of glycolysis) will have rippling effects in mitochondrial Pi availability. Further, these data suggest that for example under glycolytic inhibition (low glucose, or 2DG), while all factors (signalling, substrate availability etc) favour respiration (and mitochondrial derepression), cells are unable to achieve this in the absence of ample Pi transport from cytosol. This therefore places Pi at the centre stage in controlling mitochondrial respiration.

      We conclude that Pi is a major, but not the only constraint for mitochondrial respiration. There certainly could be a role for ADP, oxygen availability etc in regulating respiration. However, these are beyond the scope of our study. We have discussed the potential role of ADP in regulating mitochondrial repression in the discussion section. “An additional consideration is the possible contribution of changes in ADP in regulating mitochondrial activity, where the use of ADP in glycolysis might limit mitochondrial ADP. Therefore, when Pi changes as a consequence of glycolysis, it could be imagined that a change in ADP balance can coincidentally occur. However, prior studies show that even though cytosolic ADP decreases in the presence of glucose, this does not limit mitochondrial ADP uptake, or decrease respiration, due to the very high affinity of the mitochondrial ADP transporter.”

      We hope that this clarifies the reviewer’s concerns regarding the use of Mir1 OE and mir1Δ strains.

      Recommendations for the authors:

      Reviewer #1 (Recommendations For The Authors):

      Some of the experiments should be repeated in other strain backgrounds for reproducibility and rigor.

      As discussed in the response to point number 1, we have now utilized multiple strains of S. cerevisiae to test our findings. We now find that our discoveries regarding the role of altered Pi budgeting as a constraint for mitochondrial respiration, and the role of Ubp3 as a regulator of mitochondrial repression are conserved across multiple genetic backgrounds of S. cerevisiae. These results are included in the revised manuscript as a new figure- Figure 6 and associated text. We used the W303, Σ1278 and BY4742 strains of S. cerevisiae to show that deletion of Ubp3 increases mitochondrial activity (as shown by increased mitochondrial membrane potential and increased Cox2 levels). Using the W303 strain we show that the deletion of Ubp3 increases Pi levels and that the increased Pi is necessary for increasing mitochondrial respiration (Figure 6C, D). These added experiments have substantially broadened the generality of our findings.

      The number of biological repeats needs to be increased in all experiments.

      We have increased the number of biological repeats in key experiments that show that the increased Pi levels are necessary for the increased mitochondrial respiration in ubp3Δ and tdh2Δtdh3Δ cells (revised Figure 4F). Apart from a few basal OCR measurements and mitotracker data in the supplementary figures, all our experiments are performed for 3 biological repeats. In case of basal OCR measurements, yeast cells have to be aliquoted to poly-L-lysine coated seahorse plates and centrifuged to ensure that the cells are properly settled. This is due to the non-adherent nature of yeast cells. During the centrifugation step, the wells in the two end rows cannot be utilized due to uneven settling of cells which affects the basal OCR readings in these wells. In case of several experiments that involve multiple samples, we therefore had to restrict the number of biological replicates to 2 (repeated independently), so that all samples could be accommodated in the plate.

      Full western blot images should be supplemented along with the other data.

      The complete western blot images are now included in the revised manuscript as supplementary file 2.

      TCA cycle flux should be analyzed and presented in the study to conclude some of the findings.

      As discussed in detail in the response to point number 2, we have performed steady state and flux measurements for TCA cycle intermediates. This data is now included as a new supplement figure- Figure 2-figure supplement 2.

      Reviewer #2 (Recommendations For The Authors):

      (1) In Fig. 2A, they should also include the gluconeogenesis enzymes (fructose 1,6 bisphosphatase, PEP carboxykinase, and pyruvate carboxylase) to exclude the possibility that glycolytic intermediates are not rerouted to gluconeogenesis.

      We measured the protein levels of Fbp1 (fructose 1,6 bisphosphatase) and Pck1 (PEP carboxykinase). We observed an increase in the protein levels in both enzymes in ubp3Δ. The data is shown below.

      Author response image 19.

      Fbp1 and Pck1 protein levels

      While we agree that this is an interesting observation which might help us in understanding the metabolic rewiring in ubp3Δ, we have not included this data in the current revised version of the manuscript due to two main reasons.

      (1) Since ubp3Δ cells have a defective glycolysis and therefore a defective glucose repression, the mRNA and protein levels of gluconeogenic enzymes which are usually glucose-repressed might increase. This might be a response at the level of transcription and translation of these enzymes and might or might not change the rate of gluconeogenesis in these cells. This is because of multiple other factors that regulate gluconeogenic flux such as allostery, mass action etc. Therefore, to avoid confounding our main points and since we cannot make a conclusive assumption on the gluconeogenic metabolism in these mutants, we don’t include this data. The primary focus of our story is the mitochondrial repression component. Understanding the feedback controls that alter gluconeogenesis in these mutants is beyond the scope of this study and could be addressed in a separate future study.

      (2) As we highlight extensively in the response letter and in the manuscript, our aim is not to understand the specific mechanistic role of Ubp3. In this manuscript, we identify the conserved constraints that control mitochondrial repression without focusing just on the role of Ubp3 in regulating this. Whether Ubp3 regulates gluconeogenesis is a question that could be addressed in a future study that focuses on identifying the altered signalling mechanisms in ubp3Δ and the targets of Ubp3.

      (2) In line 292, page 10, there is a typo "dermine".

      We apologize for this mistake. Corrected.

      (3) In Figure 5A, is there a reason why they chose 0.1% glucose condition as a low glucose condition? Also, is there a dose-dependent change in OCR or other mitochondrial functions according to the concentration of glucose?

      The glucose concentration of 0.1% was selected to decrease (but not completely remove) the available glucose. 0.1% glucose is considered as a standard low glucose condition in S. cerevisiae (Yin et al., 2003) and the effect of this glucose concentration on cellular processes has been extensively studied (Yin et al., 2003, Takeda et al., 2015 etc). <0.2% glucose is the critical threshold for activating respiratory metabolism (Takeda et al., 2015) and shifting cells to 0.1% glucose in our experiments will activate respiration, as we show in our data. However, this is very different from completely removing glucose or using an alternate carbon source such as ethanol, because this would result in full activation of gluconeogenesis. We further find that when cells are grown in ethanol, the gluconeogenic activation will also change the Pi homeostasis. This will in part be a result of the fully reversed direction of the GAPDH catalysed reaction (Figure 3G). If such a condition is used, it could lead to misinterpretations, and confound the conclusions that we make from this set of experiments where Pi homeostasis plays a major role. In 0.1% glucose it has been shown that gluconeogenesis is still partly repressed (Yin et al., 2003). The pathways utilizing alternate carbon sources still remain repressed (even though to a lower extent compared to 2% glucose) in 0.1% glucose (Yin et al., 2003). We hope that this clarifies the concerns regarding the rationale behind using 0.1% glucose in our experiments.

      The extent of glucose repression is dependent on the concentration of glucose. Glucose concentration >1% has been shown to activate degradation of mRNAs involved in alternate carbon utilization. Different signaling pathways involved in growth under glucose and glucose repression are regulated by glucose concentration. This is discussed in detail in Yin et al., 2002. We also observe (Figure 5-figure supplement 1A) a dose-dependent increase in mitochondrial membrane potential in the presence of 2DG. This also suggests that the rate of glycolysis (which could be also mediated by changes in glucose concentration) can regulate the extent of mitochondrial derepression.

    1. Author response:

      The following is the authors’ response to the original reviews.

      We would like to thank the reviewers for their constructive feedback and overall positive response to our manuscript. Reviewer #1 had no specific recommendations, so below we address Reviewer #2’s comments.

      Reviewer #2 (Recommendations For The Authors):

      Specific points

      (1) In Fig. 1H peptides selected are much more stable than the positive control KHN-FT, but they appear to be less stable than randomly selected 5 amino acid sequences. Are the differences between the randomly selected sequences and the selected sequences statistically significant?

      Thank you for the feedback. Yes, the differences are statistically significant by one-way ANOVA with Tukey’s multiple comparisons test. We’ve updated the figure legend to indicate this.

      (2) In Fig. 1I the FACS profile of 4x looks like that of KHN, but it is very difficult to see in the figure. Looking at the quantitation in Fig. 1J it is impossible to compare KHN with 4x as the KHN is on the baseline. Could this be improved by using a log scale to present the data.

      Thank you for pointing this out. We’ve improved the figure so the KHN is easier to see. In addition, we’ve attempted different ways to display these results, but settled on scaling the data between 0 and 1 as our comparison points. We’ve updated the main figure to more clearly show this result so the KHN is easier to compare.
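
      As a minimal sketch of what scaling between 0 and 1 can look like, the snippet below rescales values linearly so that one reference maps to 0 and another maps to 1. The function name, the choice of references and the numbers are all hypothetical and serve only as an illustration; the response does not state which samples were used as comparison points.

```python
def scale_between_references(values, low_ref, high_ref):
    """Rescale values linearly so that low_ref maps to 0 and high_ref maps to 1."""
    span = high_ref - low_ref
    if span == 0:
        raise ValueError("low_ref and high_ref must differ")
    return [(v - low_ref) / span for v in values]


# Made-up readouts, for illustration only (not data from the study).
raw = [0.2, 0.8, 1.4, 2.6]

# Assumption: the smallest and largest readouts act as the two comparison points.
scaled = scale_between_references(raw, low_ref=0.2, high_ref=2.6)
print(scaled)
```

      With this kind of scaling, a sample that sits on the baseline in raw units is still placed explicitly at one end of the axis, which makes it easier to compare against the other samples.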

      (3) In Fig. 2G and Fig. 2F don't really match up. It looks from Fig. 2G like there is still some degradation in the hrd1 deletion strain, but this is not reflected in the quantitation (Fig. 2H).

      To our eyes, the degradation in the hrd1 null strain appears to be quite small, which seems to be reflected in the quantification (~20% decrease over 90 minutes). We included the figure in Author response image 1 for quick comparison.

      Author response image 1.

      (4) Throughout the paper the authors claim that the proteins are degraded by a cytosolic proteasome. I agree that the proteins are degraded via the proteasome, but I don't see any evidence that it is cytosolic.

      Thank you for pointing this out. We’ve adjusted the text to reflect the fact that the proteasomal degradation is not necessarily in the cytosol.

    1. Author response:

      The following is the authors’ response to the original reviews.

      Reviewer #1 (Public Reviews):  

      First, the metabolic network in this study is incomplete. For example, amino acid synthesis and lipid synthesis are important for biomass and growth, but they are not included in the three models used in this study. NADH and NADPH are as important as ATP/ADP/AMP, but they are not included in the models. In the future, a more comprehensive metabolic and biosynthesis model is required.

      Thank you for the critical comment on the weakness of the present study. We actually tried to study a larger model like Turnborg et al (2021), which is a model of JCVI-syn3A, but we gave up on including it in the list of models studied in depth. This is because we noticed that the concentration of ATP in the model can be negative (we confirmed this with one of the authors of the paper). Another "big" kinetic model of metabolism that we could list would be Khodayari et al (2017). However, we could not find other models of comparable size with which to compare its dynamics. Therefore, we decided to use the model only for the central carbon metabolism for now. We would like to leave a more extended study for the near future.

      We would like to mention that NADH and NADPH are included in the Khodayari model and the Boecker model, although NADH and NADPH are lumped together as NADH in the latter model.

      Second, this work does not provide a mathematical explanation of the perturbation response χ. Since the perturbation analysis is performed close to the steady state (or at least belongs to the attractor of single-steady-state), local linear analysis would provide useful information. By complementing with other analysis in dynamical systems (described below) we can gain more logical insights about perturbation response.  

      We tried a linear stability analysis. However, with the perturbation strength we used here, the linearization of the model is no longer valid, in the sense that the linearized model leads to negative concentrations of the metabolites (x_st + Δx < 0 for some metabolites). We have added a scatter plot comparing the response coefficients of trajectories that share an initial condition, with the dynamics computed by the original model and by the linearized model, respectively (Fig. S1).

      Since the response coefficient is based on the logarithm of the concentrations, as the metabolite concentrations approach zero, the response coefficient becomes larger. The high response coefficient in the Boecker and Chassagnole model would be explained by this artifact.  The linearized Khodayari model shows either χ~1 or χ = 0 (one or more metabolite concentrations become negative). This could be due to the number of variables in the model. For the response coefficient to have a larger value, the perturbation should be along the eigenvector that leads to oscillatory dynamics with long relaxation time (i.e., the corresponding eigenvalue has a small real part in terms of absolute value and a non-zero imaginary part). However, since the Khodayari model has about 800 variables, if perturbations are along such directions, there is a high probability that one or more metabolite concentrations will become negative.

      We fully agree that if the perturbations on the metabolite concentrations are in the linear regime, the response to the perturbations can be estimated by checking the eigenvalues and eigenvectors. However, we would say that the relationship between the linearized model (and thus the spectrum of eigenvalues) and the original model is unclear in this regime. We remarked on this in Lines 158-160.
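
      The way a linearized model can produce negative concentrations under a large perturbation can be illustrated with a toy system. The sketch below is not any of the models analysed in the manuscript: the Jacobian M, the steady state and the 50% perturbation are hypothetical, and the "nonlinear" model is a deliberately simple one in which the log-ratios ln(x / x_st) follow linear dynamics, so that it stays positive by construction while sharing the Jacobian M at the steady state.

```python
import math

# Hypothetical 2x2 Jacobian at the steady state (illustrative values only).
# Its eigenvalues are -0.1 +/- 1j, i.e. a stable, slowly damped spiral.
M = [[-0.1, 10.0],
     [-0.1, -0.1]]
X_ST = [1.0, 1.0]  # assumed steady-state concentrations


def matvec(m, v):
    """Matrix-vector product for small dense lists."""
    return [sum(m_ij * v_j for m_ij, v_j in zip(row, v)) for row in m]


def rk4_linear(m, v0, dt, steps):
    """Integrate dv/dt = m v with classical RK4 and return the whole trajectory."""
    v = list(v0)
    traj = [v]
    for _ in range(steps):
        k1 = matvec(m, v)
        k2 = matvec(m, [a + 0.5 * dt * b for a, b in zip(v, k1)])
        k3 = matvec(m, [a + 0.5 * dt * b for a, b in zip(v, k2)])
        k4 = matvec(m, [a + dt * b for a, b in zip(v, k3)])
        v = [a + dt / 6.0 * (p + 2.0 * q + 2.0 * r + s)
             for a, p, q, r, s in zip(v, k1, k2, k3, k4)]
        traj.append(v)
    return traj


x0 = [1.0, 1.5]  # shared initial condition: second metabolite raised by 50%

# Linearized model: the deviation delta = x - x_st obeys d(delta)/dt = M delta.
delta0 = [a - b for a, b in zip(x0, X_ST)]
x_linear = [[s + d for s, d in zip(X_ST, delta)]
            for delta in rk4_linear(M, delta0, dt=0.01, steps=5000)]

# Toy nonlinear model: z = ln(x / x_st) obeys dz/dt = M z, so x = x_st * exp(z) > 0.
z0 = [math.log(a / b) for a, b in zip(x0, X_ST)]
x_nonlinear = [[s * math.exp(z_i) for s, z_i in zip(X_ST, z)]
               for z in rk4_linear(M, z0, dt=0.01, steps=5000)]

min_linear = min(min(x) for x in x_linear)
min_nonlinear = min(min(x) for x in x_nonlinear)
print("lowest concentration, linearized model:", min_linear)
print("lowest concentration, toy nonlinear model:", min_nonlinear)
```

      In this toy case the linearized trajectory dips below zero while the nonlinear one approaches zero without crossing it, which is the regime in which any quantity based on the logarithm of the concentrations can no longer be evaluated for the linearized model.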

      Recommendations for the authors:

      My major suggestion is about understanding the key quantity in this study: the response coefficient χ. When the perturbed state is close to the fixed point, one could adopt local stability analysis and consider the linearized system. For a linear system with one stable fixed point P, we consider the Jacobian matrix M on P. If all eigenvalues of M are real and negative, the perturbed trajectory will return to P with each component monotonically varies. If some eigenvalues have negative real part and nonzero imaginary part, then the perturbed trajectory will spiral inward to the fixed point. Depending on the spiral trajectory and the initially perturbed state, some components would deviate furthermore (transiently) from the fixed point on the spiral trajectory. This explains why the response coefficient χ can be greater than 1. 

      Mathematically, a locally linearized system has similar behavior to the linear system, and the examples in this study can be analyzed in the similar way. Specifically, if a system has many complex eigenvalues, then the perturbed trajectory is more likely to have further deviation. The metabolic network models investigated in this work are not extremely large, and hence the author could analyze its spectrum of the Jacobian matrix at the steady state. Since the steady state is stable, I expect the spectrum located in the left half of the complex plane. If the spectrum spread out away from the real axis, we expect to see more spiral trajectories under perturbation. I think the spectrum analysis will provide a complementary view with respect to analysis on χ.  The authors' major findings, about the network sparsity and cofactors, can also be investigated under the framework of the spectrum analysis.  
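
      The spectrum argument above can be sketched numerically for a two-variable linear system. Everything in the snippet is an illustrative assumption: the two matrices are hypothetical, and the "peak amplification" (largest distance from the fixed point along the trajectory, divided by the initial distance) is only a stand-in for a response measure, not the definition of χ used in the manuscript.

```python
import cmath
import math


def eigenvalues_2x2(m):
    """Eigenvalues of a 2x2 matrix, computed from its trace and determinant."""
    tr = m[0][0] + m[1][1]
    det = m[0][0] * m[1][1] - m[0][1] * m[1][0]
    root = cmath.sqrt(tr * tr - 4.0 * det)
    return (tr + root) / 2.0, (tr - root) / 2.0


def peak_amplification(m, v0, dt=0.01, steps=5000):
    """Largest distance from the fixed point (RK4 trajectory) over the initial distance."""
    def rhs(v):
        return [m[0][0] * v[0] + m[0][1] * v[1],
                m[1][0] * v[0] + m[1][1] * v[1]]

    norm0 = math.hypot(v0[0], v0[1])
    v, peak = list(v0), norm0
    for _ in range(steps):
        k1 = rhs(v)
        k2 = rhs([a + 0.5 * dt * b for a, b in zip(v, k1)])
        k3 = rhs([a + 0.5 * dt * b for a, b in zip(v, k2)])
        k4 = rhs([a + dt * b for a, b in zip(v, k3)])
        v = [a + dt / 6.0 * (p + 2.0 * q + 2.0 * r + s)
             for a, p, q, r, s in zip(v, k1, k2, k3, k4)]
        peak = max(peak, math.hypot(v[0], v[1]))
    return peak / norm0


# Hypothetical Jacobians: a damped spiral and a node with real eigenvalues.
spiral = [[-0.1, 10.0], [-0.1, -0.1]]
node = [[-0.1, 0.0], [0.0, -0.2]]

spiral_eigs = eigenvalues_2x2(spiral)
node_eigs = eigenvalues_2x2(node)
spiral_amp = peak_amplification(spiral, [0.0, 1.0])
node_amp = peak_amplification(node, [0.0, 1.0])

print("spiral eigenvalues:", spiral_eigs, "peak amplification:", spiral_amp)
print("node eigenvalues:", node_eigs, "peak amplification:", node_amp)
```

      For the spiral the perturbed state transiently moves further from the fixed point than where it started, whereas for the diagonal node the distance only decays. The size of the overshoot depends on the eigenvectors as well as the eigenvalues (here, on the strongly unequal off-diagonal couplings), so complex eigenvalues by themselves do not fix how large the transient deviation will be.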

      Of course, when the nonlinear system is perturbed far away from the fixed point, there are other geometrical properties of the vector field that can cause the response coefficient χ to be greater than 1. This could also be investigated in the future by testing the behavior of small and large perturbations and observing if the systems have signatures of nonlinearity.  

      Since all perturbed states return to the steady state, the eigenvalues of the Jacobian matrix of the system linearized around the steady state lie in the left half of the complex plane (negative real parts). Also, some eigenvalues have non-zero imaginary parts.

      The reason we emphasize the "nonlinear regime" is that the linearization is no longer valid in this regime, i.e. the metabolite concentrations can become negative when we simulate the linearized system. Certainly, the Jacobian matrix of every model has complex eigenvalues. However, we would say that there is no clear relationship between the eigenvalues and the response coefficient.
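The reviewer's linear argument can be illustrated with a minimal sketch. The 2x2 matrix below is a hypothetical Jacobian chosen for illustration and is not taken from any of the three kinetic models; the per-component ratio it computes is only loosely analogous to the response coefficient χ, whose exact definition is given in the manuscript rather than here.

```python
import numpy as np

# Hypothetical Jacobian (illustration only): eigenvalues are -0.1 +/- 1j,
# so the fixed point at the origin is a stable spiral.
M = np.array([[-0.1, 1.0],
              [-1.0, -0.1]])

def linear_trajectory(M, x0, times):
    """Solve dx/dt = M x exactly using the eigendecomposition of M."""
    eigvals, eigvecs = np.linalg.eig(M)
    coeffs = np.linalg.solve(eigvecs, x0.astype(complex))
    # x(t) = sum_k c_k * exp(lambda_k * t) * v_k
    traj = (eigvecs * coeffs) @ np.exp(np.outer(eigvals, times))
    return traj.real.T  # shape: (len(times), n_variables)

eigvals = np.linalg.eigvals(M)
times = np.linspace(0.0, 20.0, 2001)
x0 = np.array([1.0, 0.1])  # initial deviation from the fixed point
traj = linear_trajectory(M, x0, times)

# Largest transient deviation of each component relative to its initial one.
amplification = np.abs(traj).max(axis=0) / np.abs(x0)
```

In this toy case the second component moves several times further from the fixed point than where it started, even though every eigenvalue has a negative real part and the trajectory as a whole decays. This shows only the linear mechanism and says nothing about the nonlinear regime that the authors emphasize.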

      Minor suggestions:  

      Line 127: Regarding the source of perturbation, cell division also generates unequal concentration of proteins and metabolites for two daughter cells, and it is an interesting mechanism to create metabolic perturbation. 

      Thank you for the insightful suggestion. We now mention cell division as another source of perturbation (Lines 130-131).

      Line 175: I do not quite understand the statement "fixing each metabolite concentration...", since the metabolite concentration in the ODE simulation would change immediately after this fixing.  

      We meant that we fixed the concentration of the selected metabolite at its steady-state value and set dx/dt of that metabolite to zero. We have rewritten the sentences to avoid confusion (Lines 180-181).
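The clamping procedure described above can be sketched as follows. The two-metabolite system, the forward-Euler integrator, and all function names are hypothetical choices for illustration; they do not come from the Boecker, Chassagnole, or Khodayari models.

```python
import numpy as np

STEADY_STATE = np.array([1.0, 1.0])  # fixed point of the toy system below

def rhs(x):
    """Toy two-metabolite dynamics (hypothetical, for illustration only)."""
    a, b = x
    return np.array([1.0 - a * b, a * b - b])

def clamped_rhs(x, clamp_index):
    """Same dynamics, but the chosen metabolite is held fixed (dx/dt = 0)."""
    dxdt = rhs(x)
    dxdt[clamp_index] = 0.0
    return dxdt

def integrate(f, x0, dt=0.01, steps=5000):
    """Forward-Euler integration; adequate for this small non-stiff toy."""
    x = np.array(x0, dtype=float)
    for _ in range(steps):
        x = x + dt * f(x)
    return x

# Perturb metabolite 0 while metabolite 1 is clamped at its steady-state value.
perturbed = np.array([1.5, STEADY_STATE[1]])
x_clamped = integrate(lambda x: clamped_rhs(x, 1), perturbed)
x_free = integrate(rhs, perturbed)
```

Because the clamped variable starts at its steady-state value and its derivative is zeroed, it never changes during the simulation, which is the behavior the reply describes.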

      Figure 2: There are a lot of inconsistencies between the three models. Could we learn which model is more reasonable, or the conclusion here is that the cellular response under perturbation is model-specific? The latter explanation may not be quite satisfactory since we expect the overall cellular property should not be sensitive to the model details. 

      Ideally, the overall cellular property should be insensitive to model details. However, the reality is that the behavior of the models (e.g., steady-state properties, relaxation dynamics, etc.) depends on the specific parameter choices, including which regulation is implemented. I think this situation is part of the motivation for the ensemble modeling approach developed by J. Liao and colleagues.

      Detailed responsiveness would be model specific. For example, FBP has a fairly strong effect in the Boecker model, but less so in the Khodayari model, and the opposite effect in the Chassagnole model (Fig. 2). Our question was whether there are common tendencies among kinetic models that tend to show model-specific behavior.  

      Reviewer #2 (Public Review):

      (1) In the study on determining key metabolites affecting responses to perturbations (starting from line 171), the authors fix the values of individual concentrations to their steady-state values and observe the responses. Such a procedure adds artificial constraints to the network because, in the natural responses of cells (and models) to perturbations, it is highly unlikely that metabolites will not evolve in time. By fixing the values of specific metabolites, the authors prohibit the metabolic network from evolving in the most optimal way to compensate for the perturbation. Instead of this procedure, have the authors considered for this task applying techniques from variance-based sensitivity analysis (Sobol, global sensitivity analysis), where they can calculate the first-order sensitivity index and total effect index? Using this technique, the authors would be able to determine the key metabolites while allowing for metabolic responses to perturbations without unnatural constraints. 

      Thank you for the useful suggestion for studying the role of each metabolite in responsiveness. We have computed the total sensitivity index (Homma and Saltelli, 1996) for each metabolite of each model (Fig. S5). The total sensitivity index of ATP ranks high in the Khodayari and Chassagnole models, while it ranks in the middle in the Boecker model. We believe that the importance of the adenyl cofactors is also highlighted by the Sobol’ sensitivity analysis (the figure is referred to in Lines 193-195).

      We encountered a minor difficulty in computing the sensitivity index. The computation requires the following Monte Carlo integral,

      where the superscript (m) is the sample number index. The subscript i represents the ith element of the vector x, and ~i represents the vector x except for the ith element. The tilde stands for resampling.  
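For orientation, a total sensitivity index can be estimated by resampling one input at a time, as in the minimal sketch below. It uses the Jansen form of the estimator, which is an assumption made here for numerical robustness and is not necessarily the expression the authors used. The uniform inputs, the toy model, and all names are hypothetical.

```python
import numpy as np

def total_sensitivity(f, n_inputs, n_samples=20000, seed=0):
    """Monte Carlo estimate of Sobol' total-effect indices (Jansen form).

    For input i, only x_i is resampled while all other inputs (x_~i) keep
    their values from the base sample.
    """
    rng = np.random.default_rng(seed)
    A = rng.random((n_samples, n_inputs))  # base sample x^(m)
    B = rng.random((n_samples, n_inputs))  # independent resample
    fA = f(A)
    variance = fA.var()
    indices = np.empty(n_inputs)
    for i in range(n_inputs):
        AB = A.copy()
        AB[:, i] = B[:, i]  # resample x_i only, keep x_~i
        indices[i] = 0.5 * np.mean((fA - f(AB)) ** 2) / variance
    return indices

def toy_model(X):
    """Additive test function with known indices: 0.2 and 0.8."""
    return X[:, 0] + 2.0 * X[:, 1]

estimated = total_sensitivity(toy_model, n_inputs=2)
```

For this additive toy function with independent uniform inputs, the analytic total-effect indices are 1/5 and 4/5, which gives a simple check on the estimator. Resampling one input independently is only valid when the inputs are not tied together by a conservation law, which is the issue discussed next.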

      There are several conserved quantities in each model. For independent resampling, we need to deal with the conserved quantities. For the Boecker and Chassagnole models, we picked a single metabolite from each conservation law and solved its concentration algebraically to make the metabolite concentration the dependent variable. Then, we can resample the metabolite concentration of one metabolite without changing the concentrations of other metabolites, which are independent variables.  

      However, in the Khodayari model, it was difficult to solve for the dependent variables because the model has about 800 variables. Therefore, we did not compute the sensitivity indices of the metabolites whose concentrations enter any conserved quantity, namely NAD, NADH, NADP, NADPH, Q8, and Q8H2.

      (2) To follow up on the previous remark, the authors state that the metabolites that augment the response coefficient when their concentration is fixed tend to be allosteric regulators. The authors should report which allosteric regulations are implemented in each of the models so that one can compare against Figure 2. Again, the effect of allosteric regulation by a specific metabolite that is quantified the way the authors did is biased by fixing the concentration value - it is true that negative feedback is broken when the metabolite concentration is fixed, however, in the rate law, there is still the fixed inhibition term with its value corresponding to the inhibition at the steady state. To see the effect of allosteric regulation by a metabolite, one can change the inhibition constants instead of constraining the responses with fixed concentrations.  

      We have listed the substrate-level regulations (Tables S1-3). Also, instead of fixing the concentrations, we re-ran the simulations with the effect of the substrate-level regulations reduced for the reactions suspected to influence the change in the response coefficient (Fig. S6).

      The impact of substrate-level regulations is discussed in Lines 203-212.   

      We replaced "allosteric regulation" with "substrate-level regulation" because we noticed that some regulations are not necessarily allosteric.

      (3) Given the role of ATP in metabolic processes, the authors' finding of the sensitivity of the three networks' responses to perturbations in the AXP concentrations seems reasonable. However, drawing such firm conclusions from only three models, with each of them built around one steady state and having one kinetic parameter set despite that they were built for different physiologies, raises some questions. It is well-known in studies related to basins of attraction of the steady states that the nonlinear responses also depend on the actual steady states, the values of kinetic parameters, and implemented kinetic rate law, i.e., not only on the topology of the underlying systems. In the population of only three models, we cannot exclude the possibility of overlaps and strong similarities in the values of kinetic parameters, steady states, and enzyme saturations that all affect and might bias the observed responses. Ideally, to eliminate the possibility of such biases, one should simulate responses of a large population of models for multiple physiologies (and the corresponding steady states) and multiple parameter sets per physiology. This can be a difficult task, but having more kinetic models in this work would go a long way toward more convincing results. Recently, E. coli nonlinear kinetic models from several groups appeared that might help in this task, e.g., Haiman et al., PLoS Comput Biol, 17(1): e1008208, (2021), Choudhury et al., Nat Mach Intell, 4, 710-719, (2022); Hu et al., Metab Eng, 82, 123-133 (2024), Narayanan et al., Nat Commun, 15:723, (2024). 

      We have computed the responsiveness of 215 models generated by the MASSpy package (Haiman et al., 2021). Several model realizations showed strong responsiveness, i.e. a broader distribution of the response coefficient (Fig. S8); this is mentioned in Lines 339-341.

      We would like to mention that the three models studied in the present manuscript have limited overlap in terms of kinetic rate laws and, accordingly, parameter values. In the Khodayari model, all reactions are bi-uni or uni-uni reactions implemented with mass-action kinetics, while the Boecker and Chassagnole models use generalized Michaelis-Menten type rate laws. Also, the relationship between the response coefficients of the original model and the linearized model highlights the differences between the models (Fig. S1). If the models were effectively similar, the scatter plots of the response coefficients of the original and linearized models should look similar among the three models. However, the three panels show completely different trends. Thus, the three models have little similarity even when they are linearized around the steady states.

      (4) Can the authors share their insights on what could be the underlying reasons for the bimodal distribution in Figure 1E? Even after adding random reactions, the distribution still has two modes - why is that?  

      We have not yet resolved why only the Khodayari model shows the bimodal distribution of the response coefficient. However, judging from the time courses, the dynamics of the Khodayari model resemble those of excitable systems. This feature may contribute to the bimodal distribution of the response coefficient. In the future, we would like to determine whether the system is indeed an excitable system and which reactions contribute to such dynamics.

      (5) Considering the effects of the sparsity of the networks on the perturbation responses (from line 223 onwards), when we compare the three analyzed models, it is clear that the Khodayari et al. model is a superset of the other two models. Therefore, this model can be considered as, e.g., the Chassagnole model with Nadd reactions (though not randomly added). Based on Figures 1b and S2, one can observe that the Khodayari model shows stronger responses, which is exactly opposite to the authors' conclusion that adding the reactions weakens the responses.

      The authors should comment on this.  

      The sparsity of the network is defined by the ratio of the number of metabolites to the number of reactions. Note that the Khodayari model is a superset of the Boecker and Chassagnole models in terms of the number of reactions, but also in terms of the number of metabolites (Boecker does not have the pentose phosphate pathway, Chassagnole does not have the TCA cycle, and neither has oxidative phosphorylation). Thus, even if we manually add reactions to the Boecker model, for example, we cannot obtain a network that is equivalent to the Khodayari model. We added one sentence to clarify the point (Lines 254-255).

      Recommendations for the authors: 

      (1) Some typos: Line 57, remove ?; Line 134, correct "relaxation". 

      Thank you for pointing these out. We fixed the typos.

      (2) Lines 510-515, please rewrite/clarify, it is confusing what are you doing. 

      We rewrote the sentences (Lines 529-532). We are sorry for the confusion.

      (3) Line 522, where are the expressions above Leq and K*? 

      Leq appears in the original paper of the Boecker model, but we decided not to use Leq. We apologize for not removing Leq from the present manuscript. The * in K* is the wildcard for representing the subscripts. We added a description for the role of “*”. 

      (4) Lines 525-530, based on the wording, it seems like you test first for 128 initial concentrations if the models converge back to the steady state and then you generate another set of 128 initial concentrations - is this what you are doing, or you simply use the 128 initial concentrations that have passed the test? 

      We apologize for the confusion. We did the former. We have rewritten the sentence to make it clearer.

      (5) Figure 3, caption, by "broken line," did the authors mean "dashed line"? 

      We meant dashed line. We changed “broken line” to “dashed line”.

    1. Author response:

      The following is the authors’ response to the current reviews.

      Public Reviews:

      Reviewer #1 (Public review):

      I applaud the authors for providing a thorough response to my comments from the first round of review. The authors have addressed the points I raised on the interpretation of the behavioral results as well as the validation of the model (fit to the data) by conducting new analyses, acknowledging the limitations where required and providing important counterpoints. As a result of this process, the manuscript has considerably improved. I have no further comments and recommend this manuscript for publication.

      We are pleased that our revisions have addressed all the concerns raised by Reviewer #1.

      Reviewer #2 (Public review):

      Summary:

      This manuscript proposes that the use of a latent cause model for assessment of memory-based tasks may provide improved early detection in Alzheimer's Disease as well as more differentiated mapping of behavior to underlying causes. To test the validity of this model, the authors use a previously described knock-in mouse model of AD and subject the mice to several behaviors to determine whether the latent cause model may provide informative predictions regarding changes in the observed behaviors. They include a well-established fear learning paradigm in which distinct memories are believed to compete for control of behavior. More specifically, it's been observed that animals undergoing fear learning and subsequent fear extinction develop two separate memories for the acquisition phase and the extinction phase, such that the extinction does not simply 'erase' the previously acquired memory. Many models of learning require the addition of a separate context or state to be added during the extinction phase and are typically modeled by assuming the existence of a new state at the time of extinction. The Niv research group, Gershman et al. 2017, have shown that the use of a latent cause model applied to this behavior can elegantly predict the formation of latent states based on a Bayesian approach, and that these latent states can facilitate the persistence of the acquisition and extinction memory independently. The authors of this manuscript leverage this approach to test whether deficits in production of the internal states, or the inference and learning of those states, may be disrupted in knock-in mice that show both a build-up of amyloid-beta plaques and a deterioration in memory as the mice age.

      Strengths:

      I think the authors' proposal to leverage the latent cause model and test whether it can lead to improved assessments in an animal model of AD is a promising approach for bridging the gap between clinical and basic research. The authors use a promising mouse model and apply this to a paradigm in which the behavior and neurobiology are relatively well understood - an ideal situation for assessing how a disease state may impact both the neurobiology and behavior. The latent cause model has the potential to better connect observed behavior to underlying causes and may pave a road for improved mapping of changes in behavior to neurobiological mechanisms in diseases such as AD.

      The authors also compare the latent cause model to the Rescorla-Wagner model and a latent state model allowing for better assessment of the latent cause model as a strong model for assessing reinstatement.

      Weaknesses:

      I have several substantial concerns which I've detailed below. These include important details on how the behavior was analyzed, how the model was used to assess the behavior, and the interpretations that have been made based on the model.

      (1) There is substantial data to suggest that during fear learning in mice separate memories develop for the acquisition and extinction phases, with the acquisition memory becoming more strongly retrieved during spontaneous recovery and reinstatement. The Gershman paper, cited by the authors, shows how the latent causal model can predict this shift in latent causes by allowing for the priors to decay over time, thereby increasing the posterior of the acquisition memory at the time of spontaneous recovery. In this manuscript, the authors suggest a similar mechanism of action for reinstatement, yet the model does not appear to return to the acquisition memory after reinstatement, at least based on the simulation and examples shown in figures 1 and 3. More specifically, in figure 1, the authors indicate that the posterior probability of the latent cause, z<sub>A</sub> (the putative acquisition memory), increases, partially leading to reinstatement. This does not appear to be the case as test 3 (day 36) appears to have similar posterior probabilities for z<sub>A</sub> as well as similar weights for the CS as compared to the last days of extinction. Rather, the model appears to mainly modify the weights in the most recent latent cause, z<sub>B</sub> - the putative 'extinction state', during reinstatement. The authors suggest that previous experimental data have indicated that spontaneous recovery or reinstatement effects are due to an interaction of the acquisition and extinction memory. These studies have shown that conditioned responding at a later time point after extinction is likely due to a balance between the acquisition memory and the extinction memory, and that this balance can shift towards the acquisition memory naturally during spontaneous recovery, or through artificial activation of the acquisition memory or inhibition of the extinction memory (see Lacagnina et al. for example).
      Here the authors show that the same latent cause learned during extinction, z<sub>B</sub>, appears to dominate during the learning phase of reinstatement, with rapid learning to the context - the weight for the context goes up substantially on day 35 - in z<sub>B</sub>. This latent cause, z<sub>B</sub>, dominates at the reinstatement test, and due to the increased associative strength between the context and shock, there is a strong CR. For the simulation shown in figure 1, it's not clear why a latent cause model is necessary for this behavior. This leads to the next point.

      We would like to first clarify that our behavioral paradigm did not last for 36 days, as noted by the reviewer. Our reinstatement paradigm contained 7 phases and 36 trials in total: acquisition (3 trials), test 1 (1 trial), extinction 1 (19 trials), extinction 2 (10 trials), test 2 (1 trial), unsignaled shock (1 trial), test 3 (1 trial). The day is labeled under each phase in Figure 2A. 

      We explained how the latent cause model accounts for reinstatement in the first round of the review. Briefly, both acquisition and extinction latent causes contribute to the reinstatement (test 3). The former retains the acquisition fear memory, and the latter has the updated w<sub>context</sub> from the unsignaled shock. Although the reviewer is correct that z<sub>B</sub> in Figure 1D makes a large contribution during the reinstatement, we would like to argue that the elevated CR from test 2 (trial 34) to test 3 (trial 36) is the result of the interaction between z<sub>A</sub> and z<sub>B</sub>.

      We provide Author response image 1, which uses the same data as Figure 1D and 1E, to further clarify this point. The posterior probability of z<sub>A</sub> increased after the unsignaled shock (trial 35), which may be attributed to the return of the acquisition fear memory. The posterior probability of z<sub>A</sub> then decreased again after test 3 (trial 36) because there was no shock in this trial. Along with the weight change, the expected shock changes substantially over these three trials, resulting in reinstatement. Note that the mapping of expected shock to CR in the latent cause model is controlled by the parameters θ and λ. Once the expected shock exceeds the threshold θ, the CR increases more rapidly when λ is smaller.
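The role of θ and λ can be made concrete with a minimal sketch of a probit-style mapping from expected shock to CR. Here λ is assumed to act as the scale (standard deviation) of a normal cumulative distribution centered on θ; the exact parameterization in the latent cause model code may differ, and the function name is hypothetical.

```python
import math

def conditioned_response(expected_shock, theta, lam):
    """Normal-CDF mapping of expected shock V to CR: Phi((V - theta) / lam).

    Assumption: lam is used as a standard deviation, so a smaller lam gives
    a steeper transition around the threshold theta.
    """
    z = (expected_shock - theta) / lam
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

# Same expected shock, just above a threshold of 0.05, with two slopes.
cr_steep = conditioned_response(0.1, theta=0.05, lam=0.01)
cr_shallow = conditioned_response(0.1, theta=0.05, lam=0.1)
```

Under this assumption the CR equals 0.5 exactly at the threshold, and a smaller λ pushes the CR toward 1 above the threshold and toward 0 below it.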

      Lastly, if one accepts the idea that separate memories are responsible for acquisition and extinction in the memory modification paradigm, the latent cause model (LCM) is a rational candidate for modeling this idea. Please see the following reply on why a simple model like the Rescorla-Wagner (RW) model is not sufficient to fully explain the behaviors observed in this study.

      Author response image 1.

      The summed posterior probability (A), the summed associative weight of the CS (B), and the summed associative weight of the context (C) of the acquisition and extinction latent causes in Figure 1D and 1E.

      (2) The authors compared the latent cause model to the Rescorla-Wagner model. This is very commendable, particularly since the latent cause model builds upon the RW model, so it can serve as an ideal test for whether a more simplified model can adequately predict the behavior. The authors show that the RW model cannot successfully predict the increased CR during reinstatement (Appendix figure 1). Yet there are some issues with the way the authors have implemented this comparison:

      (2A) The RW model is a simplified version of the latent cause model and so should be treated as a nested model when testing, or at a minimum, the number of parameters should be taken into account when comparing the models using a method such as the Bayesian Information Criterion, BIC.

      We acknowledge that the number of parameters was not taken into consideration when we compared the models. We thank the reviewer for the suggestion to use the Bayesian Information Criterion (BIC). However, we did not use BIC in this study for the following reasons. We wanted a model that can explain fear conditioning, extinction and reinstatement, so our first priority was to fit the test phases. Models that simulate CRs well in non-test phases can yield lower BIC values even if they fail to capture reinstatement. When we calculate the BIC by using the half-normal distribution (μ = 0, σ = 0.3) as the likelihood for the prediction error in each trial, the BIC of the 12-month-old control is -37.21 for the RW model (Appendix 1–figure 1C) and -11.60 for the LCM (Figure 3C). Based on this result, the RW model would be preferred, because the LCM is penalized for its number of parameters even though it fits better in trial 36. Because this does not align with our purpose of modeling reinstatement, we chose to rely on practical criteria to determine whether an estimated parameter set is accepted (see Materials and Methods). The number of accepted samples can thus roughly be seen as the model's ability to explain the data in this study. These exclusion criteria then created imbalances in accepted samples across models (Appendix 1–figure 2). In the RW model, only one or two samples met the criteria, preventing meaningful statistical comparisons of BIC within each group. Overall, although we agree that BIC is a reasonable metric for model comparison, we do not think it aligns with our purpose in this study.
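The BIC calculation described above can be sketched as follows, assuming the standard definition BIC = k ln(n) - 2 ln(L) and a half-normal likelihood applied to the absolute prediction error of each trial. The function names and the example error values are hypothetical, and the sketch does not reproduce the reported BIC values.

```python
import math

def half_normal_logpdf(x, sigma):
    """Log density of a half-normal distribution (location 0) at x >= 0."""
    if x < 0:
        raise ValueError("half-normal density is defined for x >= 0 only")
    return (0.5 * math.log(2.0 / math.pi) - math.log(sigma)
            - x * x / (2.0 * sigma * sigma))

def bic(abs_errors, n_params, sigma=0.3):
    """BIC = k * ln(n) - 2 * ln(L), with n the number of trials."""
    log_lik = sum(half_normal_logpdf(e, sigma) for e in abs_errors)
    return n_params * math.log(len(abs_errors)) - 2.0 * log_lik

# Hypothetical absolute prediction errors over 36 trials.
errors = [0.1] * 36
bic_few_params = bic(errors, n_params=3)    # e.g. three RW parameters
bic_many_params = bic(errors, n_params=10)  # hypothetical larger model
```

With identical errors, the model with more parameters always receives the higher (worse) BIC, which is the penalty the reply refers to. A richer model is preferred only if its gain in total log-likelihood across all trials outweighs that penalty, regardless of which trials the gain comes from.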

      (2B) The RW model provides the associative strength between stimuli and does not necessarily require a linear relationship between V and the CR. This is the case in the original RW model as well as in the LCM. To allow for better comparison between the models, the authors should be modeling the CR in the same manner (using the same probit function) in both models. In fact, there are many instances in which a sigmoid has been applied to RW associative strengths to predict CRs. I would recommend modeling CRs in the RW as if there is just one latent cause. Or perhaps run the analysis for the LCM with just one latent cause - this would effectively reduce the LCM to RW and keep any other assumptions identical across the models.

      Regarding the suggestion to run the analysis using the LCM with one latent cause, we agree that this method is almost identical to the RW model, which is also mentioned in the original paper (Gershman et al., 2017). Importantly, it would also eliminate the RW model’s advantage of assigning distinct learning rates to different stimuli, highlighted in the next comment (2C).

      We thank the reviewer for suggesting that we apply the same transformation of associative strength (V) to CR as in the LCM. We examined this possibility by heuristically selecting parameter values to test how such a transformation would influence the RW model (Author response image 2A). Specifically, we set α<sub>CS</sub> = 0.5, α<sub>context</sub> = 1, β = 1, and introduced the additional parameters θ and λ, as in the LCM. This parameter set was chosen heuristically to address the reviewer’s concern about a higher learning rate for the context. The dark blue line is the plain associative strength. The remaining lines are CR curves under different combinations of θ and λ.

      Consistent with the reviewer’s comment, under certain parameter settings (θ = 0.01, λ = 0.01), the extended RW model can reproduce higher CRs at test 3, thereby approximating the discrimination index observed in the 12-month-old control group. However, this modification changes the characteristics of CRs in other phases from those in the plain RW model. In the acquisition phase, the CRs rise more sharply. In the extinction phase, the CRs remain high when θ is small. Though changing λ can modulate the steepness, the CR curve is flat on the second day of the extinction phase, which does not reproduce the pattern in the observed data (Figure 2B). These trade-offs suggest that the RW model with the sigmoid transformation does not improve fit quality and, in fact, sacrifices features that were well captured by simpler RW simulations (Appendix 1–figure 1A to 1D). To further evaluate this extended RW model (RW*), we applied the same parameter estimation method used in the LCM for individual data (see Materials and Methods). For each animal, α<sub>CS</sub>, α<sub>context</sub>, β, θ, and λ were estimated with their lower and upper bounds set as previously described (see Appendix 1, Materials and Methods). The results showed that the number of accepted samples slightly increased compared to the RW model without sigmoidal transformation of CR (RW* vs. RW in Author response image 2B, 2C). However, this improvement did not surpass the LCM (RW* vs. LCM in Author response image 2B, Author response image 1C). Overall, these results suggest that even when the same method is used to map the expected shock to CR, the RW model does not outperform the LCM. Practically, further extension, such as adding novel terms, might improve the fit. We would like to note that such extensions should be carefully validated as reasonable and necessary for an internal model, which is beyond the scope of this study.
      We hope this addresses the reviewer's concerns about the implementation of the RW model.

      Author response image 2.

      Simulation (A) and parameter estimation (B and C) in the extended Rescorla-Wagner model.

      (2C) In the paper, the model fits for the alphas in the RW model are the same across the groups. Were the alphas for the two models kept as free variables? This is an important question as it gets back to the first point raised. Because the modeling of the reinstatement behavior with the LCM appears to be mainly driven by latent cause z<sub>B</sub>, the extinction memory, it may be possible to replicate the pattern of results without requiring a latent cause model. For example, the 12-month-old App NL-G-F mice may have a deficit in learning about the context. Within the RW model, if the alpha for context is set to zero for those mice, but kept higher for the other groups, say alpha_context = 0.8, the authors could potentially observe the same pattern of discrimination indices in figure 2G and 2H at test. Because the authors don't explicitly state which parameters might be driving the change in the DI, the authors should show in some way that their results cannot simply be due to poor contextual learning in the 12-month-old App NL-G-F mice, as this can presumably be predicted by the RW model. The authors' model fits using RW don't show this, but this is because they don't consider the possibility that the alpha for context might be disrupted in the 12-month-old App NL-G-F mice. Of course, the RW model with these alphas will not fit the behavior across acquisition, extinction, and reinstatement as well as the authors' LCM, but the number of parameters is substantially reduced in the RW model. Yet the important pattern of the DI would be replicated with the RW model (if I'm not mistaken), which is the important test for assessment of reinstatement.

      We would like to clarify that we estimated three parameters in the RW model for each individual: α<sub>CS</sub>, α<sub>context</sub>, and β. Even so, many samples did not satisfy our criteria (Appendix 1–figure 2). Please refer to the “Evaluation of model fit” in Appendix 1 and the legend of Appendix 1–figure 1A to 1D, where we report the estimated parameter values.

      We do not agree that disabling contextual learning by setting α<sub>context</sub> to 0 in the RW model can explain the CR curve of 12-month-old AD mice well. Specifically, the RW model cannot capture the between-day extinction dynamics (i.e., the increase in CR at the beginning of day 2 extinction) or the higher CR at test 3 relative to test 2 (i.e., a DI between test 3 and test 2 greater than 0.5). In addition, because the context input (= 0.2) was lower than the CS input (= 1), and there is only a single unsignaled shock trial, even setting α<sub>context</sub> = 1 results in only a limited increase in CR (Appendix 1–figure 1A to 1D; see also Author response image 2). Thus, the RW model cannot replicate the reinstatement effect or the critical pattern of the discrimination index, even under conditions of stronger contextual learning.
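The argument about the limited effect of a single unsignaled shock can be checked with a minimal Rescorla-Wagner sketch over the 36-trial schedule. The update rule, in which each weight changes by α·β·(US − V)·x with x the stimulus intensity, is a standard form assumed here for illustration, and it omits the mapping from V to CR. It is not necessarily the exact implementation used in the study, and the function name is hypothetical.

```python
import numpy as np

# Trial schedule taken from the response: 3 acquisition trials, test 1,
# 19 + 10 extinction trials, test 2, one unsignaled shock, test 3 (36 trials).
# Each entry is (CS present, US present).
SCHEDULE = (
    [(1, 1)] * 3      # acquisition: CS paired with shock
    + [(1, 0)] * 31   # test 1, extinction 1 (19), extinction 2 (10), test 2
    + [(0, 1)]        # unsignaled shock: context only
    + [(1, 0)]        # test 3
)

def simulate_rw(alpha_cs, alpha_context, beta=1.0, context_input=0.2):
    """Return the predicted shock V on each trial (before that trial's update)."""
    alphas = np.array([alpha_cs, alpha_context])
    weights = np.zeros(2)  # [CS weight, context weight]
    predictions = []
    for cs, us in SCHEDULE:
        x = np.array([float(cs), context_input])
        v = weights @ x
        predictions.append(v)
        weights += alphas * beta * (us - v) * x  # RW prediction-error update
    return np.array(predictions)

with_context = simulate_rw(alpha_cs=0.5, alpha_context=1.0)
no_context = simulate_rw(alpha_cs=0.5, alpha_context=0.0)
test2, test3 = 33, 35  # zero-based indices of trials 34 and 36
```

Under these assumptions the predicted shock at test 3 rises only to about 0.04 when α<sub>context</sub> = 1, and it does not rise at all when α<sub>context</sub> = 0, which is in line with the limited increase described above.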

      (3) As stated by the authors in the introduction, the advantage of the fear learning approach is that the memory is modified across the acquisition-extinction-reinstatement phases. Although perhaps not explicitly stated by the authors, the post-reinstatement test (test 3) is the crucial test for whether there is reactivation of a previously stored memory, with the general argument being that the reinvigorated response to the CS can't simply be explained by relearning the CS-US pairing, because re-exposure to the US alone leads to an increased response to the CS at test. Of course there are several explanations for why this may occur, particularly when also considering the context as a stimulus. This is what I understood to be the justification for the use of a model, such as the latent cause model, that may better capture and compare these possibilities within a single framework. As such, it is critical to look at the level of responding to both the context alone and to the CS. It appears that the authors only look at the percent freezing during the CS, and it is not clear whether this is due to the contextual-US learning during the US re-exposure or to increased responding to the CS - presumably caused by reactivation of the acquisition memory. The authors do perform a comparison between the preCS and CS period, but it is not clear whether this is taken into account in the LCM. For example, the instance of the model shown in figure 1 indicates that the 'extinction cause', or cause z6, develops a strong weight for the context during the reinstatement phase of presenting the shock alone. This state then leads to increased freezing during the final CS probe test as shown in the figure. If they haven't already, I think the authors must somehow incorporate these different phases (CS vs ITI) into their model, particularly since this type of memory retrieval that depends on assessing latent states is specifically why the authors justified using the latent causal model.
      In more precise terms, it's not clear whether the authors incorporate a preCS/ITI period each day the cue is presented as a vector of just the context in addition to the CS period in which the vector contains both the context and the CS. Based on the description, it seemed to me that they only model the CRs during the CS period on days when the CS is presented, and thereby the context is only ever modeled on its own (as just the context by itself in the vector) on extinction days when the CS is not presented. If they are modeling both timepoints each day that the CS is presented, then I would recommend explicitly stating this in the methods section.

      In this study, we did not model the preCS freezing rate, and we thank the reviewer for the suggestion to model preCS periods as separate context-only trials. In our view, however, this approach is not consistent with the assumptions of the LCM. Our rationale is that the available periods of context and the CS are different. We assume that observation of the context lasts from preCS to CS. If we simulate both preCS (context) and CS (context and tone), the weight of context would be updated twice. Instead, we follow the same method as described in the original code from Gershman et al. (2017) to consider the context effect. We agree that explicitly modeling preCS could provide additional insights, but we believe it would require modifying or extending the LCM. We consider this an important direction for future research, but it is outside the scope of this study.

      (4) The authors fit the model using all data points across acquisition and learning. As one of the other reviewers has highlighted, it appears that there is a high chance for overfitting the data with the LCM. Of course, this would result in much better fits than models with substantially fewer free parameters, such as the RW model. As mentioned above, the authors should use a method that takes into account the number of parameters, such as the BIC.

      Please refer to the reply to public review (2A) for the reason we did not take the suggestion to use BIC. In addition, we feel that we have adequately addressed the concern of overfitting in the first round of the review. 

      (5) The authors have stated that they do not think the Barnes maze task can be modeled with the LCM. Whether or not this is the case, if the authors do not model this data with the LCM, the Barnes maze data doesn't appear valuable to the main hypothesis. The authors suggest that more sophisticated models such as the LCM may be beneficial for early detection of diseases such as Alzheimer's, so the Barnes maze data is not valuable for providing evidence of this hypothesis. Rather, the authors make an argument that the memory deficits in the Barnes maze mimic the reinstatement effects providing support that memory is disrupted similarly in these mice. Although, the authors state that the deficits in memory retrieval are similar across the two tasks, the authors are not explicit as to the precise deficits in memory retrieval in the reinstatement task - it's a combination of overgeneralizing latent causes during acquisition, poor learning rate, over differentiation of the stimuli.

      We would like to clarify that we value the latent cause model not solely because it is more sophisticated and fits more data points, but because it is an internal model that implicates the underlying cognitive process. Please also see the reply to the recommendations to authors (3) for the reason why we did not take the suggestion to remove these data.

      Reviewer #3 (Public review):

      Summary:

      This paper seeks to identify underlying mechanisms contributing to memory deficits observed in Alzheimer's disease (AD) mouse models. By understanding these mechanisms, they hope to uncover insights into subtle cognitive changes early in AD to inform interventions for early-stage decline.

      Strengths:

      The paper provides a comprehensive exploration of memory deficits in an AD mouse model, covering early and late stages of the disease. The experimental design was robust, confirming age-dependent increases in Aβ plaque accumulation in the AD model mice and using multiple behavior tasks that collectively highlighted difficulties in maintaining multiple competing memory cues, with deficits most pronounced in older mice.

      In the fear acquisition, extinction, and reinstatement task, AD model mice exhibited a significantly higher fear response after acquisition compared to controls, as well as a greater drop in fear response during reinstatement. These findings suggest that AD mice struggle to retain the fear memory associated with the conditioned stimulus, with the group differences being more pronounced in the older mice.

      In the reversal Barnes maze task, the AD model mice displayed a tendency to explore the maze perimeter rather than the two potential target holes, indicating a failure to integrate multiple memory cues into their strategy. This contrasted with the control mice, which used the more confirmatory strategy of focusing on the two target holes. Despite this, the AD mice were quicker to reach the target hole, suggesting that their impairments were specific to memory retrieval rather than basic task performance.

      The authors strengthened their findings by analyzing their data with a leading computational model, which describes how animals balance competing memories. They found that AD mice showed somewhat of a contradiction: a tendency to both treat trials as more alike than they are (lower α) and similar stimuli as more distinct than they are (lower σx) compared to controls.

      Weaknesses:

      While conceptually solid, the model struggles to fit the data and to support the key hypothesis about AD mice's inability to retain competing memories. These issues are evident in Figure 3:

      (1) The model misses trends in the data, including the gradual learning of fear in all groups during acquisition, the absence of a fear response at the start of the experiment, and the faster return of fear during reinstatement compared to the gradual learning of fear during acquisition. It also underestimates the increase in fear at the start of day 2 of extinction, particularly in controls.

      (2) The model explains the higher fear response in controls during reinstatement largely through a stronger association to the context formed during the unsignaled shock phase, rather than to any memory of the conditioned stimulus from acquisition (as seen in Figure 3C). In the experiment, however, this memory does seem to be important for explaining the higher fear response in controls during reinstatement (as seen in Author Response Figure 3). The model does show a necessary condition for memory retrieval, which is that controls rely more on the latent causes from acquisition. But this alone is not sufficient, since the associations within that cause may have been overwritten during extinction. The Rescorla-Wagner model illustrates this point: it too uses the latent cause from acquisition (as it only ever uses a single cause across phases) but does not retain the original stimulus-shock memory, updating and overwriting it continuously. Similarly, the latent cause model may reuse a cause from acquisition without preserving its original stimulus-shock association.

      These issues lead to potential overinterpretation of the model parameters. The differences in α and σx are being used to make claims about cognitive processes (e.g., overgeneralization vs. over differentiation), but the model itself does not appear to capture these processes accurately.

      The authors could benefit from a model that better matches the data and captures the retention and retrieval of fear memories across phases. While they explored alternatives, including the Rescorla-Wagner model and a latent state model, these showed no meaningful improvement in fit. This highlights a broader issue: these models are well-motivated but may not fully capture observed behavior.

      Conclusion:

      Overall, the data support the authors' hypothesis that AD model mice struggle to retain competing memories, with the effect becoming more pronounced with age. While I believe the right computational model could highlight these differences, the current models fall short in doing so.

      We thank the reviewer for the insightful comments. For the comments (1) and (2), please refer to our previous author response to comments #26 and #27. We recognize that the models tested in this study have limitations and, as noted, do not fully capture all aspects of the observed behavioral data. We see this as an important direction for future research and value the reviewer’s suggestions.

      Recommendations for the authors:

      Reviewer #2 (Recommendations for the authors):

      I have maintained some of the main concerns included in the first round of reviews as I think they remain concerns with the new draft, even though the authors have included substantially more analysis of their data, which is appreciated. I particularly found the inclusion of the comparative modeling valuable, although I think the analysis comparing the models should be improved.

      (1) This relates to point 1 in the public assessment or #16 in the response to reviewers from the authors. The authors raise the point that even a low posterior can drive behavioral expression (lines 361-365 in the response to authors), and so the acquisition latent cause may partially drive reinstatement. Yet in the simulation shown in figure 1D, this does not seem to be the case. As I mentioned in the public response, in figure 1, the posteriors for z<sub>A</sub> are similar on day 34 and day 36, yet only on day 36 is there a strong CR. At least in this example, it does not appear that z<sub>A</sub> contributes to the increased responding from day 34 (test 2) to day 36 (test 3). There may be a slight increase in z1 in figure 3C, but the dominant change from day 34 to day 36 appears to be the increase in the posterior of z3 and the substantial increase in w3. The authors then cite several papers which have shown the shift in balance between what is the putative acquisition memory and extinction memory (i.e. Lacagnina et al.). Yet I do not see how this modeling fits with most of the previous findings. For example, in the Lacagnina et al. paper, activation of the acquisition ensemble or inhibition of the extinction ensemble drives freezing, whereas the opposite pattern reduces freezing. What appears to be the pattern in the modeling in this paper is primarily learning of context in the extinction latent cause to predict the shock. As I mention in point 2C of the public review, it's not clear why this pattern of results would require a latent cause model. Would a high alpha for context and not the CS not give a similar pattern of results in the RW model? At least for giving similar results of the DIs in figure 2?

      First, we would like to clarify that the x-axis in Figure 1D is labeled “Trial,” not “Day.” Please refer to the reply to public review (1), where we clarified the posterior probability of the latent cause from trials 34 to 36. Second, although we did not have direct neural circuit evidence in this study, we discussed the similarities between previous findings and the modeling in the first review. Briefly, our main point focuses on the interaction between acquisition and extinction memory. In other words, responses at different times arise from distinct internal states made up of competing memories. We assume that the reviewer expects a modeling result showing nearly full recovery of acquisition memory, which aligns with previous findings where optogenetic activation of the acquisition engram can partially mimic reinstatement (Zaki et al., 2022; see also the response to comment #12 in the first round of review). We acknowledge that such a modeling result cannot be achieved with the latent cause model and see it as a potential future direction for model improvement.

      Please also refer to the reply to public review (2) about how a high alpha for context in the RW model cannot explain the pattern we observed in the reinstatement paradigm.

      (2) This is related to point 3 in the public comments and #13 in the response to reviewers. I raised the question of comparing the preCS/ITI period with the CS period, but my main point was why not include these periods in the LCM itself, as mentioned in more detail in point 3 in the current public review. The inclusion of the comparisons the authors performed helped, but my main point was that the authors could have a better measure of wcontext if they included the preCS period as a stimulus each day (when only the context is included in the stimulus). This would provide better estimates of wcontext. As stated in the public review, perhaps the authors did this, but from my understanding of the methods this was not the case; rather, it seems the authors only included the CS period for CRs within the model (at least on days when the CS was present).

      Please refer to the reply to public review (3) about the reason why we did not model the preCS freezing rate.

      (3) This relates to point 4 in the public review and #15 and #24 in the response to authors. The authors have several points for why the two experiments are similar and how results may be extrapolated - lines 725-733. The first point is that associative learning is fundamental in spatial learning. I'm not sure that this broad connection between the two studies is particularly insightful for why one supports the other as associative learning is putatively involved in most behavioral tasks. In the second point about reversals, why not then use a reversal paradigm that would be easier to model with LCM? This data is certainly valuable and interesting, yet I don't think it's helpful for this paper to state qualitatively the similarities in the potential ways a latent cause framework might predict behavior on the Barnes maze. I would recommend that the authors either model the behavior with LCM, remove the experiment from the paper, or change the framing of the paper that LCM might be an ideal approach for early detection of dementia or Alzheimer's disease.

      We would like to clarify that our aim was not to present the LCM as an ideal tool for early detection of AD symptoms. Rather, our focus is on the broader idea of utilizing internal models and estimating individual internal states in early-stage AD. Regarding the use of a reversal paradigm that would be easier to model with the LCM, the most straightforward approach would be to use another type of fear conditioning paradigm and then examine the extent to which similar behavioral characteristics are observed between paradigms within subjects. However, re-exposing the same mice to such paradigms is constrained by strong carry-over effects, limiting the feasibility of this experiment. Other behavioral tasks relevant to AD that avoid shock generally involve action selection for subsequent observation (Webster et al., 2014), which falls outside the structure of the LCM. Our rationale for including the Barnes maze task is that spatial memory deficit is implicated in the early stage of AD, making it relevant for translational research. While we acknowledge that exact modeling of Barnes maze behavior would require a more sophisticated model (as discussed in the first round of review), our intention in using the reversal Barnes maze paradigm was to probe a presumed memory modification process in a non-fear-conditioning paradigm. We also discussed whether similar deficits in memory modification could be observed across the two behavioral tasks.

      (4) Reviewer # mentioned that the change in pattern of behavior only shows up in the older mice questioning the clinical relevance of early detection. I do think this is a valid point and maybe should be addressed. There does seem to be a bit of a bump in the controls on day 23 that doesn't appear in the 6-month group. Perhaps this was initially a spontaneous recovery test indicated by the dotted vertical line? This vertical line does not appear to be defined in the figure 1 legend, nor in figures 2 and 3.

      We would like to emphasize that the App<sup>NL-G-F</sup> knock-in mouse is widely considered a model of early-stage AD, characterized by Aβ accumulation with little to no neurofibrillary tangle pathology or neuronal loss (see Introduction). By examining different ages, we can assess the contribution of both the amount and duration of Aβ accumulation as well as age-related factors. Modeling the deficit in the memory modification process in the older App<sup>NL-G-F</sup> knock-in mice, we suggested a diverged internal state in early-stage AD in older age, and this does not diminish the relevance of the model for studying early cognitive changes in AD.

      We would also like to clarify again that the x-axis in the figure is “Trial,” not “Day.” The vertical dashed lines in these figures indicate phase boundaries, and they were defined in the figure legend: in Figure 1C, “The vertical dashed lines separate the phases.”; in Figure 2B, “The dashed vertical line separates the extinction 1 and extinction 2 phases.”; in Figure 3, “The vertical dashed lines indicate the boundaries of phases.”

      (5) Are the examples in figure 3 good examples? The example for the 12-month-old control shows a substantial increase in weights for the context during test 3, but not for the CS. Yet in the bar plots in Figure 4 G and H, this pattern seems to be different. The weights for the context appear to substantially drop in the "after extinction" period as compared to the "extinction" period. It's hard to tell the change from "extinction" to "after extinction" for the CS weights (the authors change the y-axis for the CS weights but not for the context weights from panels G to H).

      We would like to clarify that in Figure 3C, the increase in the weights for context does not occur during test 3 (trial 36), as noted by the reviewer; rather, it occurs in the unsignaled shock phase (trial 35).

      We suspect that the reviewer may have read the labels on the left in Figure 4, “Acquisition”, “Extinction”, and “After extinction”, as indicating time points. However, the data shown in Figure 4C to 4H are all from the same time point: test 3 (trial 36). The grouping reflects the classification of latent causes based on the trial in which they were inferred. In addition, for Figures 4G and 4H, the y‐axis limits were not set identically because the data range for “Sum of w<sub>CS</sub>” varied. This was done to ensure the visibility of all data points. In Figure 4, each dot represents one animal. Take Figure 3D as an example: the point in Figure 4G is the sum of w3 and w4 in trial 36, and the point in Figure 4H is w5 in trial 36, where the subscript numerals indicate the latent cause index. We hope this addresses the reviewer’s question about the difference between the two figures.


      The following is the authors’ response to the original reviews

      Reviewer #1 (Public review):

      Summary:

      The authors show certain memory deficits in a mouse knock-in model of Alzheimer's Disease (AD). They show that the observed memory deficits can be explained by a computational model, the latent cause model of associative memory. The memory tasks used include the fear memory task (CFC) and the 'reverse' Barnes maze. Research on AD is important given its known huge societal burden. Likewise, better characterization of the behavioral phenotypes of genetic mouse models of AD is also imperative to advance our understanding of the disease using these models. In this light, I applaud the authors' efforts.

      Strengths:

      (1) Combining computational modelling with animal behavior in genetic knock-in mouse lines is a promising approach, which will be beneficial to the field and potentially explain any discrepancies in results across studies as well as provide new predictions for future work.

      (2) The authors' usage of multiple tasks and multiple ages is also important to ensure generalization across memory tasks and 'modelling' of the progression of the disease.

      Weaknesses:

      [#1] (1) I have some concerns regarding the interpretation of the behavioral results. Since the computational model then rests on the authors' interpretation of the behavioral results, it, in turn, makes judging the model's explanatory power difficult as well. For the CFC data, why do knock-in mice have stronger memory in test 1 (Figure 2C)? Does this mean the knock-in mice have better memory at this time point? Is this explained by the latent cause model? Are there some compensatory changes in these mice leading to better memory? The authors use a discrimination index across tests to infer a deficit in re-instatement, but this indicates a relative deficit in re-instatement from memory strength in test 1. The interpretation of these differential DIs is not straightforward. This is evident when test 1 is compared with test 2, i.e., the time point after extinction, which also shows a significant difference across groups, Figure 2F, in the same direction as the re-instatement. A clarification of all these points will help strengthen the authors' case.

      We appreciate the reviewer for the critical comments. According to the latent cause framework, the strength of the memory is influenced by at least 2 parameters: associative weight between CS and US given a latent cause and posterior probability of the latent cause. The modeling results showed that a higher posterior probability of acquisition latent cause, but not higher associative weight, drove the higher test 1 CR in App<sup>NL-G-F</sup> mice (Results and Discussion; Figure 4 – figure supplement 3B, 3C). In terms of posterior, we agree that App<sup>NL-G-F</sup> mice have strong fear memory. On the other hand, this suggests that App<sup>NL-G-F</sup> mice exhibited a tendency toward overgeneralization, favoring modification of old memories, which adversely affected the ability to retain competing memories. The strong memory in test 1 would be a compensatory effect of overgeneralization.    
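The distinction between these two contributions can be sketched numerically. The example below assumes that the predicted CR is the posterior-weighted sum of each latent cause's US prediction, as in the general latent cause framework; the function name, the two-cause setup, and all weight and posterior values are hypothetical and chosen only for illustration.

```python
import numpy as np


def predicted_cr(posterior, weights, stimulus):
    """Posterior-weighted sum of each latent cause's US prediction."""
    posterior = np.asarray(posterior, dtype=float)
    # One row of weights per latent cause, columns ordered as [context, CS].
    per_cause = np.asarray(weights, dtype=float) @ np.asarray(stimulus, dtype=float)
    return float(posterior @ per_cause)


stimulus = np.array([0.2, 1.0])      # [context, CS]
weights = np.array([[0.1, 0.8],      # acquisition cause (illustrative values)
                    [0.0, 0.1]])     # a second cause (illustrative values)

# Same associative weights, different posterior over latent causes.
low = predicted_cr([0.5, 0.5], weights, stimulus)
high = predicted_cr([0.9, 0.1], weights, stimulus)
print(low, high)
```

With identical associative weights, shifting posterior mass toward the acquisition cause alone raises the predicted CR, which is the pattern described above for test 1.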

      To estimate the magnitude of reinstatement, at least, one would have to compare CRs between test 2 (extinction) and test 3 (reinstatement), as well as those between test 1 (acquisition) and test 3. These comparisons represent the extent to which the memory at the reinstatement is far from that in the extinction, and close to that in the acquisition. Since discrimination index (DI) has been widely used as a normalized measure to evaluate the extent to which the system can distinguish between two conditions, we applied DI consistently to behavioral and simulated data in the reinstatement experiment, and the behavioral data in the reversal Barnes maze experiment, allowing us to evaluate the discriminability of an agent in these experiments. In addition, we used DI to examine its correlation with estimated parameters, enabling us to explore how individual discriminability may relate to the internal state. We have already discussed the differences in DI between test 3 and test 1, as well as CR in test 1 between control and App<sup>NL-G-F</sup> in the manuscript and further elaborated on this point in Line 232, 745-748.   
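For concreteness, the DI can be sketched as follows, assuming the common ratio form DI = CR_a / (CR_a + CR_b), for which 0.5 marks no discrimination between the two tests. The exact definition is the one given in Materials and Methods; the function name and the convention for a zero denominator are hypothetical.

```python
def discrimination_index(cr_a, cr_b):
    """Ratio-form DI between two tests; 0.5 means the two CRs are equal."""
    total = cr_a + cr_b
    if total == 0:
        return 0.5  # assumed convention: no freezing in either test, no discrimination
    return cr_a / total


# Example with percent freezing: 60% at test 3 versus 20% at test 2.
print(discrimination_index(60, 20))
```

A DI above 0.5 between test 3 and test 2 then indicates a higher CR after the unsignaled shock than after extinction.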

      [#2] (2) I have some concerns regarding the interpretation of the Barnes maze data as well, where there already seems to be a deficit in the memory at probe test 1 (Figure 6C). Given that there is already a deficit in memory, would not a more parsimonious explanation of the data be that general memory function in this task is impacted in these mice, rather than the authors' preferred interpretation? How does this memory weakening fit with the CFC data showing stronger memories at test 1? While I applaud the authors for using multiple memory tasks, I am left wondering if the authors tried fitting the latent cause model to the Barnes maze data as well.

      While we agree that the deficits shown in probe test 1 may imply impaired memory function in App<sup>NL-G-F</sup> mice in this task, it would be difficult to explain this solely in terms of impairments in general memory function. The learning curve and the daily strategy changes suggested that App<sup>NL-G-F</sup> mice would have virtually intact learning ability in the initial training phase (Figure 6B, 6F, Figure 6 – figure supplement 1 and 3). For the correspondence relationship between the reinstatement and the reversal Barnes maze learning from the aspect of memory modification process, please also see our reply to comment #24. We have explained why we did not fit the latent cause model to the Barnes maze data in the provisional response.

      [#3] (3) Since the authors use the behavioral data for each animal to fit the model, it is important to validate that the fits for the control vs. experimental groups are similar to the model (i.e., no significant differences in residuals). If that is the case, one can compare the differences in model results across groups (Figures 4 and 5). Some further estimates of the performance of the model across groups would help.

      We have added the residual (i.e., observed CR minus simulated CR) in Figure 3 – figure supplement 1D and 1E. The fit was similar between control and App<sup>NL-G-F</sup> mice groups in the test trials, except test 3 in the 12-month-old group. The residual was significantly higher in the 12-month-old control mice than App<sup>NL-G-F</sup> mice, suggesting the model underestimated the reinstatement in the control, yet the DI calculated from the simulated CR replicates the behavioral data (Figure 3 – figure supplement 1A to 1C). These results suggest that the latent cause model fits our data with little systematic bias such as an overestimation of CR for the control group in the reinstatement, supporting the validity of the comparisons in estimated parameters between groups. These results and discussion have been added in the manuscript Line 269-276.

      One may notice that the latent cause model overestimated the CR in acquisition trials in all groups in Figure 3 – figure supplement 1D and 1E. We have discussed this point in the replies to comments #26 and #34 raised by Reviewer 3.

      [#4] (4) Is there an alternative model the authors considered, which was outweighed in terms of prediction by this model? 

      Yes, we have further evaluated two alternative models: the Rescorla-Wagner (RW; Rescorla & Wagner, 1972) model and the latent state model (LSM; Cochran & Cisler, 2019). The RW model serves as a baseline, given its known limitations in explaining fear return after extinction. The LSM is another contemporary model that shares several concepts with the latent cause model (LCM) such as building upon the RW model, assuming a latent variable inferred by Bayes’ rule, and involving a ruminative update for memory modification. We evaluated the three models in terms of the prediction accuracy and reproducibility of key behavioral features. Please refer to the Appendix 1 for detailed methods and results for these two models.

      As expected, the RW model fit the data well until the end of extinction but failed to reproduce reinstatement (Appendix 1 – figure 1A to 1D). Due to a large prediction error in test 3, few samples met the acceptance criteria we set (Appendix 1 – figure 2 and 3A). Conversely, the LSM reproduced reinstatement, as well as gradual learning in the acquisition and extinction phases, particularly in the 12-month-old control (Appendix 1 – figure 1G). The number of accepted samples in the LSM was higher than in the RW model but generally lower than in the LCM (Appendix 1 – figure 2). The sum of prediction errors over all trials in the LSM was comparable to that in the LCM in the 6-month-old group (Appendix 1 – figure 4A), whereas it was significantly lower in the 12-month-old group (Appendix 1 – figure 4B). In particular, the LSM generated smaller prediction errors during the acquisition trials than the LCM, suggesting that the LSM might be better at explaining behavior during acquisition (Appendix 1 – figure 4A and 4B; but see the reply to comment #34). While the LSM generated smaller prediction errors than the LCM in test 2 of the control group, it failed to replicate the observed DIs, a critical behavioral phenotype difference between control and App<sup>NL-G-F</sup> mice (Appendix 1 – figure 6A to 6C; cf. Figure 2F to 2H, Figure 3 – figure supplement 1A to 1C).

      Thus, although each model could capture different aspects of reinstatement, relying on the LCM to explain reinstatement better aligns with our purpose. It should also be noted that we did not explore the full parameter space of the LSM; hence, we cannot rule out the possibility that alternative parameter sets could provide a better fit and explain the memory modification process well. A more comprehensive parameter search in the LSM may be a valuable direction for future research.

      [#5] One concern here is also parameter overfitting. Did the authors try leaving out some data (trials/mice) and predicting their responses based on the fit derived from the training data?

      Following the reviewer’s suggestion, we examined whether using all trials to estimate parameters led to overfitting. Estimating parameters while actually leaving out trials would disrupt the time lapse across trials, and thereby the prior of latent causes in each trial. Instead, we removed the prediction error constraint for certain trials by setting their error threshold to 1, which virtually leaves these trials out. We treated these trials as a virtual “test” dataset, while the remaining trials, which kept the constraint, were the “training” dataset. For the median CR data of each group (Figure 3), we estimated parameters under 6 conditions with unique training and test trials, then evaluated the prediction error for the training and test trials. Note that the training and test trials were chosen arbitrarily. Also, the error threshold for the acquisition trials was set to 1, as described in Materials and Methods (the reason is discussed further in the reply to comment #34), so we treated acquisition trials separately from the test trials. We expect that the contribution of the data from the acquisition and test trials to parameter estimation would be discounted compared with that from the constrained training trials, and that if overfitting occurred, the prediction error in the test trials would be worse than that in the training trials.
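The virtual leave-out can be sketched as below. This is a simplified illustration, not the estimation code used in the study: it assumes a sample is accepted when the absolute prediction error on every trial is within that trial's threshold, and the baseline threshold of 0.3 and both function names are hypothetical. Because the CR lies between 0 and 1, a threshold of 1 can never reject a sample, which is what removes the constraint for the held-out trials.

```python
import numpy as np


def is_accepted(simulated, observed, thresholds):
    """True if every trial's absolute prediction error is within its threshold."""
    err = np.abs(np.asarray(simulated, dtype=float) - np.asarray(observed, dtype=float))
    return bool(np.all(err <= np.asarray(thresholds, dtype=float)))


def thresholds_with_leave_out(n_trials, held_out, base=0.3):
    """Per-trial thresholds; held-out trials (1-based) get 1, which cannot fail."""
    thr = np.full(n_trials, base)
    thr[np.asarray(held_out, dtype=int) - 1] = 1.0
    return thr


observed = np.zeros(36)
simulated = np.zeros(36)
simulated[35] = 0.6   # large error on trial 36 only

print(is_accepted(simulated, observed, thresholds_with_leave_out(36, [])))        # False
print(is_accepted(simulated, observed, thresholds_with_leave_out(36, [35, 36])))  # True
```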

      Author response image 1A to 1F showed the simulated and observed CR under each condition, where acquisition trials were in light-shaded areas, test trials were in dark-shaded areas, and the rest of the trials were training trials. Author response image 1G showed mean squared prediction error across the acquisition, training and test trials under each condition. The dashed gray line showed the mean squared prediction error of training trials in Figure 3 as a baseline.

      In conditions i and ii, where two or four trials in the extinction were used for training (Author response image 1A and 1B), the prediction error was generally higher in test trials than in training trials. In conditions iii and iv, where ten trials in the extinction were used for training (Author response image 1C and 1D), the difference in prediction error between test and training trials became smaller. These results suggest that providing more extinction trial data would reduce overfitting. In condition v (Author response image 1E), the results showed that using trials until extinction can predict reinstatement in control mice but not App<sup>NL-G-F</sup> mice. Similarly, in condition vi (Author response image 1F), where test phase trials were left out, the prediction error differences were greater in App<sup>NL-G-F</sup> mice. These results suggest that the test trials should be used for the parameter estimation to minimize prediction error for all groups. Overall, this analysis suggests that using all trials reduces prediction error with little overfitting.
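The comparison across trial types amounts to computing the mean squared prediction error separately for acquisition, training and test trials. A minimal sketch follows, with a hypothetical function name and synthetic data; treating trials 1 to 3 as the acquisition trials is an assumption, while the test trials 4, 34 and 36 follow condition vi.

```python
import numpy as np


def mse_by_split(observed, simulated, acquisition, test):
    """Mean squared prediction error for acquisition, training and test trials.

    `acquisition` and `test` are 1-based trial numbers; all remaining
    trials count as training trials.
    """
    sq_err = (np.asarray(observed, dtype=float) - np.asarray(simulated, dtype=float)) ** 2
    trials = np.arange(1, sq_err.size + 1)
    is_acq = np.isin(trials, acquisition)
    is_test = np.isin(trials, test)
    is_train = ~(is_acq | is_test)
    return {name: float(sq_err[mask].mean())
            for name, mask in (("acquisition", is_acq),
                               ("training", is_train),
                               ("test", is_test))}


# Synthetic example: a perfect fit everywhere except trial 36.
obs = np.zeros(36)
sim = np.zeros(36)
sim[35] = 0.5
result = mse_by_split(obs, sim, acquisition=[1, 2, 3], test=[4, 34, 36])
print(result)
```

A test error well above the training error, as in this synthetic case, is the signature of overfitting that the leave-out conditions were designed to detect.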

      Author response image 1.

      Leaving trials out in parameter estimation in the latent cause model. (A – F) The observed CR (colored line) is the median freezing rate during the CS presentation over the mice within each group, which is the same as that in Figure 3. The colors indicate different groups: orange represents 6-month-old control, light blue represents 6-month-old App<sup>NL-G-F</sup> mice, pink represents 12-month-old control, and dark blue represents 12-month-old App<sup>NL-G-F</sup> mice. Under six different leave-out conditions (i – vi), parameters were estimated and used for generating simulated CR (gray line). In each condition, trials were categorized as acquisition (light-shaded area), training data (white area), and test data (dark-shaded area) based on the error threshold during parameter estimation. Only the error threshold of the test data trials differed from the original method (see Materials and Methods) and was set to 1. In conditions i to iv, the number of test data trials in the extinction phases is 27, 25, 19, and 19, respectively. In condition v, the number of test data trials is 2 (trials 35 and 36). In condition vi, the test data trials were the 3 test phases (trials 4, 34, and 36). (G) Each subplot shows the mean squared prediction error for the test data trials (gray circles), training data trials (white squares), and acquisition trials (gray triangles) in each group. The left y-axis corresponds to data from test and training trials, and the right y-axis corresponds to data from acquisition trials. The dashed line indicates the results calculated from Figure 3 as a baseline.
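
      The comparison behind panel G can be sketched as follows. This is only an illustrative sketch, not the authors' analysis code: the function name `mse_by_category`, the category labels, and all CR values are hypothetical. It computes the mean squared prediction error separately for acquisition, training, and test trials, so that a test error well above the training error would flag overfitting.

```python
from statistics import mean

def mse_by_category(observed, simulated, category):
    """Mean squared prediction error, grouped by trial category.

    observed, simulated: per-trial CR values (same length).
    category: per-trial label, e.g. "acquisition", "training", "test".
    """
    squared_errors = {}
    for obs, sim, cat in zip(observed, simulated, category):
        squared_errors.setdefault(cat, []).append((obs - sim) ** 2)
    return {cat: mean(vals) for cat, vals in squared_errors.items()}

# Hypothetical CR values for five trials (not data from the study).
observed = [0.8, 0.6, 0.4, 0.2, 0.5]
simulated = [0.7, 0.6, 0.5, 0.4, 0.1]
category = ["acquisition", "training", "training", "test", "test"]

errors = mse_by_category(observed, simulated, category)
# In this toy example the test error exceeds the training error,
# which is the pattern one would read as a sign of overfitting.
overfitting_suspected = errors["test"] > errors["training"]
```

      In this toy input the test trials have the larger error, mirroring the pattern described for conditions i and ii.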

      Reviewer #1 (Recommendations for the authors):

      Minor:

      [#6] (1) I would like the authors to further clarify why 'explaining' the reinstatement deficit in the AD mouse model is important in working towards the understanding of AD i.e., which aspect of AD this could explain etc.

      In this study, we utilized the reinstatement paradigm with the latent cause model as an internal model to illustrate how estimating internal states can improve understanding of cognitive alteration associated with extensive Aβ accumulation in the brain. Our findings suggest that misclassification in the memory modification process, manifesting as overgeneralization and overdifferentiation, underlies the memory deficit in the App<sup>NL-G-F</sup> knock-in model mice. 

      The parameters in the internal model associated with AD pathology (e.g., α and σ<sub>x</sub><sup>2</sup> in this study) can be viewed as computational phenotypes, filling the explanatory gap between neurobiological abnormalities and cognitive dysfunction in AD. This would advance the understanding of cognitive symptoms in the early stages of AD beyond conventional behavioral endpoints alone.

      We further propose that altered internal states in App<sup>NL-G-F</sup> knock-in mice may underlie a wide range of memory-related symptoms in AD as we observed that App<sup>NL-G-F</sup> knock-in mice failed to retain competing memories in the reversal Barnes maze task. We speculate on how overgeneralization and overdifferentiation may explain some AD symptoms in the manuscript:

      - Line 565-569: overgeneralization may explain deficits in discriminating highly similar visual stimuli reported in early-stage AD patients, as they misclassify the lure as a previously learned object.

      - Line 576-579: overdifferentiation may explain the impaired ability to transfer previously learned association rules in early-stage AD patients, as they misclassify them as separate knowledge.

      - Line 579-582: overdifferentiation may explain delusions in AD patients, as an extended latent cause model could simulate the emergence of delusional thinking.

      We provide one more example here: overgeneralization may explain why early-stage AD patients are more susceptible to proactive interference than cognitively normal elders in semantic memory tests (Curiel Cid et al., 2024; Loewenstein et al., 2015, 2016; Valles-Salgado et al., 2024), as they are more likely to infer previously learned material. Lastly, we expect that explaining memory-related symptoms within a unified framework may facilitate future hypothesis generation and contribute to the development of strategies for detecting the earliest cognitive alteration in AD.

      [#7] (2) The authors state in the abstract/introduction that such computational modelling could be most beneficial for the early detection of memory disorders. The deficits observed here are pronounced in the older animals. It will help to further clarify if these older animals model the early stages of the disease. Do the authors expect severe deficits in this mouse model at even later time points?

      The early stage of the disease is marked by abnormal biomarkers associated with Aβ accumulation and neuroinflammation, while cognitive symptoms are mild or absent. This stage can persist for several years, during which the level of Aβ may reach a plateau. As the disease progresses, tau pathology and neurodegeneration emerge and drive the transition into the late stage and the onset of dementia. The App<sup>NL-G-F</sup> knock-in mice recapitulate the features present in the early stage (Saito et al., 2014), where extensive Aβ accumulation and neuroinflammation worsen with age (Figure 2 – figure supplement 1). Because App<sup>NL-G-F</sup> knock-in mice primarily model Aβ pathology without tauopathy or neurodegeneration, they do not represent the full spectrum of the disease even at advanced ages. Therefore, older animals still model the early stage of the disease and are suitable for studying the long-term effects of Aβ accumulation and neuroinflammation.

      Previous reports using App<sup>NL-G-F</sup> mice tested a wide range of ages, from 2 to 24 months old. Behavioral tasks vary in sensitivity, but overall they suggest that the dysfunction worsens with aging (Bellio et al., 2024; Mehla et al., 2019; Sakakibara et al., 2018). We previously ran the reinstatement experiment with 17-month-old App<sup>NL-G-F</sup> mice (Author response image 2). They showed more advanced deficits with the same trends observed in 12-month-old App<sup>NL-G-F</sup> mice, but their freezing rates were lower overall. Because possible hearing loss at this age may affect the results and their interpretation, we decided to focus on the 12-month-old data.

      Author response image 2.

      Freezing rate across reinstatement paradigm in the 17-month-old App<sup>NL-G-F</sup> mice. Dashed and solid lines indicate the median freezing rate over 34 mice before (preCS) and during (CS) tone presentation, respectively. Red, blue, and yellow backgrounds represent acquisition, extinction, and unsignaled shock in Figure 2A. The dashed vertical line separates the extinction 1 and extinction 2 phases.

      [#8] (3) There are quite a few 'marginal' p-values in the paper at p>0.05 but near it. Should we accept them all as statistically significant? The authors need to clarify if all the experimental groups are sufficiently powered.

      For our study, we decided a priori that p < 0.05 would be considered statistically significant, as described in the Materials and Methods. Therefore, in our Results, we did not consider these marginal values as statistically significant but reported the trend, as they may indicate substantive significance.

      We described our power analysis method in the manuscript Line 897-898 and have provided the results in Tables S21 and S22.

      [#9] (4) The authors emphasize here that such computational modelling enables us to study the underlying 'reasoning' of the patient (in the abstract and introduction), I do not see how this is the case. The model states that there is a latent i.e. another underlying variable that was not previously considered.

      Our use of the term “reasoning” was to distinguish the internal model, which describes how an agent makes sense of the world, from other generative models implemented for biomarker and disease progression prediction. However, we agree that using “reasoning” may be misleading and imprecise, so to reduce ambiguity we have removed this word in our manuscript Line 27: Nonetheless, internal models of the patient remain underexplored in AD; Line 85: However, previous approaches did not suppose an internal model of the world to predict future from current observation given prior knowledge.   

      [#10] (5) The authors combine knock-in mice with controls to compute correlations of parameters of the model with behavior of animals (e.g. Figure 4B and Figure 5B). They run the risk of spurious correlations due to differences across groups, which they have indeed shown to exist (Figure 4A and 5A). It would help to show within-group correlations between DI and parameter fit, at least for the control group (which has a large spread of data).

      We agree that genotype (control, App<sup>NL-G-F</sup>) could be a confounder between the estimated parameters and DI, thereby generating spurious correlations. To address this concern, we have provided within-group correlations in Figure 4 – figure supplement 2 for the 12-month-old group and Figure 5 – figure supplement 2 for the 6-month-old group.

      In the 12-month-old group, the significant positive correlation between σ<sub>x</sub><sup>2</sup> and DI remained in both control and App<sup>NL-G-F</sup> mice even after adjusting for the genotype effect, suggesting that the correlations in Figure 4B are very unlikely to be due to genotype-related confounding. On the other hand, the positive correlation between α and DI was significant in the control mice but not in the App<sup>NL-G-F</sup> mice. Most α values were distributed around the lower bound in App<sup>NL-G-F</sup> mice, which possibly reduced the variance and the correlation coefficient. These results support our original conclusion that α and σ<sub>x</sub><sup>2</sup> are parameters associated with a lower magnitude of reinstatement in aged App<sup>NL-G-F</sup> mice.

      In the 6-month-old group, the correlations shown in Figure 5B were not preserved within subgroups, suggesting genotype would be a confounder for α, σ<sub>x</sub><sup>2</sup>, and DI. We recognized that significant correlations in Figure 5B may arise from group differences, increased sample size, or greater variance after combining control and App<sup>NL-G-F</sup> mice. 

      Therefore, we concluded that α and σ<sub>x</sub><sup>2</sup> are associated with the magnitude of reinstatement but modulated by the genotype effect depending on the age. 

      We have added interpretations of within-group correlation in the manuscript Line 307-308, 375-378.

      [#11] (6) It is unclear to me why overgeneralization of internal states will lead to the animals having trouble recalling a memory. Would this not lead to overgeneralization of memory recall instead?

      We assume that the reviewer is referring to “overgeneralization of internal states,” a case in which the animal’s internal state remained the same regardless of the observation, thereby leading to “overgeneralization of memory recall.” We agree that this could be one possible situation and appears less problematic than the case in which this memory is no longer retrievable. 

      However, in our manuscript, we did not deal with the case of “overgeneralization of internal states”. Rather, our findings illustrated how the memory modification process falls into overgeneralization or overdifferentiation and how it adversely affects the retention of competing memories, thereby causing App<sup>NL-G-F</sup> mice to have trouble recalling the same memory as the control mice. 

      According to the latent cause model, retrieval failure is explained by a mismatch of internal states: when an agent perceives that the current cue does not match a previously experienced one, the old latent cause is less likely to be inferred due to its low likelihood (Gershman et al., 2017). For example, if a mouse exhibited higher CR in test 2, it would be interpreted as a successful fear memory retrieval due to overgeneralization of the fear memory. However, it also reflects a failure of extinction memory retrieval due to the mismatch between the internal states at extinction and test 2. This example shows how overgeneralization of memory can induce a failure of memory retrieval.

      On the other hand, App<sup>NL-G-F</sup> mice exhibited higher CR in test 1, which is conventionally interpreted as a successful fear memory retrieval. When estimating their internal states, they would infer that their observation in test 1 closely matches those under the acquisition latent causes; that is, the fear memory is overgeneralized, as shown by a higher posterior probability of the acquisition latent causes in test 1 (Figure 4 – figure supplement 3). This example shows that overgeneralization of memory does not always induce retrieval failure, as we explained in the reply to comment #1.

      Reviewer #2 (Public review):

      Summary:

      This manuscript proposes that the use of a latent cause model for the assessment of memory-based tasks may provide improved early detection of Alzheimer's Disease as well as more differentiated mapping of behavior to underlying causes. To test the validity of this model, the authors use a previously described knock-in mouse model of AD and subject the mice to several behaviors to determine whether the latent cause model may provide informative predictions regarding changes in the observed behaviors. They include a well-established fear learning paradigm in which distinct memories are believed to compete for control of behavior. More specifically, it's been observed that animals undergoing fear learning and subsequent fear extinction develop two separate memories for the acquisition phase and the extinction phase, such that the extinction does not simply 'erase' the previously acquired memory. Many models of learning require the addition of a separate context or state to be added during the extinction phase and are typically modeled by assuming the existence of a new state at the time of extinction. The Niv research group, Gershman et al. 2017, have shown that the use of a latent cause model applied to this behavior can elegantly predict the formation of latent states based on a Bayesian approach, and that these latent states can facilitate the persistence of the acquisition and extinction memory independently. The authors of this manuscript leverage this approach to test whether deficits in the production of the internal states, or the inference and learning of those states, may be disrupted in knock-in mice that show both a build-up of amyloid-beta plaques and a deterioration in memory as the mice age.

      Strengths:

      I think the authors' proposal to leverage the latent cause model and test whether it can lead to improved assessments in an animal model of AD is a promising approach for bridging the gap between clinical and basic research. The authors use a promising mouse model and apply this to a paradigm in which the behavior and neurobiology are relatively well understood - an ideal situation for assessing how a disease state may impact both the neurobiology and behavior. The latent cause model has the potential to better connect observed behavior to underlying causes and may pave a road for improved mapping of changes in behavior to neurobiological mechanisms in diseases such as AD.

      Weaknesses:

      I have several substantial concerns which I've detailed below. These include important details on how the behavior was analyzed, how the model was used to assess the behavior, and the interpretations that have been made based on the model.

      [#12] (1) There is substantial data to suggest that during fear learning in mice separate memories develop for the acquisition and extinction phases, with the acquisition memory becoming more strongly retrieved during spontaneous recovery and reinstatement. The Gershman paper, cited by the authors, shows how the latent causal model can predict this shift in latent states by allowing for the priors to decay over time, thereby increasing the posterior of the acquisition memory at the time of spontaneous recovery. In this manuscript, the authors suggest a similar mechanism of action for reinstatement, yet the model does not appear to return to the acquisition memory state after reinstatement, at least based on the examples shown in Figures 1 and 3. Rather, the model appears to mainly modify the weights in the most recent state, putatively the 'extinction state', during reinstatement. Of course, the authors must rely on how the model fits the data, but this seems problematic based on prior research indicating that reinstatement is most likely due to the reactivation of the acquisition memory. This may call into question whether the model is successfully modeling the underlying processes or states that lead to behavior and whether this is a valid approach for AD.

      We thank the reviewer for insightful comments. 

      We agree that, as demonstrated in Gershman et al. (2017), the latent cause model accounts for spontaneous recovery via the inference of new latent causes during extinction and the temporal compression property provided by the prior. Moreover, it was also demonstrated that even a relatively low posterior can drive behavioral expression if the weight in the acquisition latent cause is preserved. For example, when the interval between retrieval and extinction was long enough that acquisition latent cause was not dominant during extinction, spontaneous recovery was observed despite the posterior probability of acquisition latent cause (C1) remaining below 0.1 in Figure 11D of Gershman et al. (2017). 

      In our study, a high response in test 3 (reinstatement) is explained by both the acquisition and extinction latent causes. The former preserves the associative weight of the initial fear memory, while the latter has w<sub>context</sub> learned in the unsignaled shock phase. These positive weights (w) were weighted by their posterior probabilities and together contributed to the increased expected shock in test 3. Although the posterior probability of the acquisition latent cause was lower than that of the extinction latent cause in test 3 due to the passage of time, this parallels the instance mentioned above. To clarify their contributions to reinstatement, we have conducted additional simulations, which we discuss in the reply to the reviewer’s next comment (see the reply to comment #13).
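
      The posterior-weighted combination described above can be illustrated with a minimal sketch. It assumes the standard latent cause form in which the expected shock is the sum over latent causes of the posterior probability times the summed associative weights of the stimuli present (following Gershman et al., 2017). The function name `expected_shock` and every number below are hypothetical and are not estimates from this study; the mapping from expected shock to CR is omitted.

```python
def expected_shock(posterior, weights, stimulus):
    """Posterior-weighted shock prediction.

    posterior: P(latent cause k) for each cause k.
    weights:   weights[k][d], associative weight of stimulus feature d under cause k.
    stimulus:  stimulus[d] is 1 if feature d is present, else 0.
    """
    return sum(
        p * sum(w_d * x_d for w_d, x_d in zip(w, stimulus))
        for p, w in zip(posterior, weights)
    )

# Hypothetical values; feature order is [CS, context].
posterior = [0.2, 0.8]   # [acquisition cause, extinction cause] in test 3
weights = [
    [0.9, 0.1],          # acquisition cause: preserved CS weight
    [0.0, 0.5],          # extinction cause: context weight from unsignaled shock
]

v_context_cs = expected_shock(posterior, weights, [1, 1])    # about 0.60
v_context_only = expected_shock(posterior, weights, [0, 1])  # about 0.42
```

      Even with a low posterior for the acquisition cause, its preserved CS weight raises the prediction for context plus CS above that for context alone, which is the qualitative point made in the reply.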

      We recognize that our results might appear to deviate from the notion that reinstatement results from the strong reactivation of acquisition memory, where one would expect a high posterior probability of the acquisition latent cause. However, we would like to emphasize that the return of fear emerges from the interplay of competing memories. Previous studies have shown that contextual or cued fear reinstatement involves a neural activity switch back to fear state in the medial prefrontal cortex (mPFC), including the prelimbic cortex and infralimbic cortex, and the amygdala, including ventral intercalated amygdala neurons (ITCv), medial subdivision of central nucleus of the amygdala (CeM), and the basolateral amygdala (BLA) (Giustino et al., 2019; Hitora-Imamura et al., 2015; Zaki et al., 2022). We speculate that such transition is parallel to the internal states change in the latent cause model in terms of posterior probability and associative weight change.

      Optogenetic manipulation experiments have further revealed how fear and extinction engrams contribute to extinction retrieval and reinstatement. For instance, Gu et al. (2022) used a cued fear conditioning paradigm and found that inhibition of extinction engrams in the BLA, ventral hippocampus (vHPC), and mPFC after extinction learning artificially increased freezing to the tone cue. Similar results were observed in contextual fear conditioning, where silencing extinction engrams in the hippocampal dentate gyrus (DG) impaired extinction retrieval (Lacagnina et al., 2019). These results suggest that weakening the extinction memory can induce a return of the fear response even without a reminder shock. On the other hand, Zaki et al. (2022) showed that inhibition of fear engrams in the BLA, DG, or hippocampal CA1 attenuated contextual fear reinstatement. However, they also reported that stimulation of these fear engrams was not sufficient to induce reinstatement, suggesting that these fear engrams only partially account for reinstatement.

      In summary, reinstatement likely results from bidirectional changes in the fear and extinction circuits, supporting our interpretation that both acquisition and extinction latent causes contribute to the reinstatement. Although it remains unclear whether these memory engrams represent latent causes, one possible interpretation is that w<sub>context</sub> update in extinction latent causes during unsignaled shock indicates weakening of the extinction memory, while preservation of w in acquisition latent causes and their posterior probability suggests reactivation of previous fear memory. 

      [#13] (2) As stated by the authors in the introduction, the advantage of the fear learning approach is that the memory is modified across the acquisition-extinction-reinstatement phases. Although perhaps not explicitly stated by the authors, the post-reinstatement test (test 3) is the crucial test for whether there is reactivation of a previously stored memory, with the general argument being that the reinvigorated response to the CS can't simply be explained by relearning the CS-US pairing, because re-exposure the US alone leads to increase response to the CS at test. Of course there are several explanations for why this may occur, particularly when also considering the context as a stimulus. This is what I understood to be the justification for the use of a model, such as the latent cause model, that may better capture and compare these possibilities within a single framework. As such, it is critical to look at the level of responding to both the context alone and to the CS. It appears that the authors only look at the percent freezing during the CS, and it is not clear whether this is due to the contextual US learning during the US re-exposure or to increased response to the CS - presumably caused by reactivation of the acquisition memory. For example, the instance of the model shown in Figure 1 indicates that the 'extinction state', or state z6, develops a strong weight for the context during the reinstatement phase of presenting the shock alone. This state then leads to increased freezing during the final CS probe test as shown in the figure. By not comparing the difference in the evoked freezing CR at the test (ITI vs CS period), the purpose of the reinstatement test is lost in the sense of whether a previous memory was reactivated - was the response to the CS restored above and beyond the freezing to the context? 
      I think the authors must somehow incorporate these different phases (CS vs ITI) into their model, particularly since this type of memory retrieval that depends on assessing latent states is specifically why the authors justified using the latent causal model.

      To clarify the contribution of context, we have provided the preCS freezing rate across trials in Figure 2 – figure supplement 2. As the reviewer pointed out, the preCS freezing rate did not remain at the same level across trials, especially within the 12-month-old control and App<sup>NL-G-F</sup> groups (Figure 2 – figure supplement 2A and 2B), suggesting an effect of context. A paired-samples t-test comparing preCS freezing (Figure 2 – figure supplement 2E) and CS freezing (Figure 2E) in test 3 revealed significant differences in all groups: 6-month-old control, t(23) = -6.344, p < 0.001, d = -1.295; 6-month-old App<sup>NL-G-F</sup>, t(24) = -4.679, p < 0.001, d = -0.936; 12-month-old control, t(23) = -4.512, p < 0.001, d = -0.921; 12-month-old App<sup>NL-G-F</sup>, t(24) = -2.408, p = 0.024, d = -0.482. These results indicate that the response to the CS was above and beyond the response to the context alone. We also compared the change in freezing rate (CS freezing rate minus preCS freezing rate) between test 2 and test 3 to examine the net response to the tone. A significant difference was found in the control groups, but not in the App<sup>NL-G-F</sup> groups (Author response image 3). The increased net response to the tone in the control groups suggests that the reinstatement was partially driven by reactivation of the acquisition memory, not solely by the contextual US learning during the unsignaled shock phase. We have added these results and discussion in the manuscript Line 220-231.

      Author response image 3.

      Net freezing rate in test 2 and test 3. Net freezing rate is defined as the CS freezing rate (i.e., freezing rate during 1 min CS presentation) minus the preCS freezing rate (i.e., 1 min before CS presentation). The dashed horizontal line indicates no freezing rate change from the preCS period to the CS presentation. *p < 0.05 by paired-sample Student’s t-test, and the alternative hypothesis specifies that test 2 freezing rate change is less than test 3. Colors indicate different groups: orange represents 6-month-old control (n = 24), light blue represents 6-month-old App<sup>NL-G-F</sup> mice (n = 25), pink represents 12-month-old control (n = 24), and dark blue represents 12-month-old App<sup>NL-G-F</sup> mice (n = 25). Each black dot represents one animal. Statistical results were as follows: t(23) = -1.927, p = 0.033, Cohen’s d = -0.393 in 6-month-old control; t(24) = -1.534, p = 0.069, Cohen’s d = -0.307 in 6-month-old App<sup>NL-G-F</sup>; t(23) = -1.775, p = 0.045, Cohen’s d = -0.362 in 12-month-old control; t(24) = 0.86, p = 0.801, Cohen’s d = 0.172 in 12-month-old App<sup>NL-G-F</sup>.
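
      As a reading aid for the paired-sample statistics, the sketch below shows the relation between t and Cohen’s d when d is computed as the mean of the paired differences divided by their standard deviation, in which case d = t / √n. That definition of d is an assumption on our part (it is not stated here), but it matches the reported values. The function names and the toy freezing rates are hypothetical.

```python
from math import sqrt
from statistics import mean, stdev

def paired_t_and_d(first, second):
    """Paired t statistic and Cohen's d (mean difference / SD of differences)."""
    diffs = [a - b for a, b in zip(first, second)]
    n = len(diffs)
    d = mean(diffs) / stdev(diffs)   # stdev uses the n - 1 denominator
    t = d * sqrt(n)                  # equivalently mean / (sd / sqrt(n))
    return t, d

def d_from_t(t, n):
    """Recover the paired-sample Cohen's d from a reported t and sample size."""
    return t / sqrt(n)

# Toy preCS and CS freezing rates (%) for four hypothetical animals.
pre_cs = [10, 20, 15, 5]
cs = [30, 35, 40, 20]
t_toy, d_toy = paired_t_and_d(pre_cs, cs)

# Check against a reported pair: t(23) = -6.344 with n = 24 animals.
d_reported = round(d_from_t(-6.344, 24), 3)   # → -1.295
```

      The same relation can be used to check the sign and magnitude of any d reported alongside a paired t statistic and its group size.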

      According to the latent cause model, if the reinstatement is merely induced by an association between the context and the US in the unsignaled shock phase, the CR given context only and that given context and CS in test 3 should be equal. However, the simulation conducted for each mouse using their estimated parameters confirmed that this was not the case in this study. The results showed that simulated CR was significantly higher in the context+CS condition than in the context only condition (Author response image 4). This trend is consistent with the behavioral results we mentioned above.

      Author response image 4.

      Simulation of context effect in test 3. Estimated parameter sets of each sample were used to run the simulation that only context or context with CS was present in test 3 (trial 36). The data are shown as median with interquartile range, where white bars with colored lines represent CR for context only and colored bars represent CR for context with CS. Colors indicate different groups: orange represents 6-month-old control (n = 15), light blue represents 6-month-old App<sup>NL-G-F</sup> mice (n = 12), pink represents 12-month-old control (n = 20), and dark blue represents 12-month-old App<sup>NL-G-F</sup> mice (n = 18). Each black dot represents one animal. **p < 0.01, and ***p < 0.001 by Wilcoxon signed-rank test comparing context only and context + CS in each group, and the alternative hypothesis specifies that CR in context is not equal to CR in context with CS. Statistical results were as follows: W = 15, p = 0.008, effect size r = -0.66 in 6-month-old control; W = 0, p < 0.001, effect size r = -0.88 in 6-month-old App<sup>NL-G-F</sup>; W = 25, p = 0.002, effect size r = -0.67 in 12-month-old control; W = 9, p = 0.002, effect size r = -0.75 in 12-month-old App<sup>NL-G-F</sup>.

      [#14] (3) This is related to the second point above. If the question is about the memory processes underlying memory retrieval at the test following reinstatement, then I would argue that the model parameters that are not involved in testing this hypothesis be fixed prior to the test. Unlike the Gershman paper that the authors cited, the authors fit all parameters for each animal. Perhaps the authors should fit certain parameters on the acquisition and extinction phase, and then leave those parameters fixed for the reinstatement phase. To give a more concrete example, if the hypothesis is that AD mice have deficits in differentiating or retrieving latent states during reinstatement which results in the low response to the CS following reinstatement, then perhaps parameters such as the learning rate should be fixed at this point. The authors state that the 12-month-old AD mice have substantially lower learning rate measures (almost a 20-fold reduction!), which can be clearly seen in the very low weights attributed to the AD mouse in Figure 3D. Based on the example in Figure 3D, it seems that the reduced learning rate in these mice is most likely caused by the failure to respond at test. This is based on comparing the behavior in Figures 3C to 3D. The acquisition and extinction curves appear extremely similar across the two groups. It seems that this lower learning rate may indirectly be causing most of the other effects that the authors highlight, such as the low σx, and the changes to the parameters for the CR. It may even explain the extremely high K. Because the weights are so low, this would presumably lead to extremely low likelihoods in the posterior estimation, which I guess would lead to more latent states being considered as the posterior would be more influenced by the prior.

      We thank the reviewer for the suggestion about fitting and fixing certain parameters in different phases.

      However, this strategy may not be optimal for our study for the following scientific reasons.

      Our primary purpose is to explore internal states in the memory modification process that are associated with the deficit found in App<sup>NL-G-F</sup> mice in the reinstatement paradigm. We did not restrict the question to memory retrieval, nor did we have a particular hypothesis such that only a few parameters of interest account for the impaired associative learning or structure learning in App<sup>NL-G-F</sup> mice while all other parameters are comparable between groups. We are concerned that restricting questions to memory retrieval at the test is too parsimonious and might lead to misinterpretation of the results. As we explain in reply to comment #5, removing trials in extinction during parameter estimation reduces the model fit performance and runs the risk of overfitting within the individual. Therefore, we estimated all parameters for each animal, with the assumption that the estimated parameter set represents individual internal state (i.e., learning and memory characteristics) and should be fixed within the animal across all trials.  

      Figure 3 shows the parameter estimation and simulation results obtained by treating the median data of each group as an individual. The estimated parameter values represent one possible case in that group and demonstrate how a typical learning curve fits the latent cause model. The “20-fold reduction in learning rate” that the reviewer mentioned is a comparison of two data points, not an actual comparison between groups. The comparison between control and App<sup>NL-G-F</sup> mice in the 12-month-old group for all parameters is provided in Table S7. The Mann-Whitney U test did not reveal a significant difference in learning rate (η): 12-month-old control (Mdn = 0.09, IQR = 0.23) vs. 12-month-old App<sup>NL-G-F</sup> (Mdn = 0.12, IQR = 0.23), U = 199, p = 0.587.

      We agree that a lower learning rate could bias learning toward inferring a new latent cause. However, this tendency may depend on the values of other parameters and vary across the phases of the reinstatement paradigm. Here, we use α as an example and demonstrate its interaction with η in Appendix 2 – table 2 using relatively extreme values, α = {1, 3} and η = {0.01, 0.5}, with the rest of the parameters fixed at their initial guess values.

      When ⍺ = 1, the number of latent causes in each phase (K<sub>acq</sub>, K<sub>ext</sub>, K<sub>rem</sub>) remained unchanged and their posterior probabilities in test 3 were comparable even when η increased from 0.01 to 0.5. This is an example in which a lower η does not lead to inferring new latent causes because ⍺ is low. The effect of a low learning rate manifests in the test 3 CR through low w<sub>context, acq</sub> and w<sub>context, ext</sub>.

      When ⍺ = 3, the number of acquisition latent causes (K<sub>acq</sub>) was higher for η = 0.01 than for η = 0.5, showing the effect mentioned by the reviewer. However, the test 1 CR is much lower when η = 0.01, indicating unsuccessful learning even after inferring a new latent cause. This case was not observed in this study. During the extinction phases, the effect of η is outweighed by the effect of high ⍺: the number of extinction latent causes (K<sub>ext</sub>) is high and not affected by η. After the extinction phases, the effect of K emerges as the total number of latent causes reaches this upper limit (K = 33 in this example), especially in the case of η = 0.01. A new latent cause is inferred after extinction in the condition of η = 0.5, but the test 3 CR is still high because w<sub>context, acq</sub> and w<sub>context, ext</sub> are high. This is an example in which a new latent cause is inferred in spite of a higher η.

      Overall, the learning rate would not have a prominent effect alone throughout the reinstatement paradigm, and it has a joint effect with other parameters. Note that the example here did not cover our estimated results, as the estimated learning rate was not significantly different between control and App<sup>NL-G-F</sup> mice (see above). Please refer to the reply to comment #31 for more discussion about the interaction among parameters when the learning rate is fixed. We hope this clarifies the reviewer’s concern.

      [#15] (4) Why didn't the authors use the latent causal model on the Barnes maze task? The authors mention in the discussion that different cognitive processes may be at play across the two tasks, yet reversal tasks have been suggested to be solved using latent states to be able to flip between the two different task states. In this way, it seems very fitting to use the latent cause model. Indeed, it may even be a better way to assess changes in σx as there are presumably 12 observable stimuli/locations.

      Please refer to our provisional response about the application of the latent cause model to the reversal Barnes maze task. Briefly, it would be difficult to directly apply the latent cause model to the Barnes maze data because this task involves operant learning, and thereby almost all conditions in the latent cause model are not satisfied. Please also see our reply to comment #24 for the discussion of the link between the latent cause model and Barnes maze task. 

      Reviewer #2 (Recommendations for the authors):

      [#16] (1) I had a bit of difficulty finding all the details of the model. First, I had to mainly rely on the Gershman 2017 paper to understand the model. Even then, there were certain aspects of the model that were not clear. For instance, it's not quite clear to me when the new internal states are created and how the maximum number of states is determined. After reading the authors' methods and the Gershman paper, it seems that a new internal state is generated at each time point, aka zt, and that the prior for that state decays onwards from alpha. Yet because most 'new' internal states don't ever take on much of a portion of the posterior, most of these states can be ignored. Is that a correct understanding? To state this another way, I interpret the equation on line 129 to indicate that the prior is determined by the power law for all existing internal states and that each new state starts with a value of alpha, yet I don't see the rule for creating a new state, or for iterating k other than that k iterates at each timestep. Yet this seems to not be consistent with the fact that the max number of states K is also a parameter fit. Please clarify this, or point me to where this is better defined.

      I find this to be an important question for the current paper as it is unclear to me when the states were created. Most notably, in Figure 3, it's important to understand why there's an increase in the posterior of z<sub>5</sub> in the AD 12-month mice at test. Is state z<sub>5</sub> generated at trial 5? If so, the prior would be extremely small by trial 36, making it even more perplexing why z<sub>5</sub> has such a high posterior. If its weights are similar to z<sub>3</sub> and z<sub>4</sub>, and they have been much more active recently, why would z<sub>5</sub> come into play?

      We assume that the “new internal state" the reviewer is referring to is the “new latent cause." We would like to clarify that “internal state" in our study refers to all the latent causes at a given time point and observation. As this manuscript is submitted as a Research Advance article in eLife, we did not rephrase all the model details. Here, we explain when a new latent cause is created (i.e., the prior probability of a new latent cause is greater than 0) with the example of the 12-month-old group (Figure 3C and 3D). 

      Suppose that before the start of each trial, an agent inferred the most likely latent cause with maximum posterior, and it inferred k latent causes so far. A new latent cause can be inferred at the computation of the prior of latent causes at the beginning of each trial.  

      In the latent cause model, the prior follows a distance-dependent Chinese Restaurant Process (CRP; Blei and Frazier, 2011). The prior of each old latent cause is based on its posterior probability, which is the final count of the EM update before the current trial. In addition, the prior of old latent causes is sensitive to the passage of time, so that it exponentially decreases as a forgetting function modulated by g (see Figure 2 in Gershman et al., 2017). Simultaneously, the prior of a new cause is assigned ⍺. The new latent cause is inferred at this moment. Hence, the prior of each latent cause is jointly determined by ⍺, g, and its posterior probability. The maximum number of latent causes K is set a priori and does not affect the prior while k < K (see also the reply to comment #30 for the discussion of the boundary set for K and comment #31 for the discussion of the interaction between ⍺ and K). Note that only one new latent cause can be inferred in each trial, and the (k+1)<sup>th</sup> latent cause, which has never been inferred so far, is chosen as the new latent cause.
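The prior computation described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' code: the function name and array layout are hypothetical, and the temporal kernel is assumed here to be elapsed time raised to the power −g; any decreasing forgetting function could be substituted.

```python
import numpy as np


def latent_cause_prior(post_history, trial_times, t_now, alpha, g=1.0):
    """Prior over K latent causes at time t_now (illustrative sketch).

    post_history: (n_trials, K) posterior of each cause on past trials.
    trial_times:  times of the past trials, all earlier than t_now.
    """
    post_history = np.asarray(post_history, dtype=float)
    elapsed = t_now - np.asarray(trial_times, dtype=float)
    kernel = elapsed ** (-g)              # assumed forgetting kernel
    prior = kernel @ post_history         # time-weighted counts of old causes
    unused = np.flatnonzero(prior == 0)   # causes never inferred so far
    if unused.size > 0:                   # i.e. k < K: offer one new cause
        prior[unused[0]] = alpha
    return prior / prior.sum()


history = [[1, 0, 0], [1, 0, 0]]          # one cause used on two trials
short_gap = latent_cause_prior(history, [1, 2], t_now=3, alpha=1.0)
long_gap = latent_cause_prior(history, [1, 2], t_now=26, alpha=1.0)
```

In this toy case the share of the prior given to the new cause grows with the gap since the last trial, and once every one of the K slots has been used no new cause is offered.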

      In our manuscript, the subscript number in z<sub>k</sub> denotes the order in which they were inferred, not the trial number. In Figures 3C and 3D, z<sub>3</sub> and z<sub>4</sub> were inferred in trials 5 and 6 during extinction; z<sub>5</sub> is a new latent cause inferred in trial 36. Therefore, the prior of z<sub>5</sub> is not extremely small compared to z<sub>4</sub> and z<sub>3</sub>.

      In both control and App<sup>NL-G-F</sup> mice in the 12-month-old group (Figures 3C and 3D), z<sub>3</sub> is dominant until trial 35. The unsignaled shock at trial 35 generates a large prediction error, as only the context is presented and followed by the US. This prediction error reduces the posterior of z<sub>3</sub>, while increasing the posterior of z<sub>4</sub> and w<sub>context</sub> in z<sub>3</sub> and z<sub>4</sub>. This decrease in the posterior of z<sub>3</sub> is more obvious in the App<sup>NL-G-F</sup> group than in the control group, prompting the former to infer a new latent cause z<sub>5</sub> (Figures 3C and 3D). Although Figures 3C and 3D are illustrative examples, as we explained in the reply to comment #14, this interpretation would be plausible because the App<sup>NL-G-F</sup> group inferred a significantly larger number of latent causes after extinction, with slightly higher posteriors, than the control group (Figure 4E).

      [#17] (2) Related to the above, Are the states z<sub>A</sub> and z<sub>B</sub> defined by the authors to help the reader group the states into acquisition and extinction states, or are they somehow grouped by the model? If the latter is true, I don't understand how this would occur based on the model. If the former, could the authors state that these states were grouped together by the author?

      We used z<sub>A</sub> and z<sub>B</sub> annotations to assist with the explanation, so this is not grouped by the model. We have stated this in the manuscript Line 181-182.

      [#18] (3) This expands on the third point above. In Figure 3D, internal states z<sub>3</sub>, z<sub>4</sub>, and z<sub>5</sub> appear to be pretty much identical in weights in the App group. It's not clear to me why then the posterior of z<sub>5</sub> would all of a sudden jump up. If I understand correctly, the posterior is the likelihood of the observations given the internal state (presumably this should be similar across z<sub>3</sub>, z<sub>4</sub>, and z<sub>5</sub>), multiplied by the prior of the state. z<sub>3</sub> and z<sub>4</sub> are the dominant inferred states up to state 36. Why would z<sub>5</sub> become more likely if there doesn't appear to be any error? I'm inferring no error because there are little or no changes in weights on trial 36, most prominently no changes in z<sub>3</sub>, which is the dominant internal state in step 36. If there's little change in weights, or no errors, shouldn't the prior dominate the calculation of the posterior which would lead to z<sub>3</sub> and z<sub>4</sub> being most prominent at trial 36?

      We have explained how z<sub>5</sub> of the 12-month-old App<sup>NL-G-F</sup> was inferred in the reply to comment #16. Here, we explain the process underlying the rapid changes of the posterior of z<sub>3</sub>, z<sub>4</sub>, and z<sub>5</sub> from trial 35 to 36.

      During extinction, the mice inferred z<sub>3</sub> given the CS and the context in the absence of the US. In trial 35, they observed the context and the unsignaled shock in the absence of the CS. This reduced the likelihood of the CS under z<sub>3</sub> and thereby the posterior of z<sub>3</sub>, while relatively increasing the posterior of z<sub>4</sub>. The associative weight between the context and the US, w<sub>context</sub>, indeed increased in both z<sub>3</sub> and z<sub>4</sub>, but w<sub>context</sub> of z<sub>4</sub> was updated more than that of z<sub>3</sub> due to its higher posterior probability. At the beginning of trial 36, a new latent cause z<sub>5</sub> was inferred with a certain prior (see also the reply to comment #16), and w<sub>5</sub> = w<sub>0</sub>, where w<sub>0</sub> is the initial value of the weight. After normalizing the prior over latent causes, the emergence of z<sub>5</sub> reduced the prior probability of the other latent causes compared to the case where the prior of z<sub>5</sub> is 0. Since the CS was presented while the US was absent in trial 36, the likelihood of the CS and that of the US under z<sub>3</sub>, and especially z<sub>4</sub>, given the cues and w became lower than in the case in which z<sub>5</sub> has not been inferred yet. Consequently, the posterior of z<sub>5</sub> became salient (Figure 3D).
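The normalization step in this explanation can be illustrated with a small sketch in which the posterior over latent causes is the normalized product of prior and likelihood. The function name and all numbers are hypothetical and chosen only to show the mechanism; they are not values estimated from the mice.

```python
import numpy as np


def posterior_over_causes(prior, likelihood):
    """Posterior proportional to prior times likelihood, normalized to 1."""
    unnorm = np.asarray(prior, dtype=float) * np.asarray(likelihood, dtype=float)
    return unnorm / unnorm.sum()


# Order of entries: z3, z4, z5 (made-up values).
prior_without_z5 = [0.30, 0.70, 0.00]   # z5 not yet available
prior_with_z5 = [0.25, 0.60, 0.15]      # z5 takes a share of the prior
likelihood = [0.20, 0.10, 0.90]         # observation fits the new cause best

post_without = posterior_over_causes(prior_without_z5, likelihood)
post_with = posterior_over_causes(prior_with_z5, likelihood)
```

With these toy numbers, a modest prior for the new cause combined with a better fit to the observation is enough for it to take the largest share of the posterior.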

      To maintain consistency across panels, we used a uniform y-axis range. However, we acknowledge that this may make it harder to notice the changes of associative weights in Figure 3D. We have provided the subpanel in Figure 3D with a smaller y-axis limit to reveal the weight changes at trial 35 in Author response image 5.

      Author response image 5.

      Magnified view of w<sub>context</sub> and w<sub>CS</sub> in the last 3 trials in Figure 3D. The graph format is the same as in Figure 3D. The weight for the CS (w<sub>CS</sub>) and that for the context (w<sub>context</sub>) in each latent cause across trials 34 (test 2), 35 (unsignaled shock), and 36 (test 3) in 12-month-old App<sup>NL-G-F</sup> mice in Figure 3D are magnified in the upper and lower magenta boxes, respectively.

      [#19] (8) In Figure 4B - The figure legend didn't appear to indicate at which time points the DIs are plotted.

      We have amended the figure legend to indicate that DI between test 3 and test 1 is plotted.

      [#20] (9) Lines 301-303 state that the posterior probabilities of the acquisition internal states in the 12-month AD mice were much higher at test 1 and that this resulted in different levels of CR across the control and 12-month App group. This is shown in the Figure 4A supplement, but this is not apparent in Figure 3 panels C and D. Is the example shown in panel D not representative of the group? The CRs across the two examples in Figure 3 C and D look extremely similar at test 1. Furthermore, the posteriors of the internal states look pretty similar across the two groups for the first 4 trials. Both the App and control have substantial posterior probabilities for the acquisition period, I don't see any additional states at test 1. The pattern of states during acquisition looks strikingly similar across the two groups, whereas the weights of the stimuli are considerably different. I think it would help the authors to use an example that better represents what the authors are referring to, or provide data to illustrate the difference. Figure 4C partly shows this, but it's not very clear how strong the posteriors are for the 3rd state in the controls.

      Figure 3 serves as an example to explain the internal states in each group (see also the third paragraph in the reply to comment #14). Figure 4C to H showed the results from each sample for between-group comparison in selected features. Therefore, the results of direct comparisons of the parameter values and internal states between genotypes in Figure 3 are not necessarily the same as those in Figure 4. Both examples in Figure 3C and 3D inferred 2 latent causes during the acquisition. In terms of posterior till test 1 (trial 4), the two could be the same. However, such examples were not rare, as the proportion of the mice that inferred 2 latent causes during the acquisition was slightly lower than 50% in the control, and around 90% in the App<sup>NL-G-F</sup> mice (Figure 4C). The posterior probability of acquisition latent cause in test 1 showed a similar pattern (Figure 4 – figure supplement 3), with values near 1 in around 50% of the control mice and around 90% of the App<sup>NL-G-F</sup> mice.  

      [#21] (10) Line 320: This is a confusing sentence. I think the authors are saying that because the App group inferred a new state during test 3, this would protect the weights of the 'extinction' state as compared to the controls since the strength of the weight updates depends on the probability of the posterior.

      In order to address this, we have revised this sentence to “Such internal states in App<sup>NL-G-F</sup> mice would diverge the associative weight update from those in the control mice after extinction.” in the manuscript Line 349-351.
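The dependence of the weight update on the posterior, which the reviewer describes, can be sketched with a delta-rule update scaled by each latent cause's posterior probability. This is a minimal sketch under assumptions: the function name, array shapes, and numbers are hypothetical, and the exact update rule of the latent cause model should be taken from Gershman et al. (2017).

```python
import numpy as np


def update_weights(w, x, r, post, eta):
    """Posterior-weighted delta rule (illustrative sketch).

    w:    (K, D) associative weights, one row per latent cause.
    x:    (D,) stimulus vector, e.g. [CS, context].
    r:    observed US (1 if shock, 0 otherwise).
    post: (K,) posterior probability of each latent cause.
    eta:  learning rate.
    """
    w = np.asarray(w, dtype=float)
    x = np.asarray(x, dtype=float)
    post = np.asarray(post, dtype=float)
    delta = r - w @ x                      # prediction error per cause
    return w + eta * (post * delta)[:, None] * x[None, :]


# Context-only trial followed by a shock; the second cause is more probable.
w_new = update_weights(np.zeros((2, 2)), x=[0, 1], r=1, post=[0.2, 0.8], eta=0.5)
```

In this toy trial the context weight of the more probable cause moves four times as far as that of the less probable one, and the CS weights stay unchanged because the CS is absent.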

      [#22] (11) In lines 517-519 the authors address the difference in generalizing the occurrence of stimuli across the App and control groups. It states that App mice with lower alpha generalized observations to an old cause rather than attributing it as a new state. Going back to statement 3 above, I think it's important to show that the model fit of a reduction in alpha does not go hand-in-hand with a reduction in the learning rates and hence the weights. Again, if the likelihoods are diminished due to the low weights, then the fit of alpha might be reduced as well. To reiterate my point above, if the observations in changes in generalization and differentiation occur because of a reduction in the learning rate, the modeling may not be providing a particularly insightful understanding of AD, other than that poor learning leads to ineffectual generalization and differentiation. Do these findings hold up if the learning rates are more comparable across the control and App group?

      These findings were explained on the basis of comparable learning rates between control and App<sup>NL-G-F</sup> mice in the 12-month-old group (see the reply to comment #14). In addition, we have conducted simulations for different ⍺ and σ<sub>x</sub><sup>2</sup> values under the condition of a fixed learning rate, where overgeneralization and overdifferentiation still occurred (see the reply to comment #26).

      [#23] (12) Lines 391 - 393. This is a confusing sentence. "These results suggest that App NL-G-F mice could successfully form a spatial memory of the target hole, while the memory was less likely to be retrieved by a novel observation such as the absence of the escape box under the target hole at the probe test 1." The App mice show improved behavior across days of approaching the correct hole. Is this statement suggesting that once they've approached the target hole, the lack of the escape box leads to a reduction in the retention of that memory?

      We speculated that when the mice observed the absence of the escape box, a certain prediction error would be generated, which may have driven the memory modification. In App<sup>NL-G-F</sup> mice, such modification, either overgeneralization or overdifferentiation, could render the memory of the target hole vulnerable. If overgeneralization occurred, the memory would be quickly overwritten, as the goal no longer exists at this position in this maze; if overdifferentiation occurred, a novel memory would be formed in which the goal does not exist in a maze that is different from the previous one. In either case of misclassification, the probability of retrieving the goal position would be reduced. To reduce ambiguity in this sentence, we have revised the description in the manuscript Line 432-434 as follows: “These results suggest that App<sup>NL-G-F</sup> mice could successfully form a spatial memory of the target hole, while they did not retrieve the spatial memory of the target hole as strongly as control mice when they observed the absence of the escape box during the probe test.”

      [#24] (13) The connection between the results of Barnes maze and the fear learning paradigm is weak. How can changes in overgeneralization due to a reduction in the creation of inferred states and differentiation due to a reduced σx lead to the observations in the Barnes maze experiment?

      We extrapolated our interpretation of the reinstatement modeling to behaviors in a different behavioral task, to explore the explanatory power of the latent cause framework in formalizing mechanisms of associative learning and memory modification. Here, we explain the results of the reversal Barnes maze paradigm in terms of the latent cause model, while referring to the reinstatement paradigm.

      Whilst we acknowledge that fear conditioning and spatial learning are not fully comparable, the reversal Barnes maze paradigm used in our study shares several key learning components with the reinstatement paradigm. 

      First, associative learning is fundamental in spatial learning (Leising & Blaisdell, 2009; Pearce, 2009). Although we did not make any specific assumptions about what kind of associations were learned in the Barnes maze, performance improvements in the learning phases likely reflect trial-and-error updates of these associations involving sensory preconditioning or secondary conditioning. Second, the reversal training phases could resemble the extinction phase in the reinstatement paradigm, challenging previously established memory. In terms of the latent cause model, both the reversal learning phase in the reversal Barnes maze paradigm and the extinction phase in the reinstatement paradigm induce a mismatch of the internal state. This process likely introduces large prediction errors, triggering memory modification to reconcile competing memories.

      Under the latent cause framework, we posit that the mice would either infer new memories or modify existing memories for the unexpected observations in the Barnes maze (e.g., changed location or absence of escape box) as in the reinstatement paradigm, but learn a larger number of association rules between stimuli in the maze compared to those in the reinstatement. In the reversal Barnes maze paradigm, the animals would infer that a latent cause generates the stimuli in the maze at certain associative weights in each trial, and would adjust behavior by retaining competing memories.

      Both overgeneralization and overdifferentiation could explain the lower exploration time of the target hole in the App<sup>NL-G-F</sup> mice in probe test 1. In the case of overgeneralization, the mice would overwrite the existing spatial memory of the target hole with a memory that the escape box is absent. In the case of overdifferentiation, the mice would infer a new memory in which the goal does not exist in the novel field, in addition to the old memory in which the goal exists in the previous field. In both cases, the App<sup>NL-G-F</sup> mice would not infer that the location of the goal is fixed at a particular point and would fail to retain competing spatial memories of the goal, leading them to rely on a less precise, non-spatial strategy to solve the task.

      Since there is no established way to formalize the Barnes maze learning in the latent cause model, we did not directly apply the latent cause model to the Barnes maze data. Instead, we used the view above to explore common processes in memory modification between the reinstatement and the Barnes maze paradigm. 

      The above description was added to the manuscript on page 13 (Line 410-414) and page 19-20 (Line 600-602, 626-639).

      [#25] (14) In the fear conditioning task, it may be valuable to separate responding to the context and the cue at the time of the final test. The mice can learn about the context during the reinstatement, but there must be an inference to the cue as it's not present during the reinstatement phase. This would provide an opportunity for the model to perhaps access a prior state that was formed during acquisition. This would be more in line with the original proposal by Gershman et al. 2017 with spontaneous recovery.

      Please refer to the reply to comment #13 regarding separating the response to context in test 3.  

      Reviewer #3 (Public review):

      Summary:

      This paper seeks to identify underlying mechanisms contributing to memory deficits observed in Alzheimer's disease (AD) mouse models. By understanding these mechanisms, they hope to uncover insights into subtle cognitive changes early in AD to inform interventions for early-stage decline.

      Strengths:

      The paper provides a comprehensive exploration of memory deficits in an AD mouse model, covering the early and late stages of the disease. The experimental design was robust, confirming age-dependent increases in Aβ plaque accumulation in the AD model mice and using multiple behavior tasks that collectively highlighted difficulties in maintaining multiple competing memory cues, with deficits most pronounced in older mice.

      In the fear acquisition, extinction, and reinstatement task, AD model mice exhibited a significantly higher fear response after acquisition compared to controls, as well as a greater drop in fear response during reinstatement. These findings suggest that AD mice struggle to retain the fear memory associated with the conditioned stimulus, with the group differences being more pronounced in the older mice.

      In the reversal Barnes maze task, the AD model mice displayed a tendency to explore the maze perimeter rather than the two potential target holes, indicating a failure to integrate multiple memory cues into their strategy. This contrasted with the control mice, which used the more confirmatory strategy of focusing on the two target holes. Despite this, the AD mice were quicker to reach the target hole, suggesting that their impairments were specific to memory retrieval rather than basic task performance.

      The authors strengthened their findings by analyzing their data with a leading computational model, which describes how animals balance competing memories. They found that AD mice showed somewhat of a contradiction: a tendency to both treat trials as more alike than they are (lower α) and similar stimuli as more distinct than they are (lower σx) compared to controls.

      Weaknesses:

      While conceptually solid, the model struggles to fit the data and to support the key hypothesis about AD mice's ability to retain competing memories. These issues are evident in Figure 3:

      [#26] (1) The model misses key trends in the data, including the gradual learning of fear in all groups during acquisition, the absence of a fear response at the start of the experiment, the increase in fear at the start of day 2 of extinction (especially in controls), and the more rapid reinstatement of fear observed in older controls compared to acquisition.

      We acknowledge these limitations and explain why they arise in the latent cause model as follows.

      a. Absence of a fear response at the start of the experiment and the gradual learning of fear during acquisition 

      In the latent cause model, the CR is derived from a sigmoidal transformation from the predicted outcome with the assumption that its mapping to behavioral response may be nonlinear (see Equation 10 and section “Conditioned responding” in Gershman et al., 2017). 

      The magnitude of the unconditioned response (trial 1) is determined by w<sub>0</sub>, θ, and λ. An example was given in Appendix 2 – table 3. In general, a higher w<sub>0</sub> and a lower θ produce a higher trial 1 CR when other parameters are fixed. During the acquisition phase, once the expected shock exceeds θ, CR rapidly approaches 1, and further increases in expected shock produce few changes in CR. This rapid increase was also evident in the spontaneous recovery simulation (Figure 11) in Gershman et al. (2017). The steepness of this rapid increase is modulated by λ such that a higher value produces a shallower slope. This is a characteristic of the latent cause model, assuming CR follows a sigmoid function of expected shock, while the ordinal relationship over CRs is maintained with or without the sigmoid function, as Gershman et al. (2017) mentioned. If one assumes that the CR should be proportional to the expected shock, the model can reproduce the gradual response as a linear combination of w and posteriors of latent causes while omitting the sigmoid transformation (Figure 3). 
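To illustrate how θ and λ shape the response, the sketch below assumes that the CR is the probability that a Gaussian variable centred on the expected shock exceeds θ, with λ treated as its standard deviation. That functional form, the function name, and the numbers are assumptions made for illustration; Equation 10 in Gershman et al. (2017) gives the definition the model actually uses.

```python
import math


def conditioned_response(expected_shock, theta, lam):
    """CR = 1 - Phi((theta - V) / lam), with Phi the standard normal CDF."""
    z = (theta - expected_shock) / lam
    return 0.5 * (1.0 - math.erf(z / math.sqrt(2.0)))


theta = 0.3
# Expected shock just above threshold: a small lam gives a near-maximal CR,
# while a larger lam gives a shallower, more graded response.
steep = conditioned_response(0.4, theta, lam=0.05)
shallow = conditioned_response(0.4, theta, lam=0.5)
```

Under this assumed form the CR equals 0.5 exactly when the expected shock equals θ, which is why responses saturate quickly once the prediction crosses the threshold and λ is small.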

      b. Increase in fear at the start of day 2 extinction

      This point is partially reproduced by the latent cause model. As shown in Figure 3, trial 24 (the first trial of day 2 extinction) showed an increase in both the posterior probability of the latent cause retaining the fear memory and the simulated CRs in all groups except the 6-month-old control group, though the increase in CR was small due to the sigmoid transformation (see above). This can be explained by the latent cause model, as the 24 h time lapse between extinction 1 and 2 decreases the prior of the previously inferred latent cause, leading to an increase in those of other latent causes.

      Unlike other groups, the 6-month-old control did not exhibit an increased observed CR at trial 24 but at trial 25 (Figure 3A). The latent cause model failed to reproduce it, as there was no increase in posterior probability in trial 24 (Figure 3A). This could be partially explained by the low value of g, which counteracts the effect of the time interval between days: a lower g keeps the prior of the latent causes at the same level as in the previous trial. Despite some failures in capturing this effect, our fitting policy was set to optimize prediction among the test trials given our primary purpose of explaining reinstatement.

      c. More rapid reinstatement of fear observed in older controls compared to acquisition

      We would like to point out that this was replicated by the latent cause model as shown in Figure 3 – figure supplement 1C. The DI between test 3 and test 1 calculated from the simulated CR was significantly higher in 12-month-old control than in App<sup>NL-G-F</sup> mice (cf. Figure 2C to E).  

      [#27] (2) The model attributes the higher fear response in controls during reinstatement to a stronger association with the context from the unsignaled shock phase, rather than to any memory of the conditioned stimulus from acquisition. These issues lead to potential overinterpretation of the model parameters. The differences in α and σx are being used to make claims about cognitive processes (e.g., overgeneralization vs. overdifferentiation), but the model itself does not appear to capture these processes accurately. The authors could benefit from a model that better matches the data and that can capture the retention and recollection of a fear memory across phases.

      First, we would like to clarify that the latent cause model explains the reinstatement not only by the extinction latent cause with increased w<sub>context</sub> but also by the acquisition latent cause with preserved w<sub>CS</sub> and w<sub>context</sub> (see also the reply to comment #13). Second, the latent cause model primarily attributes the higher fear reinstatement in control mice to a lower number of latent causes inferred after extinction (Figure 4E) and a higher w<sub>context</sub> in the extinction latent cause (Figure 4G). We noted that there was a trend toward significance in the posterior probability of latent causes inferred after extinction (Figure 4E), which in turn influences those of the acquisition latent causes. Although the posterior probability of the acquisition latent cause appeared trivial and no significant difference was detected between control and App<sup>NL-G-F</sup> mice (Figure 4C), it was suppressed by new latent causes in App<sup>NL-G-F</sup> mice (Author response image 6).

      This indicates that App<sup>NL-G-F</sup> mice retrieved acquisition memory less strongly than control mice. Therefore, we argue that the latent cause model attributed a higher fear response in control during reinstatement not solely to the stronger association with the context but also to CS fear memory from acquisition. Although we tested whether additional models fit the reinstatement data in individual mice, these models did not satisfy our fitting criteria for many mice compared to the latent cause model (see also reply to comment #4 and #28).

      Author response image 6.

      Posterior probability of acquisition, extinction, and after extinction latent causes in test 3. The values within each bar indicate the mean posterior probability of acquisition latent cause (darkest shade), extinction latent cause (medium shade), and latent causes inferred after extinction (lightest shade) in test 3 over mice within genotype. Source data are the same as those used in Figure 4C–E (posterior of z).

      Conclusion:

      Overall, the data support the authors' hypothesis that AD model mice struggle to retain competing memories, with the effect becoming more pronounced with age. While I believe the right computational model could highlight these differences, the current model falls short in doing so.

      Reviewer #3 (Recommendations for the authors):

      [#28] Other computational models may better capture the data. Ideally, I'd look for a model that can capture the gradual learning during acquisition, and, in some mice, the inferring of a new latent cause during extinction, allowing the fear memory to be retained and referenced at the start of day 2 extinction and during later tests.

      We have further evaluated another computational model, the latent state model, and compared it with the latent cause model. The simulation of reinstatement and parameter estimation method of the latent state model were described in the Appendix.

      The latent state model proposed by Cochran and Cisler (2019) shares several concepts with the latent cause model, and replicates empirical data well under certain conditions. We expected that it could also explain the reinstatement.

      Following the same analysis flow as for the latent cause model, we estimated the parameters and simulated reinstatement in the latent state model from individual CRs and from their median. In the median freezing rate data of the 12-month-old control mice, the simulated CR replicated the observed CR well and exhibited the ideal features that the reviewer looked for: gradual learning during acquisition and increased fear at the start of the second-day extinction (Appendix 1 – figure 1G). However, many samples did not fit the latent state model well. The number of anomalies was generally higher than that in the latent cause model (Appendix 1 – figure 2). Within the accepted samples, the sum of squared prediction error over all trials was significantly lower in the latent state model, which resulted from lower prediction error in the acquisition trials (Appendix 1 – figure 4A and 4B). In the three test trials, the squared prediction error was comparable between the latent state model and the latent cause model except for the test 2 trials in the control group (Appendix 1 – figure 4A and 4B, rightmost panel). On the other hand, almost all accepted samples continued to infer the acquisition latent states during extinction without inferring new states (Appendix 1 – figure 5B and 5E, left panel), which differed from the ideal internal states the reviewer expected. Although the fit of the latent state model seems to be better than that of the latent cause model, the accepted samples cannot reproduce the lower DI between test 3 and test 1 in aged App<sup>NL-G-F</sup> mice (Appendix 1 – figure 6C). These results make the latent state model less suitable for our purpose, and we therefore decided to stay with the latent cause model. It should also be noted that we did not explore the full parameter space of the latent state model; hence we cannot rule out the possibility that alternative parameter sets could provide a better fit and explain the memory modification process well. A more comprehensive parameter search in the latent state model may be a valuable direction for future research.

      If you decide not to go with a new model, my preference would be to drop the current modeling. However, if you wish to stay with the current model, I'd like to see justification or acknowledgment of the following:

      [#29] (1) Lower bound on alpha of 1: This forces the model to infer new latent causes, but it seems that some mice, especially younger AD mice, might rely more on classical associative learning (e.g., Rescorla-Wagner) rather than inferring new causes.

      We acknowledge that the default value set in Gershman et al. (2017) is 0.1, and the constraint we set is a much higher value. However, ⍺ = 1 does not always force the model to infer new latent causes.

      In the standard form of the Chinese restaurant process (CRP), the prior probability that the n<sup>th</sup> observation is assigned to a new cluster is given by ⍺ / (n - 1 + ⍺) (Blei & Gershman, 2012). When ⍺ = 1, the prior of a new cluster for the 2nd observation is 0.5; when ⍺ = 3, this prior increases to 0.75. Thus, when ⍺ > 1, the prior of a new cluster is above chance early in the sequence, which may relate to the reviewer’s concern. However, this effect diminishes as the number of observations increases. For instance, the prior of a new cluster drops to 0.1 and 0.25 for the 10th observation when ⍺ = 1 and 3, respectively. Furthermore, the prior in the latent cause model is governed not only by ⍺ but also by g, a scaling parameter for the temporal difference between successive observations (see Results in the manuscript), following the “distance-dependent” CRP, and is then normalized over all latent causes including a new latent cause. Thus, ⍺ greater than 1 does not necessarily force agents to infer a new latent cause. As shown in Appendix 2 – table 4, the number of latent causes does not inflate in each trial when ⍺ = 1. On the other hand, the high number of latent causes due to ⍺ = 2 can be suppressed when g = 0.01. More importantly, the driving force is the prediction error generated in each trial (see also comment #31 about the interaction between ⍺ and σ<sub>x</sub><sup>2</sup>). Raising the value of ⍺ per se can be viewed as increasing the probability of inferring a new latent cause, not as forcing the model to do so.
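
      The arithmetic for the standard CRP prior can be checked with a short sketch. The function name below is illustrative only, and the code covers only the standard form ⍺ / (n - 1 + ⍺); it does not include the distance-dependent scaling by g or the normalization used in the latent cause model.

```python
def crp_new_cluster_prior(n: int, alpha: float) -> float:
    """Prior that the n-th observation opens a new cluster in the
    standard CRP: alpha / (n - 1 + alpha).

    Illustrative helper only; not the latent cause model's prior,
    which additionally depends on the temporal scaling parameter g.
    """
    if n < 1:
        raise ValueError("n must be a positive integer")
    return alpha / (n - 1 + alpha)


# Reproduce the values quoted in the text for the 2nd and 10th observation.
for alpha in (1.0, 3.0):
    for n in (2, 10):
        prior = crp_new_cluster_prior(n, alpha)
        print(f"alpha={alpha:g}, n={n}: prior of a new cluster = {prior:.2f}")
```

      The prior shrinks as n grows for any fixed ⍺, which is the point made above: a larger ⍺ raises the prior of a new cluster but does not by itself force one.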

      During parameter exploration using the median behavioral data under a wider range of ⍺ with a lower boundary at 0.1, the estimated value eventually exceeded 1. Therefore, we set the lower bound of ⍺ to 1 to reduce inefficient sampling.

      [#30] (2) Number of latent causes: Some mice infer nearly as many latent causes as trials, which seems unrealistic.

      We set the upper boundary for the maximum number of latent causes (K) to be 36 to align with the infinite features of CRP. This allowed some mice to infer more than 20 latent causes in total. When we checked the learning curves in these mice, we found that they largely fluctuated or did not show clear decreases during the extinction (Author response image 7, colored lines). The simulated learning curves were almost flat in these trials (Author response image 7, gray lines). It might be difficult to estimate the internal states of such atypical mice if the sampling process tried to fit them by increasing the number of latent causes. Nevertheless, most of the samples have a reasonable total number of latent causes: 12-month-old control mice, Mdn = 5, IQR = 4; 12-month-old App<sup>NL-G-F</sup> mice, Mdn = 5, IQR = 1.75; 6-month-old control mice, Mdn = 7, IQR = 12.5; 6-month-old App<sup>NL-G-F</sup> mice, Mdn = 5, IQR = 5.25. These data were provided in Tables S9 and S12.  

      Author response image 7.

      Samples with a high number of latent causes. Observed CR (colored line) and simulated CR (gray line) for individual samples with a total number of inferred latent causes exceeding 20. 

      [#31] (3) Parameter estimation: With 10 parameters fitting one-dimensional curves, many parameters (e.g., α and σx) are likely highly correlated and poorly identified. Consider presenting scatter plots of the parameters (e.g., α vs σx) in the Supplement.

      We have provided the scatter plots with a correlation matrix in Figure 4 – figure supplement 1 for the 12-month-old group and Figure 5 – figure supplement 1 for the 6-month-old group. As pointed out by the reviewer, there are significant rank correlations between parameters including ⍺ and σ<sub>x</sub><sup>2</sup> in both the 6 and 12-month-old groups. However, we also noted that there are no obvious linear relationships between the parameters.

      The correlation above raises a potential problem of non-identifiability among parameters. First, we computed the variance inflation factor (VIF) for all parameters to examine the risk of multicollinearity, though we did not consider a linear regression between parameters and DI in this study. All VIF values were below the conventional threshold of 10 (Appendix 2 – table 5), suggesting that severe multicollinearity is unlikely to bias our conclusions. Second, we conducted simulations with different combinations of ⍺, σ<sub>x</sub><sup>2</sup>, and K to clarify their contribution to the overgeneralization and overdifferentiation observed in the 12-month-old group.

      In Appendix 2 – table 6, the values of ⍺ and σ<sub>x</sub><sup>2</sup> were set to either their upper or lower boundary from parameter estimation, while the value of K was selected heuristically to demonstrate its effect. Given the observed positive correlation between ⍺ and σ<sub>x</sub><sup>2</sup>, and their negative correlation with K (Figure 4 - figure supplement 1), we considered the product of K = {4, 35}, ⍺ = {1, 3} and σ<sub>x</sub><sup>2</sup> = {0.01, 3}. Among these combinations, the representative condition for the control group is ⍺ = 3, σ<sub>x</sub><sup>2</sup> = 3, and that for the App<sup>NL-G-F</sup> group is ⍺ = 1, σ<sub>x</sub><sup>2</sup> = 0.01. In the latter condition, overgeneralization and overdifferentiation were strongly induced, as shown by higher test 1 CR, a lower number of acquisition latent causes (K<sub>acq</sub>), lower test 3 CR, lower DI between test 3 and test 1, and a higher number of latent causes after extinction (K<sub>rem</sub>).
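
      As a rough sketch of how such a condition grid can be enumerated, the snippet below builds the product of the three value sets listed in this paragraph and marks which conditions agree with the empirical positive correlation between ⍺ and σ<sub>x</sub><sup>2</sup> (both high or both low). The variable and function names are hypothetical, and the snippet only lists conditions; it does not run the latent cause model.

```python
from itertools import product

# Value sets as listed in the paragraph above.
K_VALUES = (4, 35)
ALPHA_VALUES = (1, 3)
SIGMA_X2_VALUES = (0.01, 3)

# Full factorial grid: 2 x 2 x 2 = 8 simulation conditions.
conditions = [
    {"K": k, "alpha": a, "sigma_x2": s}
    for k, a, s in product(K_VALUES, ALPHA_VALUES, SIGMA_X2_VALUES)
]


def matches_empirical_correlation(cond: dict) -> bool:
    """True when alpha and sigma_x2 are both at their upper bound or
    both at their lower bound (the direction of the observed correlation)."""
    alpha_high = cond["alpha"] == max(ALPHA_VALUES)
    sigma_high = cond["sigma_x2"] == max(SIGMA_X2_VALUES)
    return alpha_high == sigma_high


for cond in conditions:
    tag = "within" if matches_empirical_correlation(cond) else "outside"
    print(cond, f"-> {tag} the empirical correlation")
```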

      We found that conditions falling outside the empirical correlation, such as ⍺ = 3, σ<sub>x</sub><sup>2</sup> = 0.01, also reproduced overgeneralization and overdifferentiation. Similarly, the combination ⍺ = 1, σ<sub>x</sub><sup>2</sup> = 3 exhibited control-like behavior when K = 4 but shifted toward App<sup>NL-G-F</sup>-like behavior when K = 36. The effect of K was also evident when ⍺ = 3 and σ<sub>x</sub><sup>2</sup> = 3, where K = 36 led to overdifferentiation. We note that these conditions were artificially set and are likely not biologically plausible. These results underscore the non-identifiability concern raised by the reviewer. Therefore, we acknowledge that merely attributing overgeneralization to lower ⍺ or overdifferentiation to lower σ<sub>x</sub><sup>2</sup> may be overly reductive. Instead, these patterns likely arise from the joint effect of ⍺, σ<sub>x</sub><sup>2</sup>, and K. We have revised the manuscript accordingly in Results and Discussion (page 11-13, 18-19).

      [#32] (4) Data normalization: Normalizing the data between 0 and 1 removes the interpretability of % freezing, making mice with large changes in freezing seem similar to mice with small changes.

      As we describe in our reply to comment #26, the conditioned response in the latent cause model was scaled between 0 and 1, and we assume that 0 and 1 represent the minimal and maximal CR within each mouse, respectively. Furthermore, although we initially tried to fit simulated CRs to raw CRs, we found that the quality of fit was low because of individual differences in the degree of behavioral expression: some mice exhibited a larger range of CR, while others showed a narrower one. Thus, we decided to normalize the data. We agree that this processing makes mice with large changes in % freezing indistinguishable from those with small changes. However, the relative changes in % freezing within each mouse were preserved, and the normalization did not affect the discrimination index.
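
      A minimal sketch of within-mouse scaling, assuming plain min-max normalization (the exact procedure is described in the manuscript, not here). The function name and the freezing values are hypothetical and serve only to illustrate the trade-off discussed above.

```python
def normalize_within_mouse(freezing):
    """Min-max scale one mouse's freezing rates so that its own minimum
    maps to 0 and its own maximum maps to 1 (assumed normalization)."""
    lo, hi = min(freezing), max(freezing)
    if hi == lo:
        # A flat curve carries no within-mouse change; map it to zeros.
        return [0.0 for _ in freezing]
    return [(x - lo) / (hi - lo) for x in freezing]


# Hypothetical % freezing values, not data from the study.
mouse_wide = [20.0, 60.0, 40.0]    # large absolute changes
mouse_narrow = [40.0, 50.0, 45.0]  # small absolute changes

print(normalize_within_mouse(mouse_wide))
print(normalize_within_mouse(mouse_narrow))
```

      Both hypothetical mice map to the same normalized curve: the shape of the within-mouse change is kept, while the absolute magnitude of % freezing is lost.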

      [#33] (5) Overlooking parameter differences: Differences in parameters, like w<sub>0</sub>, that didn't fit the hypothesis may have been ignored.

      Our initial hypothesis was that internal states were altered in App<sup>NL-G-F</sup> mice, and we did not have a specific hypothesis on which parameter would contribute to such a state. We mainly focused on the parameters (1) that were significantly different between control and App<sup>NL-G-F</sup> mice and (2) that were significantly correlated with the empirical behavioral data, the DI between test 3 and test 1.

      In the 12-month-old group, besides ⍺ and σ<sub>x</sub><sup>2</sup>, w<sub>0</sub> and K showed marginal p-values in the Mann-Whitney U test (Table S7) and moderate correlations with the DI (Table S8). While differences in K were already discussed in the manuscript, we overlooked in the previous manuscript that w<sub>0</sub> could contribute to the differences in w between control and App<sup>NL-G-F</sup> mice (Figure 4G). We explain the contribution of w<sub>0</sub> to the reinstatement results here. When other parameters are fixed, higher w<sub>0</sub> leads to higher CR in test 3, because higher w<sub>0</sub> allows w<sub>context</sub> to be increased by the unsignaled shock, leading to reinstatement (Appendix 2 – table 7). It is likely that higher w<sub>0</sub> was sampled through the parameter estimation in the 12-month-old control but not in App<sup>NL-G-F</sup> mice. On the other hand, the number of latent causes is not sensitive to w<sub>0</sub> when other parameters are fixed at the initial guess value (Appendix 2 – table 1), suggesting that w<sub>0</sub> makes a small contribution to the memory modification process.

      Thus, we speculate that although the difference in w<sub>0</sub> between control and App<sup>NL-G-F</sup> mice may arise from the sampling process, resulting in a positive correlation with DI between test 3 and test 1, its contribution to diverged internal states would be smaller relative to α or σ<sub>x</sub><sup>2</sup> as a wide range of w<sub>0</sub> has no effect on the number of latent causes (Appendix 2 – table 7). We have added the discussion of differences in w<sub>0</sub> in the 12-month-old group in manuscript Line 357-359.

      In the 6-month-old group, besides ⍺ and σ<sub>x</sub><sup>2</sup>, 𝜃 is significantly higher in the AD mice group (Table S10) but not correlated with the DI (Table S11). We have already discussed this point in the manuscript.  

      [#34] (6) Initial response: Higher initial responses in the model at the start of the experiment may reflect poor model fit.

      Please refer to our reply to comment #26 for our explanation of what contributes to high initial responses in the latent cause model.

      In addition, achieving a good fit for the acquisition CRs was not our primary purpose, as the response measured in the acquisition phase includes not only a conditioned response to the CS and context but also an unconditioned response to the novel stimuli (CS and US). This mixed response presumably increased the variance of the measured freezing rate over individuals, therefore we did not cover the results in the discussion.

      Rather, we favor models that at least replicate the establishment of conditioning, extinction and reinstatement of fear memory in order to explain the memory modification process. As we mentioned in the reply to comment #4, alternative models, the latent state model and the Rescorla-Wagner model, failed to replicate the observation (cf. Figure 3 – figure supplement 1A-1C). Thus, we chose to stay with the latent cause model as it aligns better with the purpose of this study.

      [#35] In addition, please be transparent if data is excluded, either during the fitting procedure or when performing one-way ANCOVA. Avoid discarding data when possible, but if necessary, provide clarity on the nature of excluded data (e.g., how many, why were they excluded, which group, etc?).

      We clarify the information of excluded data as follows. We had 25 mice for the 6-month-old control group, 26 mice for the 6-month-old App<sup>NL-G-F</sup> group, 29 mice for the 12-month-old control group, and 26 mice for the 12-month-old App<sup>NL-G-F</sup> group (Table S1). 

      Our first exclusion procedure was applied to the freezing rate data in the test phase. If a mouse had a freezing rate outside the 1.5 × IQR range in any of the test phases, it was regarded as an outlier and removed from the analysis (see Statistical analysis in Materials and Methods). One mouse in the 6-month-old control group, one mouse in the 6-month-old App<sup>NL-G-F</sup> group, five mice in the 12-month-old control group, and two mice in the 12-month-old App<sup>NL-G-F</sup> group were excluded.
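
      For readers unfamiliar with this kind of rule, the sketch below shows one common form of it, assuming Tukey fences (values below Q1 − 1.5 × IQR or above Q3 + 1.5 × IQR are flagged) and the "inclusive" quartile method of Python's statistics module. These choices, the function name, and the example values are assumptions for illustration; the exact definition used in the study is the one given in Materials and Methods.

```python
from statistics import quantiles


def flag_iqr_outliers(values, k=1.5):
    """Return one boolean per value: True if the value lies outside
    [Q1 - k*IQR, Q3 + k*IQR] (Tukey fences; assumed rule)."""
    q1, _median, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return [not (lower <= v <= upper) for v in values]


# Hypothetical freezing rates (%) for one group in one test phase.
test_freezing = [10, 12, 11, 13, 12, 11, 10, 13, 12, 90]
flags = flag_iqr_outliers(test_freezing)
kept = [v for v, is_out in zip(test_freezing, flags) if not is_out]
print(flags)
print(kept)
```

      Under the criterion described above, a mouse would be removed if it were flagged in any of the test phases, so in practice the per-phase flags would be combined with a logical OR across tests.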

      Our second exclusion procedure was applied during the fitting and parameter estimation (see parameter estimation in Materials and Methods). We have provided the number of anomaly samples during parameter estimation in Appendix 1 – figure 2.   

      Lastly, we would like to state that all the sample sizes written in the figure legends do not include outliers detected through the exclusion procedure mentioned above.

      [#36] Finally, since several statistical tests were used and the differences are small, I suggest noting that multiple comparisons were not controlled for, so p-values should be interpreted cautiously.

      We have provided power analyses in Tables S21 and S22 with methods described in the manuscript (Line 897-898) and added a note that not all of the multiple comparisons were corrected for in the manuscript (Line 898-899).

      References cited in the response letter only 

      Bellio, T. A., Laguna-Torres, J. Y., Campion, M. S., Chou, J., Yee, S., Blusztajn, J. K., & Mellott, T. J. (2024). Perinatal choline supplementation prevents learning and memory deficits and reduces brain amyloid Aβ42 deposition in App<sup>NL-G-F</sup> Alzheimer’s disease model mice. PLOS ONE, 19(2), e0297289. https://doi.org/10.1371/journal.pone.0297289

      Blei, D. M., & Frazier, P. I. (2011). Distance Dependent Chinese Restaurant Processes. Journal of Machine Learning Research, 12(74), 2461–2488.

      Cochran, A. L., & Cisler, J. M. (2019). A flexible and generalizable model of online latent-state learning. PLOS Computational Biology, 15(9), e1007331. https://doi.org/10.1371/journal.pcbi.1007331

      Curiel Cid, R. E., Crocco, E. A., Duara, R., Vaillancourt, D., Asken, B., Armstrong, M. J., Adjouadi, M., Georgiou, M., Marsiske, M., Wang, W., Rosselli, M., Barker, W. W., Ortega, A., Hincapie, D., Gallardo, L., Alkharboush, F., DeKosky, S., Smith, G., & Loewenstein, D. A. (2024). Different aspects of failing to recover from proactive semantic interference predicts rate of progression from amnestic mild cognitive impairment to dementia. Frontiers in Aging Neuroscience, 16. https://doi.org/10.3389/fnagi.2024.1336008

      Giustino, T. F., Fitzgerald, P. J., Ressler, R. L., & Maren, S. (2019). Locus coeruleus toggles reciprocal prefrontal firing to reinstate fear. Proceedings of the National Academy of Sciences, 116(17), 8570–8575. https://doi.org/10.1073/pnas.1814278116

      Gu, X., Wu, Y.-J., Zhang, Z., Zhu, J.-J., Wu, X.-R., Wang, Q., Yi, X., Lin, Z.-J., Jiao, Z.-H., Xu, M., Jiang, Q., Li, Y., Xu, N.-J., Zhu, M. X., Wang, L.-Y., Jiang, F., Xu, T.-L., & Li, W.-G. (2022). Dynamic tripartite construct of interregional engram circuits underlies forgetting of extinction memory. Molecular Psychiatry, 27(10), 4077–4091. https://doi.org/10.1038/s41380-022-01684-7

      Lacagnina, A. F., Brockway, E. T., Crovetti, C. R., Shue, F., McCarty, M. J., Sattler, K. P., Lim, S. C., Santos, S. L., Denny, C. A., & Drew, M. R. (2019). Distinct hippocampal engrams control extinction and relapse of fear memory. Nature Neuroscience, 22(5), 753–761. https://doi.org/10.1038/s41593-019-0361-z

      Loewenstein, D. A., Curiel, R. E., Greig, M. T., Bauer, R. M., Rosado, M., Bowers, D., Wicklund, M., Crocco, E., Pontecorvo, M., Joshi, A. D., Rodriguez, R., Barker, W. W., Hidalgo, J., & Duara, R. (2016). A Novel Cognitive Stress Test for the Detection of Preclinical Alzheimer’s Disease: Discriminative Properties and Relation to Amyloid Load. The American Journal of Geriatric Psychiatry : Official Journal of the American Association for Geriatric Psychiatry, 24(10), 804–813. https://doi.org/10.1016/j.jagp.2016.02.056

      Loewenstein, D. A., Greig, M. T., Curiel, R., Rodriguez, R., Wicklund, M., Barker, W. W., Hidalgo, J., Rosado, M., & Duara, R. (2015). Proactive Semantic Interference Is Associated With Total and Regional Abnormal Amyloid Load in Non-Demented Community-Dwelling Elders: A Preliminary Study. The American Journal of Geriatric Psychiatry : Official Journal of the American Association for Geriatric Psychiatry, 23(12), 1276–1279. https://doi.org/10.1016/j.jagp.2015.07.009

      Valles-Salgado, M., Gil-Moreno, M. J., Curiel Cid, R. E., Delgado-Álvarez, A., Ortega-Madueño, I., Delgado-Alonso, C., Palacios-Sarmiento, M., López-Carbonero, J. I., Cárdenas, M. C., Matías-Guiu, J., Díez-Cirarda, M., Loewenstein, D. A., & Matias-Guiu, J. A. (2024). Detection of cerebrospinal fluid biomarkers changes of Alzheimer’s disease using a cognitive stress test in persons with subjective cognitive decline and mild cognitive impairment. Frontiers in Psychology, 15. https://doi.org/10.3389/fpsyg.2024.1373541

      Zaki, Y., Mau, W., Cincotta, C., Monasterio, A., Odom, E., Doucette, E., Grella, S. L., Merfeld, E., Shpokayte, M., & Ramirez, S. (2022). Hippocampus and amygdala fear memory engrams reemerge after contextual fear relapse. Neuropsychopharmacology, 47(11), 1992–2001. https://doi.org/10.1038/s41386-022-01407-0

    1. Author Response

      On behalf of my co-authors, I thank you very much for sending our manuscript (# eLifeRP-RA-2023-91223) entitled “Elimination of subtelomeric repeat sequences exerts little effect on telomere functions in Saccharomyces cerevisiae” for review and providing us an opportunity for revision. We also thank the reviewers for their critical and constructive comments and suggestions which have helped us to strengthen our study. We have performed more experiments to address the concerns the reviewers raised, and we have also revised or corrected some of our statements as the reviewers suggested.

      Reviewer #1

      1) The author’s data indicate that cells with many chromosomes are more dependent on possibly homologous recombination than SY12 cells with three chromosomes. Telomerase-deficient cells exhibit the type I and type II telomere structures, whereas telomerase-deficient SY12 cells often generate different telomere structures (named Type X survivors or atypical survivors). Type I survivor depends on Rad51 possessing tandem Y' elements whereas Type II survivor depends on Rad59 carrying long TG sequences (line 60-70). Both types require Rad52 (line 66-70). At the moment, it is not determined how Type X or atypical survivors are generated in telomerase-deficient SY12 cells.

      The authors need to determine whether Type X or atypical survivors depend on other repair pathways from Type I and Type II, and what DNA sequences are retained adjacent to telomeres in Type X or atypical survivors by sequencing analysis (Fig. 2).

      We thank the reviewer for the valuable comments and suggestions. The atypical survivor is a subtype of survivor that exhibits non-uniform telomere patterns, distinct from those observed in Type I, Type II, Type X, or circular survivors. To further determine its genetic requirements, we deleted RAD52 in SY12 tlc1Δ, SY12YΔ tlc1Δ, SY12XYΔ tlc1Δ, and SY12XYΔ+Y tlc1Δ strains. Southern blotting results showed that neither Type I nor Type II survivors were found in this series of strains. Circular survivors predominated, and besides circular survivors, some survivors exhibited non-uniform telomere patterns, suggesting that they were atypical survivors. These results have been presented as Figure 2—figure supplement 6B, Figure 5—figure supplement 2B and Figure 6—figure supplement 4B in the revised version. The results showed that atypical survivors still emerged when the Rad52 pathway was repressed, indicating that the formation of atypical survivors does not strictly rely on homologous recombination.

      Given that "atypical" clones exhibit non-uniform telomere patterns, it is not surprising that their chromosome structures are variable and complex. Consequently, it is hard for us to amplify and sequence the DNA sequences retained adjacent to telomeres.

      Since no Type X survivor was detected in the SY12 tlc1Δ rad52Δ strain (Author response image 1A), we deleted RAD50 or RAD51 in the SY12 tlc1Δ strain to investigate which pathway the formation of the Type X survivor relies on. The results showed that the Type X survivor emerged in the absence of Rad51 but not in the absence of Rad50, suggesting that the formation of the Type X survivor depends on the Rad50 pathway. These results have been presented as Figure 2—figure supplement 7.

      To determine the chromosomal end structure of the Type X survivor, we randomly selected a typical Type X survivor and performed PCR-sequencing analysis. The results revealed intact chromosome ends for I-L, X-R, XIII-L, XI-R, and XIV-R, albeit with some mismatches compared with the S. cerevisiae S288C genome, which possibly arose from recombination events that occurred during survivor formation. Notably, the sequence of the Y’-element in XVI-L could not be detected, while the X-element remained intact. These results are presented as Figure 2—figure supplement 5 in the revised manuscript.

      2) Survivor generation of each type (Type I, Type II, Type X or atypical and circularization) needs to be accurately quantitated. The authors concluded that X or Y' elements are not strictly necessary for survivor formation (Fig. 5 and Fig. 6). However, their removal appears to increase atypical survivor and chromosome circularization (Fig. 2 vs Fig. 5 and 6).

      We are grateful for the reviewer’s critical and constructive suggestions. According to the reviewer’s requirement, we quantified each type of survivors in SY12 tlc1Δ, SY12YΔ tlc1Δ, SY12XYΔ tlc1Δ and SY12XYΔ+Y tlc1Δ strains (Figure 2D, 5C, 6A and 6B). In SY12 tlc1Δ strain, Type I survivors accounted for 16%, Type II survivors for 2%, Type X survivors for 24%, circular survivors for 20% and atypical survivors for 38%. In SY12YΔ tlc1Δ strain, 4% were Type II survivors, 52% were circular survivors and 44% were atypical survivors.

      For the SY12XYΔ tlc1Δ strain, 8% were Type II survivors, 48% were circular survivors and 44% were atypical survivors. In SY12XYΔ+Y tlc1Δ strain, the proportions of Type II, circular and atypical survivors were 14%, 44%, and 42%, respectively (Author response image 1).

      In comparing SY12YΔ with SY12XYΔ, we observed a similar ratio of circular and atypical survivors. This result indicates that the removal of X-elements exerts little effect on the formation of circular and atypical survivors. Similarly, in the SY12XYΔ+Y strain, the proportions of circular and atypical survivors were comparable to those in the SY12XYΔ strain, indicating that Y’-elements also have little effect on the formation of circular and atypical survivors. However, due to the unknown frequency of survivor formation, alternative explanations of these data are possible. For example, subtelomeric elements previously suggested to have no impact on the formation of any survivor types might influence every type to similar extents, leading to similar ratios across all survivor types. With our present data, it is still unclear whether the absence of X and Y'-elements enhances the formation of circular and atypical survivors. Therefore, we did not present these results in the revised manuscript.

      Author response image 1.

      Quantitation of each survivor type in SY12 subtelomere-engineered strains. The ratio of survivor types in SY12 tlc1Δ, SY12YΔ tlc1Δ, SY12XYΔ tlc1Δ and SY12XYΔ+Y tlc1Δ strains. Type I, purple; Type II, green; Type X, gray; atypical survivor, orange; circular survivor, blue.

      3) The authors asked whether X and Y' elements are required for cell proliferation, stress response, telomere length control and telomere silencing (Fig. 4). Similar studies have been previously carried out by using synthetic chromosomes (see PMID: 28300123). The authors need to discuss this point.

      Thank you for the suggestion. We have added this information in the revised version (p.24 line 449-453).

      4) The Fig. 7 data support that circular chromosomes do not require Ku-dependent DNA end protection. This is consistent with the current view that Ku binds and protects DNA ends. This finding by itself does not contribute significantly to our understanding of telomere maintenance. The authors need to more extensively discuss the significance of their findings in SY12 cells compared to wild-type cells with 16 chromosomes.

      We agree with the logic that the reviewer has pointed out. Our results demonstrate that combinatorial deletion of YKU70 and TLC1 caused synthetic lethality in SY12 cells, which possess three linear chromosomes. However, it did not affect the viability of "circular survivors", supporting the notion that telomere deprotection leads to the synthetic lethality in yku70Δ tlc1Δ double mutants. Nevertheless, this conclusion merely confirms the current view observed in wild-type cells that Ku binds and protects DNA ends.

      To avoid confusing readers and maintain the logical flow of the manuscript, we have deleted this section in the revised version.

      Minor issues:

      1) Line 112-113: " for SY13, which contains two chromosomes, could also have a high probability of circularizing all chromosomes for survival": The reference or the supplemental data are required.

      We thank the reviewer for the suggestion. Following the reviewer’s comment, we performed a Southern blotting assay to examine the types of survivors in the SY13 tlc1Δ strain. We found that the majority of SY13 tlc1Δ clones exhibited hybridization signals similar to those of SY14 tlc1Δ circular survivors, pointing to the possibility that the two chromosomes in these survivors may undergo intra-chromosomal fusions. This result has been added to figure 1D in the revised version.

      2) Line 349-350: The BY4742 mre11Δ haploid strain serves as a negative control. The authors need to explain why mre11 cells serve as a negative control.

      We thank the reviewer for the comment. We employed mre11Δ as a negative control because Mre11 is a member of the RAD52 epistasis group, which is involved in the repair of double-stranded breaks in DNA, and mutants in MRE11 exhibit defects in the repair of DNA damage caused by DNA-damaging drugs (Krogh and Symington, 2004; Lewis et al., 2004; Symington, 2002). (p.23 line 420-422)

      Reviewer #2

      1) The qualification of survivor types mostly relies on molecular patterns in Southern blots. While this is a valid method for a standard strain, it might be more difficult to apply to the strains used in this study. For example, in SY8, SY11 and SY12, the telomere signal at 1-1.2 kb can be very faint due to the small number of terminal Y' elements left. As another example, for the Y'-less strain, it might seem obvious that no Type I survivor can emerge given that Y' amplification is a signature of Type I, but maybe Type-I-specific molecular mechanisms might still be used. To reinforce the characterization of survivor types, an analysis of the genetic requirements for Type I and Type II survivors (e.g. RAD51, RAD54, RAD59, RAD50) could complement the molecular characterization in specific result sections.

      We thank this reviewer for his/her constructive comments and suggestions. To investigate whether Type-I-specific molecular mechanisms are still utilized in the survivor formation in Y'-less strain, we deleted RAD51 in SY12XYΔ tlc1Δ. SY12XYΔ tlc1Δ rad51Δ strain was able to generate three types of survivors, including Type II survivor, circular survivor and atypical survivor, similar to the observations in SY12XYΔ tlc1Δ strain. However, the ratios of circular and atypical survivors were 36% and 32%, respectively, lower than the 48% and 44% observed in SY12XYΔ tlc1Δ strain (supplementary file 5). This result indicates that Type-I-specific molecular mechanisms contribute to the survivor formation. Given that our work primarily focuses on the function of subtelomeric elements, we chose not to include this result in our revised manuscript to maintain a coherent logical flow.

      To reinforce the characterization of survivor types, we deleted RAD50, RAD51 and RAD52 individually in the SY12 tlc1Δ strain. Southern blotting assays revealed that in the absence of Rad51, no Type I survivor was detected; in the absence of Rad50, neither Type I nor Type X survivor was detected. However, circular and atypical survivors still emerged in the absence of Rad52, suggesting that RAD52-mediated homologous recombination is not strictly necessary for the formation of circular and atypical survivors. These results have been presented as Figure 2—figure supplement 6 and Figure 2—figure supplement 7.

      2) In the title, the abstract and throughout the discussion, the authors chose to focus on the effect of X- and Y'-element deletion on different phenotypes and on survivor formation, as the main message to convey. While it is a legitimate and interesting message, other important results of this work might benefit from more spotlight. Namely, the observation that strains with different chromosome numbers show different survivor patterns and that several survival strategies beyond Type I and II exist and can reach substantial frequencies depending on the chromosomal context.

      Thanks for your valuable suggestion. While we value your suggestion to highlight additional aspects of our work, we would like to express our perspective on the current emphasis on the effect of X- and Y'-element deletion. We believe that by maintaining this focus, we can present a more coherent and impactful narrative for our readers. Additionally, we recognize that the relationship between chromosome numbers and survivor type frequencies is complex and warrants further experimental validation. We are considering exploring this aspect in more detail in our future projects. However, we fully acknowledge the importance of the observations you raised concerning strains with different chromosome numbers and the diversity of survival strategies.

      3) In SY12 strain, while X- and Y'-elements are not essential for survivor emergence, they do modulate the frequency of each type of survivors, with more chromosome circularization events observed for SY12YΔ, SY12XYΔ and SY12XYΔ+Y strains. This result should be stated and discussed, maybe alongside the change in survivor patterns in the other SY strains, to more accurately assess the roles of these subtelomeric elements.

      Following the reviewer’s suggestion, we compared the circular survivor ratios in SY12 tlc1Δ, SY12YΔ tlc1Δ, SY12XYΔ tlc1Δ and SY12XYΔ+Y tlc1Δ strains (supplementary file 5). The formation of circular survivors appears less efficient in the SY12 tlc1Δ strain, with a ratio of 20%, much lower than that in the SY12YΔ tlc1Δ, SY12XYΔ tlc1Δ or SY12XYΔ+Y tlc1Δ strains. However, it should be noted that SY12 tlc1Δ can generate Type I and Type X survivors, potentially decreasing the ratio of circular survivors.

      Therefore, we further compared the circular survivor ratios in SY12YΔ tlc1Δ, SY12XYΔ tlc1Δ and SY12XYΔ+Y tlc1Δ strains. In the SY12YΔ tlc1Δ strain, circular survivors accounted for 52% (26/50), comparable to 48% (24/50) in the SY12XYΔ tlc1Δ strain, indicating that X-elements exert little effect on the formation of circular survivors. Additionally, the ratio of circular survivors was 44% (22/50) in the SY12XYΔ+Y tlc1Δ strain, also comparable to 48% (24/50) in the SY12XYΔ tlc1Δ strain, suggesting that the Y’-element also has little effect on chromosome circularization. However, because the absolute frequency of survivor formation is unknown, alternative explanations of these data are possible. For example, subtelomeric elements previously suggested to have no impact on the formation of any survivor type might influence every type to a similar extent, resulting in similar ratios across all survivor types. With our current data, it is still uncertain whether X- and Y'-elements modulate the frequency of each type of survivor. Therefore, we did not include these results in the revised manuscript.
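
      One way to check whether ratios such as 52% (26/50), 48% (24/50) and 44% (22/50) are statistically distinguishable is a two-sided Fisher's exact test on the clone counts. The following is a minimal sketch in Python. The choice of Fisher's exact test, the function names and the ASCII strain labels are illustrative assumptions, not an analysis reported in this response. The counts are those quoted in the preceding paragraph, and the test does not address the caveat that the absolute frequency of survivor formation is unknown.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact p-value for the 2x2 table [[a, b], [c, d]]."""
    row1, row2 = a + b, c + d
    col1 = a + c
    denom = comb(row1 + row2, col1)

    def prob(x):
        # Hypergeometric probability of x counts in the top-left cell
        # when all row and column totals are held fixed.
        return comb(row1, x) * comb(row2, col1 - x) / denom

    p_obs = prob(a)
    lo = max(0, col1 - row2)
    hi = min(row1, col1)
    # Sum every table that is no more probable than the observed one.
    total = sum(prob(x) for x in range(lo, hi + 1) if prob(x) <= p_obs * (1 + 1e-9))
    return min(1.0, total)


def percent(circular, total):
    """Share of circular survivors among the clones examined, in percent."""
    return 100 * circular / total


# (circular survivors, clones examined); all strains are in the tlc1-delta
# background. Labels are ASCII spellings of the strain names (assumption).
COUNTS = {
    "SY12Y-delta": (26, 50),
    "SY12XY-delta": (24, 50),
    "SY12XY-delta+Y": (22, 50),
}

ref_circ, ref_total = COUNTS["SY12XY-delta"]
for strain, (circ, total) in COUNTS.items():
    p = fisher_exact_two_sided(circ, total - circ, ref_circ, ref_total - ref_circ)
    print(f"{strain}: {percent(circ, total):.0f}% circular, p vs SY12XY-delta = {p:.3f}")
```

      With only 50 clones per strain, differences of a few percentage points are well within sampling noise, which is consistent with describing these ratios as comparable.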

      4) The authors might want to update some general information about subtelomere structure and their diversity across yeast strain with the recent paper by O'Donnell et al. 2023 Nature Genetics, "Telomere-to-telomere assemblies of 142 strains characterize the genome structural landscape in Saccharomyces cerevisiae".

      Thanks for your advice. We have added this information in the revised manuscript. (p.3 line 51-54)

      5) Although it is cited in the discussion, the recent work by the Malkova lab (Kockler et al. 2021 Mol Cell) could be mentioned in the introduction as it conceptually changes our views on survivor formation, its dynamics and the categorization into Type I and Type II.

      Thanks for your advice. We have added this information in the revised manuscript. (p.5 line 75-78)

      6) p.7 line 128-130: rather than chromosome number, the ratio of survivor types might be controlled by the fraction of subtelomeres with Y'-elements and their relative configuration across chromosomes. A map of the structure of remaining subtelomeres in the SYn strains might be good to have.

      We have added this information in supplementary file 2 in the revised manuscript.

      7) Fig. 1C: in SY9 tlc1Δ, the lane with triangle mark looks like a type II.

      The hybridization pattern of SY9 tlc1Δ clone 2 shows both an amplified Y’L-element and long heterogeneous TG1-3 repeats, so it might be the “hybrid” survivor described by Kockler et al. (2021). Therefore, we classify it as a non-classical survivor.

      8) p.9 line 149: the title of this result section "Y'-element is not essential for the viability of cells carrying linear chromosomes" doesn't reflect well the content of the section, which is more about characterizing the survivor pattern in SY12.

      Thanks for your advice. We have changed the title of this section into “Characterizing the survivor pattern in SY12” in the revised manuscript. (p.9 line 155)

      9) p.10 line 167: that type I can emerge in SY12 indicates that multiple Y'-elements in tandem are not required for type I recombination. I am not sure if this was already known, but it could be noted.

      We appreciate the reviewer’s comment. We have added this information in the revised manuscript. (p.10 line 175-177)

      10) p.18 line 318-320: the deletion of the Y' element also seems to remove the centromere-proximal telomere sequence adjacent to it. Maybe it should be stated as well. Even more importantly, in lines 327-329, the Y'-element that is reintroduced in the strain does not include the centromere-proximal short telomere sequence. This is important to interpret the Southern blots.

      We thank the reviewer for this critical suggestion. The deletion of the Y'-element includes both the Y’- and X-element sequences in XVI-L (supplementary file 4), and the Y’-element in XVI-L does not contain the centromere-proximal telomere sequence. The Y'-element reintroduced into the left arm of Chr 3 in the SY12XYΔ strain was cloned from the native left arm of XVI in the SY12 strain, which does not contain the centromere-proximal short telomere sequence. Besides listing these details in supplementary file 4, we also emphasize them in the revised manuscript (p.21 line 397-398).

      11) p.29 lines 496-497: it seems that X and Y'-elements tend to inhibit formation of circular survivors either directly (by participating in end protection), or by promoting type I and type II, thus reducing the fraction of circular survivors. Maybe this could be added to the conclusion of this section.

      We thank the reviewer for his/her comments and have analyzed survivor types in SY12 tlc1Δ, SY12YΔ tlc1Δ, SY12XYΔ tlc1Δ and SY12XYΔ+Y tlc1Δ strains (supplementary file 5). Circular survivor formation appears less efficient in the SY12 tlc1Δ strain, with a ratio of 20%, much lower than in the SY12YΔ tlc1Δ, SY12XYΔ tlc1Δ or SY12XYΔ+Y tlc1Δ strains. However, it is noteworthy that SY12 tlc1Δ can generate Type I and Type X survivors, potentially impacting the circular survivor ratio.

      We further compared circular survivor ratios in SY12YΔ tlc1Δ, SY12XYΔ tlc1Δ and SY12XYΔ+Y tlc1Δ strains. SY12YΔ tlc1Δ had 52% circular survivors, similar to SY12XYΔ tlc1Δ with 48%, indicating minimal impact of X-elements. Additionally, SY12XYΔ+Y tlc1Δ had 44% circular survivors, also similar to SY12XYΔ tlc1Δ, suggesting that the Y’-element has little effect on chromosome circularization. However, because the absolute frequency of survivor formation is unknown, alternative explanations, such as subtelomeric elements affecting all survivor types similarly, are possible. With our current data, it remains unclear whether X- and Y'-elements are involved in end protection and consequently inhibit the formation of circular survivors.

      Therefore, these results were not included in the revised manuscript.

      12) p.32 line 533: this result section doesn't really fit with the rest of the paper, does it?

      Thanks for your valuable advice. To avoid confusing readers and to preserve the logical flow of the manuscript, we have deleted this section in the revised version.

      13) The methods section does not describe the experiments sufficiently and it often lacks specific details such as the manufacturer or references. Some sections of the methods are more exhaustive than others. They should all be written with the same level of detail in my opinion.

      Thanks for your advice. We have described the experiments in more detail and added the manufacturers or references in the ‘materials and methods’ section of the revised manuscript. (p.41 line 741-745, p.42 line 755-756, p.42 line 762-770, p.43 line 788 and p.45 line 812-813)

      Minor comments, typos and grammatical errors:

      p.3 line 33: "INTROUDUCTION" should be "INTRODUCTION".

      We have corrected it in the revised manuscript. (p.3 line 33)

      p.4 line 54: "S, cerevisiae", use dot instead of comma.

      We have corrected it in the revised manuscript. (p.4 line 57)

      p.4 line 55: I believe TLC1 as the RNA moiety should be in (non-italicized) capital letters and not written as a protein.

      We have corrected it in the revised manuscript. (p.4 line 58)

      p.7 line 115: please indicate that pRS316 uses URA3 as a marker, otherwise the counterselection with 5'-FOA is not obvious.

      We thank the reviewer for the comment. We have added this statement in the revised manuscript. (p.7 line 121-122)

      p.12 line 206: tlc1Δ should be in italic.

      We have corrected it in the revised manuscript. (p.10 line 184)

      p.13 lines 227-229: "where only one hybridization signal", a verb seems to be missing.

      We thank the reviewer for the kind reminder and have corrected the mentioned errors in the revised manuscript. (p.14 line 254-255)

      Reviewer #3

      1) A weakness of the manuscript is the analysis of telomere transcriptional silencing. They state: "The results demonstrated a significant increase in the expression of the MPH3 and HSP32 upon Sir2 deletion, indicating that telomere silencing remains effective in the absence of X and Y'-elements". However, there are no statistical analyses performed as far as I can see. For some of the strains, the significance of the increased expression in sir2 (especially for MPH3) looks questionable. In addition, a striking observation is that the SY12 strain (with only three chromosomes) express much less of both MPH3 and HSP32 than the parental strain BY4742 (16 chromosomes), both in the presence and absence of Sir2. In fact, the expression of both MPH3 and HSP32 in the SY12 sir2 strain is lower than in the BY4742 SIR2+ strain. In addition, relating this work to previous studies of subtelomeric sequences in other organisms would make the discussion more interesting.

      First, I enjoyed reading your manuscript. It would be great if you performed the statistical analysis on the RT-qPCR data in figure 4B and addressed the issue of the difference of the BY4742 and SY12 strains. A model could be that this is a titration effect of silencing proteins due to fewer telomeres, which could be investigated by performing the analyses on more SY-strains with variable numbers of telomeres.

      We highly appreciate the reviewer’s valuable comments and suggestions, which included a point that has also left us confused. We conducted statistical analyses on the RT-qPCR data, and the t-test results revealed that upon the deletion of Sir2, the SY12YΔ, SY12XYΔ and SY12XYΔ+Y strains exhibited a significant increase in MPH3 expression (located on the right arm of chr X) with a P value < 0.05. In the case of SY12, the deletion of Sir2 resulted in an increase in gene expression (P value < 0.1). Similar tendencies were observed in the BY4742 strain. The statistical analyses of the RT-qPCR results on XVI-L mirrored those of X-R.

      The results demonstrated a significant increase in MPH3 and HSP32 expression upon SIR2 deletion in the SY12YΔ, SY12XYΔ and SY12XYΔ+Y strains, leading to the conclusion that telomere silencing remains effective in the absence of X- and Y’-elements. However, as the reviewer has pointed out, no statistically significant differences in MPH3 and HSP32 expression were observed between the SY12 and SY12 sir2Δ strains. For HSP32, this lack of significance may be attributed to the greater distance between HSP32 and telomere XVI-L in SY12 compared to the SY12YΔ, SY12XYΔ or SY12XYΔ+Y strains, resulting in a weaker telomere position effect on HSP32 and a non-significant increase in gene expression in SY12. However, this explanation does not apply to MPH3, as SY12YΔ, with the same distance between MPH3 and telomere X-R as in SY12, still exhibits an effective telomere position effect on MPH3. We cannot provide a compelling explanation at this moment, and we suspect that the lack of statistically significant differences may be due to random clonal variation.

      Additionally, the SY12 strain (with three chromosomes) exhibited lower expression levels of both MPH3 and HSP32 compared to the parental strain BY4742 (with 16 chromosomes). Notably, it has been reported that the expression of genes encoding silencing proteins in SY14 (with one chromosome) was nearly identical to that in BY4742 (with 16 chromosomes) (Shao et al., 2018). Consequently, relative to the reduced chromosome number, the silencing proteins appear to be overexpressed. Therefore, as pointed out by the reviewer, this observed phenomenon may be attributed to a titration effect of silencing proteins due to fewer telomeres. We have added the results of the statistical analyses to Figure 4B.
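
      The titration argument can be made concrete with simple arithmetic. The sketch below assumes that every linear chromosome contributes two telomeres and that the total pool of silencing proteins is the same in each strain (an extrapolation from the SY14 expression data cited in the preceding paragraph). Both are simplifying assumptions for illustration only, and the function names are hypothetical.

```python
def telomere_count(linear_chromosomes):
    """Number of telomeres, assuming two ends per linear chromosome."""
    return 2 * linear_chromosomes


def relative_dose_per_telomere(linear_chromosomes, reference_chromosomes=16):
    """Fold change in silencing protein available per telomere.

    Assumes the total silencing-protein pool equals that of the reference
    strain, so the per-telomere dose scales inversely with telomere number.
    """
    return telomere_count(reference_chromosomes) / telomere_count(linear_chromosomes)


# Chromosome numbers as stated in the response.
for name, n_chromosomes in [("BY4742", 16), ("SY12", 3), ("SY14", 1)]:
    print(
        f"{name}: {telomere_count(n_chromosomes)} telomeres, "
        f"{relative_dose_per_telomere(n_chromosomes):.1f}x dose per telomere"
    )
```

      Under these assumptions, a strain with three chromosomes would have roughly five times more silencing protein per telomere than the 16-chromosome parent, which is the direction of effect the titration model requires.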

      We have related our work with previous studies of subtelomeric sequences in fission yeast in the discussion part. (p.37 line 655-676)

      Minor points are to correct the figure legend for Figure 6 supplement 1 (the strain designations) and line 55, RNAs are written with all caps, i.e. TLC1, and line 537 delete the "which" in the sentence.

      Thanks for your advice. We have corrected them in the revised manuscript.

      1) The strain has been replaced with SY12XYΔ+Y (p.35 line 617, 618 and 620)

      2) “Tlc1” has been replaced with “TLC1” (p.4 line 58).

      3) We have deleted the section “Circular chromosome maintain stable when double knockout of yku70 and tlc1” according to the suggestions raised by reviewers 1 and 2. The deleted section contained the sentence in line 537 that you mentioned.

      Kockler, Z.W., Comeron, J.M., and Malkova, A. (2021). A unified alternative telomere-lengthening pathway in yeast survivor cells. Molecular Cell 81, 1816-1829.e1815.

      Krogh, B.O., and Symington, L.S. (2004). Recombination proteins in yeast. Annu Rev Genet 38, 233-271.

      Lewis, L.K., Storici, F., Van Komen, S., Calero, S., Sung, P., and Resnick, M.A. (2004). Role of the nuclease activity of Saccharomyces cerevisiae Mre11 in repair of DNA double-strand breaks in mitotic cells. Genetics 166, 1701-1713.

      Shao, Y., Lu, N., Wu, Z., Cai, C., Wang, S., Zhang, L.L., Zhou, F., Xiao, S., Liu, L., Zeng, X., et al. (2018). Creating a functional single-chromosome yeast. Nature 560, 331-335.

      Symington, L.S. (2002). Role of RAD52 epistasis group genes in homologous recombination and double-strand break repair. Microbiol Mol Biol Rev 66, 630-670, table of contents.

    1. Author response:

      The following is the authors’ response to the original reviews

      Reviewer #1:

      The chosen classification scheme for aGPCRs may require reassessment and amendment by the authors in order to prevent confusion with previously issued classification attempts of this family. (…) Can the authors suggest another scheme (mind to avoid the subfamily IIX or the alternative ADGRA-G,L,V subfamily schemes of metazoan aGPCRs), and adapt their numbering throughout the text and all figures/supplementary figures/supplementary files?

      We appreciate the reviewer's comment and agree that a different nomenclature should be used for choanoflagellate aGPCRs to avoid possible confusion. We have now re-labeled the choanoflagellate aGPCR subfamilies, previously numbered from I to XIX, using alphabetical enumeration (from A to S). Changes have been made throughout the main text, in Figure 5, and in Supplementary Figures S6 and S7.
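
      The relabeling can be expressed as a simple mapping. The sketch below assumes that the former Roman-numeral labels were converted to letters in order (I to A, II to B, and so on up to XIX to S); the response implies this ordering but does not state it explicitly, and the function names are hypothetical.

```python
import string

ROMAN_VALUES = {"I": 1, "V": 5, "X": 10}


def roman_to_int(numeral):
    """Convert a Roman numeral built from I, V and X to an integer."""
    total = 0
    for i, ch in enumerate(numeral):
        value = ROMAN_VALUES[ch]
        nxt = ROMAN_VALUES[numeral[i + 1]] if i + 1 < len(numeral) else 0
        # Subtractive notation: a smaller value before a larger one is subtracted.
        total += -value if nxt > value else value
    return total


def relabel(numeral):
    """Map a former subfamily label (I..XIX) to its assumed new letter (A..S)."""
    index = roman_to_int(numeral)
    if not 1 <= index <= 19:
        raise ValueError(f"expected a label between I and XIX, got {numeral!r}")
    return string.ascii_uppercase[index - 1]


for old in ["I", "IV", "IX", "XIX"]:
    print(old, "->", relabel(old))
```

      Nineteen subfamilies map onto exactly the letters A through S, so no label is left over or reused.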

      line 10: The abbreviation 'GPCR-TKL/Ks' is not explained.

      Thank you for pointing this out. We have now revised the text to explain the abbreviation:

      “Adhesion GPCRs and a class of GPCRs fused to kinases (the GPCR-TKL/Ks) are the most abundant GPCRs in choanoflagellates.”

      line 30: "7TM domain is diagnostic for GPCRs": strange wording. Use an alternative expression.

      We changed the wording to: 

      “A conserved seven transmembrane (7TM) domain is a hallmark of GPCRs, while the wide spectrum of extracellular and intracellular domains in some GPCRs reflects the diversification of the gene family and its functions (Schiöth and Lagerström 2008).”

      line 33: In the case of rhodopsins, not the GPCR (i.e., the apoprotein) responds directly to photons, but the retinal, which isomerises upon illumination.

      We thank the reviewer for bringing this to our attention, and we have now removed mention of photons from the list of cues detected by GPCRs.

      “For example, the extracellular N-terminus and the three extracellular loops of the 7TM domain respond to a wide range of cues, including odorant molecules, peptides, amines, lipids, nucleotides, and other molecules (Yang et al. 2021).”

      line 111: What are "genome-enabled choanoflagellates"? Explain the term. As it stands, it doesn't make sense to me.

      We meant only to highlight that these two species have sequenced genomes. We have deleted the phrase “genome enabled.”

      “To assess the predictive power of our protein-detection pipeline, we then compared the new GPCR and cytosolic signaling component datasets from two choanoflagellates – Salpingoeca rosetta and Monosiga brevicollis – with previously published GPCR and downstream GPCR signaling component counts for these two species (Nordström et al. 2009a; Krishnan et al. 2012; De Mendoza et al. 2014; Krishnan et al. 2015; Lokits et al. 2018).”

      line 145: Please give a reasoning for the naming of each of the new families (e.g., RemiSens, Hidden Gold, GPCR-TLK/K, etc.) or at least the explanations of the acronyms/names early in the manuscript, even if they are discussed later in more detail.

      Thank you for identifying this as an area of confusion. While we feel that going into the rationale behind each of the names here would interrupt the flow of the manuscript, we have added a phrase encouraging readers to “hold that thought” with the hope that they can wait for the sections that specifically focus on each of these new GPCR families.

      “This left twelve new GPCR families that had not, to our knowledge, been previously detected in choanoflagellates: Rhodopsin, TMEM145, GPR180, TMEM87, GPR155, GPR157, and six additional GPCR families that appear to fall outside all previously characterized GPCR families in eukaryotes. For reasons that will be discussed further below, we have named these six new GPCR families “Rémi-Sans-Famille” (RSF), “Hidden Gold” (Hi-GOLD), GPCR-TKL/K, GPRch1, GPRch2, and GPRch3. (Fig. 1B; Table 1).”

      lines 297/298 and 2049: Rename tethered agonist "peptide" to "element". Synthetic peptides resembling the TA were used in experiments to test for the sufficiency of the TA for receptor activation, but because the naturally occurring TAs are part of the receptor protein, they are not peptides.

      Thank you for pointing this out. We have revised the text as suggested.

      line 2026: I think the letters in the acronym "CMR" are mixed up and were intended to read "CRM".

      Good catch! We have corrected the text.

      line 2048: "diagnostic" again. Change to "tell-tale", "hallmark", or another similar descriptor.

      We have corrected the text accordingly.

      2058: Strike "motif" in order to avoid confusion with the now obsolete term "GPS motif", which entailed the five most C-terminal β-strands of GAIN subdomain B (and thus neither the full GAIN domain nor the GPS).

      Thank you for pointing this out. We have corrected the text.

      Figure 5: Did the authors also find homologs placed in the aGPCR family based on their 7TM domain sequence but lacking a GAIN domain similar to vertebrate ADGRA/GPR123, the only aGPCR known to lack a GAIN domain (10.1016/j.tips.2013.06.002)? Irrespective of the authors' findings or non-finding on that matter, please insert a note on this in the results text.

      We thank the reviewer for bringing this interesting point to our attention. We have now added a new supplementary panel, Fig. S9A, to address the reviewer's comment. We also modified the legend of Fig. S9 to take this change into account and uploaded a new supplementary data file 20 to support Fig. S9A. Finally, we revised the main text under the section “Adhesion GPCRs” as requested:

      Lines 328-331: “While the GAIN and aGPCR 7TM domains evolved before the origin of opisthokonts (Araç et al. 2012; Krishnan et al. 2012; De Mendoza et al. 2014), we detected the fusion of these two domains into a single module (GAIN/7TM) in most, but not all, holozoan aGPCRs (Fig. 5D, Fig. S7B and S9A; Supplementary file 20; Prömel et al. 2013; Krishnan et al. 2014).”

      Reviewer #2:

      While the study contributes several interesting observations, it does not radically revise the evolutionary history of the GPCR family. However, in an era increasingly concerned with the reproducibility of scientific findings, this is arguably a strength rather than a weakness. It is encouraging to see that previously established patterns largely hold, and that with expanded sampling and improved methods, new insights can be gained, especially at the level of specific GPCR subfamilies. Then, no functional follow-ups are provided in the model system Salpingoeca rosetta, but I am sure functional work on GPCRs in choanoflagellates is set to reveal very interesting molecular adaptations in the future.

      We agree with the reviewer and anticipate that this work will provide a useful resource to motivate the future functional characterization of GPCRs in choanoflagellates, other CRMs, as well as in metazoans.

      The GPCR-TKL fusion is a particularly interesting finding, especially given the presence of such sequences in sponges. This could potentially represent a synapomorphy shared between sponges and choanoflagellates, later lost in other animals. The authors mention that BLASTP searches using the kinase domain recover the sponge GPCR-TKLs, suggesting the fusion may be ancestral. It would be useful to include phylogenetic trees of both the GPCR and TKL domains to assess this possibility. The authors might also consider examining sponge genomes released by the DTOL project to increase representation from this group.

      We agree and thank the reviewer for this suggestion. We have now added the requested phylogenetic analyses to the new Figure S17, revised the supplementary files and Methods accordingly, and commented on these results in the main text under the section “GPCR-TKL/K and GPCR-TKs”.

      Lines 579 – 589: “While no metazoan homologs were found when using the 7TM domain of choanoflagellate GPCR-TKs as queries, using the conserved tyrosine kinase domains as queries recovered GPCR-TKs in sponges but not in other metazoan lineages or other holozoans (Fig. S17E). To test whether GPCR-TKs in sponges and choanoflagellates are homologous, we performed phylogenetic analyses of their TK and 7TM domains (Fig. S17F and G). While the TK domains of GPCR-TKs from sponges and choanoflagellates formed a well-supported clade, their 7TM domains did not. These results point to a heterogeneous evolutionary history that may include domain swapping (i.e. ancestral GPCR-TKs in which the 7TM domain was replaced in either the sponge or choanoflagellate lineages) or convergent evolution, in which homologous TK domains fused with unrelated 7TM domains in the sponge and choanoflagellate lineages.”

      Added to the Method section “Sequence alignment and phylogenetic analyses”:

      Lines 913 – 933: “Phylogenetic analyses of holozoan aGPCRs, Glutamate Receptors, and Gα subunits, and the 7TM and Kinase domains from GPCR TK/TKL/Ks were performed in this study. (…) To construct the phylogenies of the Kinase domain and 7TM domain from the GPCR TK/TKL/Ks, we first built a dataset including all the GPCR TK/TKL/Ks sequences identified in choanoflagellates and in sponges, as well as the GPCR TKL/Ks previously published in oomycetes and amoebozoans (Van Den Hoogen et al. 2018). We extracted the 7TM domain and Kinase domain from each sequence by combining the transmembrane domain prediction tool TMHMM-2.0 and the protein domain prediction tool InterProScan with the alignment tool MAFFT (E-INS-I algorithm) on Geneious Prime v2024.07 (Supplementary Files 30 and 32). We then aligned the aGPCR, Glutamate Receptor, and GPCR TK/TKL/K 7TMs, the GPCR TK/TKL/Ks Kinase domain, or the full-length Gα sequences using MAFFT with the E-INS-I algorithm. The resulting alignments were then used for Maximum-likelihood and/or Bayesian inference of phylogenies (Fig. 3B, Fig. 5A, Fig. S3D, Fig. S6A, and Fig. S17F and G; Supplementary Files 5, 9, 16, 18, 31, and 33).”

      Rhodopsin-like receptors are proposed in the discussion to be potential cases of lateral gene transfer (LGT) between eukaryotes. To support or refute this hypothesis, it would be valuable to place the choanoflagellate and ichthyosporean Rhodopsins within a broader phylogeny of this family, including (a few) representatives from animals and other eukaryotes. Even if deep branching relationships remain unresolved, signs such as unusually short branches could point toward recent LGT events.

      Thank you for your suggestion. While we originally considered testing these alternative hypotheses in this manuscript by building a phylogeny, the rapid sequence evolution of the Rhodopsin family has stymied similar efforts in the past and instead motivated others to use clustering approaches like those used in our study (Hu et al. 2017; Thiel et al. 2023). Unfortunately, these types of analyses cannot be used to readily identify instances of LGT.

      Therefore, following the suggestion of the reviewer, we bit the bullet and performed phylogenetic analyses on the sequences in question. Unfortunately, these analyses were completely inconclusive, and we feel they do not warrant inclusion in the manuscript. The topologies of the sequence trees recovered were poorly supported and sensitive to most of the variables we tested – the set of rhodopsin sequences included, the multiple alignment algorithms used, and the probabilistic methods employed to infer the phylogenies. 

      Instead, we have revised the manuscript to highlight the challenge of differentiating between the different hypotheses that are consistent with the phylogenetic distribution of Rhodopsins:

      Lines 670 – 678: “Thus, while it is formally possible that Rhodopsins existed in stem choanoflagellates and were lost in most modern choanoflagellate lineages, either horizontal gene transfer or convergent evolution in the shared ancestor of S. macrocollata and S. punica are similarly plausible explanations for their presence in these species. Differentiating between these alternative evolutionary scenarios is challenging because of the rapid rate of sequence evolution within the family and the resultant loss of phylogenetic signal. Our own preliminary investigations of Rhodopsin evolution in non-metazoans were inconclusive. Therefore, ambiguities about the provenance and function of CRM Rhodopsins currently obscure the ancestry of metazoan Rhodopsins and opsins.”

      While the study surveys most available holozoan genomes, it appears that the genomes of Amoebidium spp., which are cited in the manuscript, were not included. It may not be necessary to repeat all analyses with these two species (A. appalachense and A. parasiticum), but a preliminary search indicates the presence of four candidate 7tm_1 (Rhodopsin-like) proteins in their proteomes. These may warrant closer inspection (e.g., via BLASTP against animal databases) to confirm whether they are genuine GPCRs or false positives.

      Author response image 1.

      We thank the reviewer for bringing these sequences to our attention. To be clear, we did not analyze the Amoebidium spp. genome and we can find no reference to it in our manuscript. If the reviewer had the impression that the genome was analyzed, we would be grateful to know the source of the confusion so that it can be corrected. (We did not intentionally exclude the genome; it simply was not available on the Multicell Genome database from which we retrieved the ichthyosporean genomes and transcriptomes used in this study.)

      Nevertheless, out of curiosity, we have now analyzed the sequences provided by the reviewer and summarize our findings here for the interest of the reviewer. Although the sequences were annotated as 7tm_1 (Rhodopsin-like) proteins in the original genome study, none of these sequences group with metazoan or choanoflagellate Rhodopsins in our clustering analysis; instead, we found that these putative GPCRs form a distinct cluster that only weakly resembles cAMP receptors, both on the basis of their sequence and predicted structures. 

      It is not surprising to find new GPCR clusters as new taxa are folded into the study, and these Amoebidium sequences do not add to our understanding of Rhodopsin evolution. Therefore, we have not added their analysis to the manuscript, but we hope the reviewer finds our quick analysis of interest.

      Author response image 2.

      In Figure 2, perhaps expanding the other holozoan clades would have been nice, as there are not too many species, but I understand if that's beyond the point of the manuscript, focused on choanoflagellates.

      Thank you for this comment. However, given the focus of this study, we feel that an expansion of the other holozoan clades would reduce the clarity of the figure.

      line 87 - "To this end, the 671 validated choanoflagellate GPCRs were sorted by sequence similarity, resulting in 18 clusters." Some details in the results section would be nice, or at least clear references to where this is explained in more detail. How were the extra choanoflagellate GPCRs added if they failed to be identified with quite sensitive HMM profiles?

      We apologize for the possible confusion and thank the reviewer for the suggestion; we have now added specific references to the related sections from the material and methods for interested readers.

      We believe that the "extra choanoflagellate GPCRs" mentioned by the reviewer refer to the choanoflagellate GPCRs that failed to be detected when the choanoflagellate genomes and transcriptomes were searched with the predominantly metazoan-derived GPCRHMM and HMMs from the GPCR_A Pfam clan (CL0192). We were able to recover these extra choanoflagellate GPCRs by using custom choanoflagellate-specific GPCR HMMs and by blasting the choanoflagellate GPCRs previously identified as queries against the 23 choanoflagellate proteomes. We hope that the referencing of the Methods section "Recovering additional choanoflagellate GPCRs using choanoflagellate GPCR BLAST queries and custom choanoflagellate GPCR HMMs", in lines 91 and 93, will help clarify this point.

      line 108 - Well, from the figure it seems that most eukaryotes have an 'animal-like' G protein signalling, so that's perhaps more of an eukaryotic signature than something that puts choanoflagellates and animals together.

      Excellent point! We have revised the text.

      line 132 - It is unclear what the criteria are to include these taxa as helpers for choanoflagellate classification, and not adding the other unicellular holozoans. Just some text justification could help.

      Thank you for pointing this out. We have added an explanation of the rationale to the methods — section “Clustering of the 918 validated choanoflagellate GPCRs” — and referred to it in the main text.

      New text added to methods:

      “The non-choanoflagellate sequences added to the dataset were either top blast hits recovered after searching the entire Eukprot v3 dataset (993 species) with choanoflagellate GPCRs as queries, or previously published and well-documented GPCR sequences from metazoans.”

      line 145 - These families are listed, but perhaps it would be nice to explicitly mention that they will be covered in more detail later on in the manuscript. I found myself wondering about those exotic names, until I reached the sections in the manuscript where they are explained.

      Thank you for this suggestion. We have now modified our sentence to refer to the related sections.

      “For reasons that will be discussed further below, we have named these six new GPCR families “Rémi-Sans-Famille” (RSF), “Hidden Gold” (Hi-GOLD), GPCR-TKL/K, GPRch1, GPRch2, and GPRch3. (Fig. 1B; Table 1).”

      line 199 - perhaps would be nice to explain domain architecture of validated Dictyostelium GABA-like receptors (ANF domain?).

      Thank you for your suggestion. We have now modified the sentence to mention the protein domain composition of the validated GABA-like receptor, GrlE, in Dictyostelium.

      “The Glutamate Receptors from the amoebozoan Dictyostelium discoideum, of which at least one, GrlE, binds both GABA and Glutamate presumably through its conserved ANF domain (Anjard and Loomis 2006; Taniura et al. 2006; Wu and Janetopoulos 2013), grouped separately from metazoan and CRM GPCRs in our analysis.”

      Figure S4 - Perhaps a stacked bar chart would be easier to browse than a bunch of pie charts, notoriously difficult to quantify.

      Thank you for this comment. Opinions differ on whether pie charts or bar charts are more effective in this context (including among the authors of this manuscript). However, we think the point of Figure S4 is a minor one, likely to be appreciated by only a small number of readers, and have therefore left the data presentation as it was in the original submission.

    1. Author response:

      The following is the authors’ response to the original reviews.

      Public Reviews:

      Reviewer #1 (Public review):

      Li et al. investigate Ca2+ signaling in T. gondii and argue that Ca2+ tunnels through the ER to other organelles to fuel multiple aspects of T. gondii biology. They focus in particular on TgSERCA as the presumed primary mechanism for ER Ca2+ filling. However, when TgSERCA was knocked out, there was still a Ca2+ release in response to TG.

      Note that we did not generate a complete SERCA knockout, as this gene is essential, and its complete loss would not permit the isolation of viable parasites. Instead, we created conditional mutants that downregulate the expression of SERCA. Importantly, some residual activity is present in the mutant after 24 h of ATc treatment as shown in Fig 4C. This is consistent with our Western blots, which demonstrate the presence of residual SERCA protein at 1, 1.5 and 2 days post ATc treatment (Fig. 3B). We have clarified this point in the revised manuscript (lines 232-233). See also lines 97-102.

      Overall, the Ca2+ signaling data do not support the conclusion of Ca2+ tunneling through the ER to other organelles; in fact, they argue for direct Ca2+ uptake from the cytosol. The authors show EM membrane contact sites between the ER and other organelles, so Ca2+ released by the ER could presumably be taken up by other organelles, but that is not ER Ca2+ tunneling. They clearly show that SERCA is required for T. gondii function.

      Overall, the data presented do not fully support the conclusions reached.

      We agree that the data do not support Ca²⁺ tunneling as defined and characterized in mammalian cells. In response to this comment, we have modified the title and the text accordingly.

      However, we respectfully would like to emphasize that the study demonstrates more than just the role of SERCA in T. gondii “function”. Our findings reveal that the ER, through SERCA activity, sequesters calcium following influx through the PM (see reviewer 2 comment). The ER calcium pool is important for replenishing other intracellular compartments.

      The experiments support a model in which the ER actively takes up cytosolic Ca²⁺ as it enters the parasite and contributes to intracellular Ca²⁺ redistribution during transitions between distinct extracellular calcium environments. We believe that the role of the ER in modulating intracellular calcium dynamics is demonstrated in Figures 1H–K, 4G-H, and 5H–K. To highlight the relevance of these findings, we have included an expanded discussion in the revised manuscript. See lines 443-449 and 510-522.

      Data argue for direct Ca2+ uptake from the cytosol

      The ER most likely takes up calcium from the cytosol following its entry through the PM and redistributes it to the other organelles. We deleted any mention of the word “tunneling” and replaced it with transfer and re-distribution as they reflect our experimental findings more accurately.

      We interpret the experiments shown in Figure 1 H and I as re-distribution because the amount of calcium released after nigericin or GPN is greatly enhanced after TG addition. We first add calcium to allow intracellular stores to become filled, followed by the addition of TG, which allows calcium leakage from the ER. This leaked calcium can either enter the cytosol and be pumped out or be taken up by other organelles. Our interpretation is that this process leads to an increased calcium content in acidic compartments.

      We conducted an additional experiment in which SERCA was inhibited prior to calcium addition, allowing cytosolic calcium to be exported or taken up by acidic stores. We observed a change in the GPN response (Fig. S2A), possibly indicating that the PLVAC can sequester calcium when SERCA is inactive. While this may support the reviewer’s view, TG treatment does not reflect physiological conditions and may enhance calcium transfer to other compartments. Although the result is interesting, interpretation is complicated by the use of parasites in suspension and drug exposure in solution. Single-parasite measurements are not feasible due to weak signals, and adhered parasites are even less physiological than those in suspension.

      In support of our view, the experiments shown in Figs 4G and H show that downregulating SERCA significantly reduces the response to GPN, indicating diminished acidic store loading. In Fig 5I we observe that mitochondrial calcium uptake is reduced in the iDSERCA (+ATc) mutant in response to GPN. Fig 2B demonstrates that TgSERCA can take up calcium at 55 nM, close to resting cytosolic calcium, while in Figures 5E and S5B we show that the mitochondrion is not responsive to an increase of cytosolic calcium. Uptake by the mitochondria requires much higher concentrations (Fig 5B-C), which may be achieved within microdomains at MCS between the ER and mitochondrion. This is also consistent with findings reported by Li et al (Nat Commun. 2021), where a similar microdomain-mediated transfer of calcium to the apicoplast (Fig. 7 E and F of the mentioned reference) was observed.

      Reviewer #2 (Public review):

      The role of the endoplasmic reticulum (ER) calcium pump TgSERCA in sequestering and redistributing calcium to other intracellular organelles following influx at the plasma membrane.

      That T. gondii transitions through life cycle stages within and exterior to the host cells, with very different exposures to calcium, adds significance to the current investigation of the role of the ER in redistributing calcium following exposure to physiological levels of extracellular calcium.

      They also use a conditional knockout of TgSERCA to investigate its role in ER calcium store-filling and the ability of other subcellular organelles to sequester and release calcium. These knockout experiments provide important evidence that ER calcium uptake plays a significant role in maintaining the filling state of other intracellular compartments.

      We thank the reviewer.

      While it is clearly demonstrated, and not surprising, that the addition of 1.8 mM extracellular CaCl2 to intact T. gondii parasites preincubated with EGTA leads to an increase in cytosolic calcium and subsequent enhanced loading of the ER and other intracellular compartments, there is a caveat to the quantitation of these increases in calcium loading. The authors rely on the amplitude of cytosolic free calcium increases in response to thapsigargin, GPN, nigericin, and CCCP, all measured with fura2. This likely overestimates the changes in calcium pool sizes because the buffering of free calcium in the cytosol is nonlinear, and fura2 (with a Kd of 100-200 nM) is a substantial, if not predominant, cytosolic calcium buffer. Indeed, the increases in signal noise at higher cytosolic calcium levels (e.g. peak calcium in Figure 1C) are indicative of fura2 ratio calculations approaching saturation of the indicator dye.

      We acknowledge the limitations associated with using Fura-2 for cytosolic calcium measurements. However, according to the literature (Grynkiewicz, G. et al. (1985). J. Biol. Chem. 260 (6): 3440–3450. PMID 3838314), Fura-2 is suited for measurements between 100 nM and 1 µM calcium. The responses in our experiments were within that range, and the experiments with the SERCA mutant and mitochondrial GCaMPfs support the conclusions of our work.

      However, we agree with the reviewer that the experiment shown in Fig 1C (now Fig 1D) presents a response that approaches the limit of the linear range of Fura-2. In response to this, we have replaced this panel with a more representative experiment that remains within the linear range of the indicator (revised Fig 1D). Additionally, we have included new experiments adding GPN along with corresponding quantifications, which further support our conclusions regarding calcium dynamics in the parasite.
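
      The saturation concern discussed above can be illustrated with the standard ratiometric calibration equation for Fura-2, [Ca²⁺] = Kd · β · (R − Rmin) / (Rmax − R). The sketch below is only an illustration under stated assumptions: the calibration constants (R_MIN, R_MAX, BETA) are hypothetical placeholders and not values from this study, and the Kd is simply taken from the 100-200 nM range quoted by the reviewer. It shows that the same small error in the measured ratio translates into a much larger error in estimated calcium as R approaches Rmax.

```python
# Illustrative sketch of the Fura-2 ratiometric calibration.
# All calibration constants below are hypothetical placeholders,
# not values measured or reported in the study under review.

def fura2_calcium(ratio, r_min, r_max, kd_nm, beta):
    """Free [Ca2+] in nM from a 340/380 ratio: Kd * beta * (R - Rmin) / (Rmax - R)."""
    if not r_min < ratio < r_max:
        raise ValueError("ratio must lie strictly between r_min and r_max")
    return kd_nm * beta * (ratio - r_min) / (r_max - ratio)


def fura2_sensitivity(ratio, r_min, r_max, kd_nm, beta):
    """d[Ca2+]/dR in nM per ratio unit: Kd * beta * (Rmax - Rmin) / (Rmax - R)**2."""
    if not r_min < ratio < r_max:
        raise ValueError("ratio must lie strictly between r_min and r_max")
    return kd_nm * beta * (r_max - r_min) / (r_max - ratio) ** 2


# Hypothetical calibration: Kd taken from the 100-200 nM range cited by the reviewer.
R_MIN, R_MAX, KD_NM, BETA = 0.5, 10.0, 200.0, 5.0

if __name__ == "__main__":
    for ratio in (1.0, 3.0, 6.0, 9.0):
        ca = fura2_calcium(ratio, R_MIN, R_MAX, KD_NM, BETA)
        slope = fura2_sensitivity(ratio, R_MIN, R_MAX, KD_NM, BETA)
        print(f"R = {ratio:4.1f}  [Ca2+] = {ca:8.1f} nM  d[Ca2+]/dR = {slope:8.1f} nM per ratio unit")
```

      Because the slope grows with 1/(Rmax − R)², a fixed amount of ratio noise produces increasingly noisy calcium estimates near saturation, which is consistent with the reviewer's reading of the original trace and with the decision to replace the panel with a recording that stays in the linear range.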

      Another caveat, not addressed, is that loading of fura2/AM can result in compartmentalized fura2, which might modify free calcium levels and calcium storage capacity in intracellular organelles.

      We are aware of the potential issue of Fura-2 compartmentalization, and our protocol was designed to minimize this effect. We load cells with Fura-2 for 26 min at room temperature, then maintain them on ice, and restrict the use of loaded parasites to 2-3 hours. We have observed evidence of compartmentalization as this is reflected in increasing concentrations of resting calcium with time. We carry out experiments within a time frame in which the resting calcium stays within the 100 nM range. We have included a sentence in the Materials and Methods section. Lines 604-606.

      Additionally, following this reviewer’s suggestion, we performed further experiments to directly assess compartmentalization. See below the full response to reviewer 2.

      The finding that the SERCA inhibitor cyclopiazonic acid (CPA) only mobilizes a fraction of the thapsigargin-sensitive calcium stores in T. gondii coincides with previously published work in another apicomplexan parasite, P. falciparum, showing that thapsigargin mobilizes calcium from both CPA-sensitive and CPA-insensitive calcium pools (Borges-Pereira et al., 2020, DOI: 10.1074/jbc.RA120.014906). It would be valuable to determine whether this reflects the off-target effects of thapsigargin or the differential sensitivity of TgSERCA to the two inhibitors.

      This is an interesting observation, and we now include a discussion of this result considering the Plasmodium study and include the citation. Lines 436-442.

      Figure S1 suggests differential sensitivity, and it shows that thapsigargin mobilizes calcium from both CPA-sensitive and CPA-insensitive calcium pools in T. gondii. Also important is that we used 1 µM TG as we are aware that TG has shown off-target effects at higher concentrations. TG is a well-characterized, irreversible SERCA inhibitor that ensures complete and sustained inhibition of SERCA activity. In contrast, CPA is a reversible inhibitor whose effectiveness is influenced by ATP levels, and it may only partially inhibit SERCA or dissociate over time, allowing residual Ca²⁺ reuptake into the ER.

      Additionally, as suggested by the reviewer, we performed experiments using the Mag-Fluo-4 protocol to compare the inhibitory effects of CPA and TG. These results are presented in Fig. S3 (Lines 217-223). Under the conditions of the Mag-Fluo-4 assay with digitonin-permeabilized cells, both TG and CPA showed similar rates of Ca²⁺ leakage following the addition of the inhibitor. This may indicate that under the conditions of the Mag-Fluo-4 experiments the rate of Ca²⁺ leak is mostly determined by the intrinsic leak mechanism and not by the nature of the inhibitor. By contrast, in intact Fura-2–loaded cells, CPA induces a smaller cytosolic Ca²⁺ increase than TG, consistent with less efficient SERCA inhibition likely due to its reversibility and possibly incomplete inhibition under cellular conditions.

      The authors interpret the residual calcium mobilization response to Zaprinast observed after ATc knockdown of TgSERCA (Figures 4E, 4F) as indicative of a target calcium pool in addition to the ER. While this may well be correct, it appears from the description of this experiment that it was carried out using the same conditions as Figure 4A where TgSERCA activity was only reduced by about 50%.

      We partially agree with the reviewer that 50% knockdown of TgSERCA means that the ER may still be targeted by Zaprinast, and that there is no definitive evidence of the involvement of another calcium pool. The Mag-Fluo-4 experiment, while we acknowledge that the fluorescence of Mag-Fluo-4 is not linear with calcium, indicates that SERCA activity is present even after 24 hr of ATc treatment. However, when Zaprinast is added after TG, we observed a significant calcium release in wild type cells. This result suggests the presence of a large calcium pool other than the one mobilized by TG (PMID: 2693306).

      We recently published work describing the Golgi as a calcium store in Toxoplasma (PMID: 40043955) and we showed in Fig. S4 D-G of that work, that GPN treatment of tachyzoites loaded with Fura-2 diminished the Zaprinast response indicating that they could be impacting a similar store. In the present study we performed additional experiments in which TG was followed by GPN and Zaprinast showing a similar pattern. GPN significantly diminished the Zaprinast response. These results are shown now in Figure S2B. We address these possibilities in the discussion and interpretation of the result. Lines 451-460.

      The data in Figures 4A vs 4G and Figures 4B vs 4H indicate that the size of the response to GPN is similar to that with thapsigargin in both the presence and absence of extracellular calcium. This raises the question of whether GPN is only releasing calcium from acidic compartments or whether it acts on the ER calcium stores, as previously suggested by Atakpa et al. 2019 DOI: 10.1242/jcs.223883. Nonetheless, Figure 1H shows that there is a robust calcium response to GPN after the addition of thapsigargin.

      The results of the indicated experiments did not exclude the possibility that GPN can also mobilize some calcium from the ER in addition to acidic organelles. However, we do not have any evidence that GPN mobilizes calcium from the ER. Based on our unpublished work, we think GPN mainly releases calcium from the PLVAC. We included the mentioned citation and discussed the result considering the possibility that GPN may be acting on more than one store. Lines 451-460.

      An important advance in the current work is the use of state-of-the-art approaches with targeted genetically encoded calcium indicators (GECIs) to monitor calcium in important subcellular compartments. The authors have previously done this with the apicoplast, but now add the mitochondria to their repertoire. Despite the absence of a canonical mitochondrial calcium uniporter (MCU) in the Toxoplasma genome, the authors demonstrate the ability of T. gondii mitochondria to accumulate calcium, albeit at high calcium concentrations. Although the calcium concentrations here are higher than needed for mammalian mitochondrial calcium uptake, there too calcium uptake requires calcium levels higher than those typically attained in the bulk cytosolic compartment. And just like in mammalian mitochondria, the current work shows that ER calcium release can elicit mitochondrial calcium loading even when other sources of elevated cytosolic calcium are ineffective, suggesting a role for ER-mitochondrial membrane contact sites. With these new tools in hand, it will be of great value to elucidate the bioenergetics and transport pathways associated with mitochondrial calcium accumulation in T. gondii.

      We thank the reviewer for praising our work. Studies of the bioenergetics and transport pathways associated with mitochondrial calcium accumulation are part of our future plans, as mentioned in lines 520-522 and 545.

      The current studies of calcium pools and their interactions with the ER and dependence on SERCA activity in T. gondii are complemented by super-resolution microscopy and electron microscopy that do indeed demonstrate the presence of close appositions between the ER and other organelles (see also videos). Thus, the work presented provides good evidence for the ER acting as the orchestrating organelle delivering calcium to other subcellular compartments through contact sites in T. gondii, as has become increasingly clear from work in other organisms.

      Thank you

      Reviewer #3 (Public review):

      This manuscript describes an investigation of how intracellular calcium stores are regulated and provides evidence that is in line with the role of the SERCA-Ca2+ATPase in this important homeostasis pathway. Calcium uptake by mitochondria is further investigated and the authors suggest that ER-mitochondria membrane contact sites may be involved in mediating this, as demonstrated in other organisms.

      The significance of the findings is in shedding light on key elements within the mechanism of calcium storage and regulation/homeostasis in the medically important parasite Toxoplasma gondii whose ability to infect and cause disease critically relies on calcium signalling. An important strength is that despite its importance, calcium homeostasis in Toxoplasma is understudied and not well understood.

      We agree with the reviewer. Thank you

      A difficulty in the field, and a weakness of the work, is that following calcium in the cell is technically challenging and thus requires reliance on artificial conditions. In this context, the main weakness of the manuscript is the extrapolation of data. The language used could be more careful, especially considering that the way to measure the ER calcium is highly artificial - for example utilising permeabilization and over-loading the experiment with calcium. Measures are also indirect - for example, when the response to ionomycin treatment was not fully in line with the suggested model the authors hypothesise that the result is likely affected by other storage, but there is no direct support for that.

      The Mag-Fluo-4-based protocol for measuring intraluminal calcium is well established and has been extensively used in mammalian cells, DT40 cells and other cells for measuring intraluminal calcium, activity of SERCA and response to IP3 (Some examples: PMID: 32179239, PMID: 15963563, PMID: 19668195, PMID: 30185837, PMID: 19920131).

      Furthermore, we have successfully employed this protocol in previous work, including the characterization of the Trypanosoma brucei IP3R (PMID: 23319604) and the assessment of SERCA activity in Toxoplasma (PMID: 40043955 and 34608145). The citation PMID: 32179239 provides a detailed description of the protocol, including references to its prior use. In addition, the schematic at the top of Figure 2 summarizes the experimental workflow, reinforcing that the protocol follows established methodologies. We included more references and an expanded discussion, lines 425-435.

      We respectfully disagree with the concern regarding potential calcium overloading. The cells used in our assays were permeabilized, which is a critical step that allows precise control of calcium concentrations. All experiments were conducted at 220 nM free calcium, a concentration within the physiological range of cytosolic calcium fluctuations. This concentration was consistently used across all studies described above. Importantly, permeabilization ensures that the dye present in the cytosol becomes diluted and allows MgATP (which cannot cross intact membranes) to access the ER membrane, in addition to allowing the ER to be exposed to precise calcium concentrations.

      The Mag-Fluo-4 loading conditions are designed to allow compartmentalization of the indicator to all intracellular compartments, and the calcium uptake stimulated by MgATP occurs exclusively in the compartment occupied by SERCA, as only SERCA is responsive to MgATP-dependent transport in this experimental setup.

      Regarding the use of IO, we would like to clarify that its broad-spectrum activity is well-documented. As a calcium ionophore, IO facilitates calcium release across multiple membranes, not just the ER, leading to a more substantial calcium release compared to the more selective effect of TG. The results observed with IO were consistent with this expected broader activity and support our interpretation.

      Lastly, we emphasize that the experiment in Figure 2 was designed specifically to assess SERCA activity in situ under defined conditions. It was not intended to provide a comprehensive characterization of the role of TgSERCA in the parasite. We now clarify this distinction in the revised Discussion lines 425-435.

      Below we provide some suggestions to improve controls, however, even with those included, we would still be in favour of revising the language and trying to avoid making strong and definitive conclusions. For example, in the discussion perhaps replace "showed" with "provide evidence that are consistent with..."; replace or remove words like "efficiently" and "impressive"; revise the definitive language used in the last few lines of the abstract (lines 13-17); etc. Importantly we recommend reconsidering whether the data is sufficiently direct and unambiguous to justify the model proposed in Figure 7 (we are in favour of removing this figure at this early point of our understanding of the calcium dynamic between organelles in Toxoplasma).

      We thank the reviewer for the suggestions and we modified the language as suggested. We limited the use of the word "showed" to references to previously published work. We deleted the other words.

      Figure 7 is intended as a conceptual model to summarize our proposed pathways, and, like all models, it represents a working hypothesis that may not fully capture the complexity of calcium dynamics in the parasite. In light of the reviewer’s comments, we revised the figure and legend to clearly distinguish pathways for which there is experimental evidence from those that are hypothetical.

      Another important weakness is poor referencing of previous work in the field. Lines 248-250 read almost as if the authors originally hypothesised the idea that calcium is shuttled between ER and mitochondria via membrane contact sites (MCS) - but there is extensive literature on other eukaryotes which should be first cited and discussed in this context. Likewise, the discussion of MCS in Toxoplasma does not include the body of work already published on this parasite by several groups. It is informative to discuss observations in light of what is already known.

      The sentence in which we state the hypothesis about the calcium transfer refers specifically to Toxoplasma. To clarify this, we have now added the phrase “In mammalian cells” (Line 311) and included additional citations, as suggested by the reviewer. While only a few studies have described membrane contact sites (MCSs) in Toxoplasma, we do cite several pertinent articles (e.g., lines 479-486). We believe that we cited all articles mentioning MCS in T. gondii.

      However, we must clarify to the reviewer that the primary focus of our study is not to characterize or confirm the presence of MCSs in T. gondii, but rather to demonstrate functional calcium transfer between the ER and mitochondria. Our data support the conclusion that this transfer requires close apposition of these organelles, consistent with the presence of MCSs.

      Recommendations for the authors:

      Reviewer #1 (Recommendations for the authors):

      (1) Line 45: change influx to release as Ca2+ influx usually referred to Ca2+ entry from the extracellular space. Same for line 71.

      Corrected, lines 47 and 73.

      (2) Line 54: consider toning down the strong statement of 'widely' accepted as ER Ca2+ subdomain heterogeneity remains somewhat debated.

      Changed the sentence to “it has been proposed”, Line 56.

      (3) Line 119-21: A lower release in response to TG is typical and does not reflect TG specific for SERCA. It is due to the slow kinetics of Ca2+ leak out of the ER allowing other buffering and transport mechanisms to act. Also, could be a reflection of the duration after TG treatment to allow complete store depletion. Figure S1A-B shows that there is still Ca2+ in the stores following TG but the TG signal does not go back to baseline arguing that the leak is still active. Hence the current data does not address the specificity of TG for TgSERCA. Please revise the statement accordingly.

      Thank you for the suggestion. We changed the sentence to: “This result could reflect the slow kinetics of Ca²⁺ leak from the ER, allowing other buffering and transport mechanisms to mitigate the phenomenon. Alternatively, it may indicate the duration after TG treatment allowing time to complete store depletion. As shown in Figure S1A-B, residual Ca²⁺ remains in the stores after TG treatment, and the TG-induced phenomenon does not return to baseline, suggesting that the leak remains active”. Lines 124-128.

      (4) Figure 1C: the authors interpret the data 'This Ca2+ influx appeared to be immediately taken up by the ER as the response to TG was much greater in parasites previously exposed to extracellular Ca2+'. I don't understand this interpretation; in Ca2+-containing solution it would be expected to have a larger signal, as TG is likely to activate store-operated Ca2+ entry which would contribute to a larger cytosolic Ca2+ transient. Does T. gondii have SOCE? It cannot be uptake into the ER as SERCA is blocked. Unless the authors are arguing for another ER Ca2+ uptake pathway? But Ca2+ uptake into the ER would lower the signal, whereas the data show an increased signal.

      We pre-incubated the suspension with calcium to allow filling of the stores, while SERCA is still active, and added thapsigargin (TG) at 400 seconds to measure calcium release. The experiment was designed to introduce the concept that the ER may have access to extracellular calcium, a phenomenon not yet clearly demonstrated in Toxoplasma. We did not expect to have less release by TG, but if the ER is not efficient in filling after extracellular calcium entry it would be expected to have a similar response to TG. Yes, it is very possible that when we add TG we are also seeing more calcium entry through the PM, as we previously proposed that the increased cytosolic Ca²⁺ may regulate Ca²⁺ entry. However, the evidence does not support that this increased entry would be triggered by store depletion. The experiments with the SERCA mutant (Fig. 4D) show that in the conditional knockout mutant, the ER is partially depleted, yet this does not lead to enhanced calcium entry, suggesting that the depletion alone is not sufficient to trigger increased influx.

      There is no experimental evidence supporting the regulation of calcium entry by store depletion in Toxoplasma (PMID: 24867952). We revised the text to clarify this point and expanded the discussion on store-operated calcium entry (SOCE). While it is possible that a channel similar to Orai exists in Toxoplasma, it is highly unlikely to be regulated by store depletion, as there is no gene homologous to STIM. If store-regulated calcium entry does occur in Toxoplasma, it is likely mediated through a different, still unidentified, mechanism. Lines 461-467.

      (5) The choice of adding Ca2+ first followed by TG is curious as it is more difficult to interpret. Would be more informative to add TG, allow the leak to complete, and then add Ca2+ which would allow temporal separation between Ca2+ release from stores and Ca2+ influx from the extracellular space. Was this experiment done? If not would be useful to have the data.

      Yes, this experiment was already published: PMID: 24867952 and PMID: 38382669.

      It mainly highlighted that increased cytosolic calcium may regulate calcium entry most likely through a TRP channel. See our response to point 4 and the description of the new Fig. S2 in the response to point 7.

      (6) Line 136-39: these experiments as designed - partly because of the issues discussed above - do not address the ability of organelles to access extracellular Ca2+ or the state of refilling of intracellular Ca2+ stores. They can simply be interpreted as the different agents (TG, Nig, GPN, CCCP) inducing various levels of Ca2+ influx.

      Concerning TG, the experiment shown in Fig. 4D shows that depletion of the ER calcium does not result in stimulation of calcium entry, indicating the absence of classical SOCE activation in Toxoplasma.

      To our knowledge, neither mitochondria nor lysosomes (or other acidic compartments) are capable of triggering classical SOCE in mammalian cells.

      Given that the ER in Toxoplasma lacks the canonical components required to initiate SOCE, it is unclear why the mitochondria or acidic compartments would be able to do so. While it is possible that T. gondii utilizes an alternative mechanism for store-operated calcium entry, investigating such a pathway would require a comprehensive study. In mammalian systems, it took almost 15 years and the efforts of multiple research groups to identify the molecular components of SOCE. Expecting this complex question to be resolved within the scope of a single study is unrealistic.

      Our current data show that the mitochondrion is unable to access calcium from the cytosol, as shown in Figure 5E. Performing a similar experiment for the PLVAC would be ideal; however, expression of fluorescent calcium indicators in this organelle has not been successful. This is likely due to the presence of several proteases that degrade expressed proteins, as well as the acidic environment, which quenches fluorescence. These challenges have made studying calcium dynamics in the PLVAC particularly difficult.

      To address the reviewer’s comment, we performed an additional experiment presented in Fig. S2A. In this experiment, we first inhibited SERCA with thapsigargin (TG), preventing calcium uptake into the ER, and subsequently added calcium to the suspension. Under these conditions, calcium cannot be sequestered by the ER. We then applied GPN and quantified the response, comparing it to a similar experimental condition without TG. Indeed, under these conditions, we observed a significant but modest increase in the GPN-induced response, suggesting that the PLVAC may be capable of directly taking up calcium from the cytosol. However, this occurs under conditions of SERCA inhibition which creates nonphysiological conditions with elevated cytosolic calcium levels and the presence of TG may promote additional ER leakage, both of which could artificially enhance PLVAC uptake. Under physiological conditions, with functional SERCA activity, the ER would likely sequester cytosolic calcium more efficiently, thereby limiting calcium availability for PLVAC direct uptake. Thus, while the result is intriguing, it may not reflect calcium handling under normal cellular conditions. See lines 172-178.

      (7) Figure 1H-I: I disagree with the authors' interpretation of the results (lines 144-153). The data argue that by blocking ER Ca2+ uptake by TG, other organelles take up Ca2+ from the cytosol where it accumulates due to the leak and Ca2+ influx as is evident from the data allowing more release. The data does not argue for ER Ca2+ tunneling to other organelles. Tunneling would be reduced in the presence of TG (see PMID: 30046136, 24867608).

      We partially agree with this concern. In our experiments, TG was used to inhibit SERCA and block calcium uptake into the ER, allowing calcium to leak into the cytosol. We propose that this leaked calcium is subsequently taken up by other intracellular compartments. This effect is observed immediately upon TG addition. However, pre-incubation with TG or knockdown of SERCA reduces calcium storage in the ER, thereby diminishing the transfer of calcium to other stores.

      To further support our claim, we performed additional experiments in the absence of extracellular calcium, now presented in Figure 1J-K. We observed that calcium release triggered by GPN or nigericin was significantly enhanced when both agents were added after TG. These results suggest that calcium initially released from the ER can be sequestered by other compartments. As mentioned, we deleted any mention of “tunneling,” but we believe the data support the occurrence of calcium transfer. New results described in lines 166-171.

      The experiment in Fig S2A described in the response to (6) also addresses this concern. Under physiological conditions with functional SERCA, cytosolic calcium would likely be rapidly sequestered by the ER, limiting its availability to other compartments. See lines 172-178.

      (8) Line 175: SERCA-dependent Ca2+ uptake is higher at 880 nM as would be expected yet the authors state that it's optimal at 220 nM Ca2+ ?

      Yes, it is true that the SERCA-dependent Ca<sup>2+</sup> uptake rate is higher at elevated Ca<sup>2+</sup> concentrations. We chose to use 220 nM free calcium for several reasons: 1) this concentration is close to physiological cytosolic levels; 2) it is commonly used in studies of mammalian SERCA; and 3) calcium uptake is readily detectable at this level. While this may not represent the conditions for maximal SERCA activity, we believe it is a reasonable and physiologically relevant choice for assessing SERCA-dependent calcium transport activity. We added one sentence to the Results explaining this reasoning (lines 204-207) and we deleted the word "optimal".

      (9) Figure 3H: the saponin egress data support the conclusion that organelles take up cytosolic Ca2+ directly without the need for ER tunneling.

      The saponin concentration used permeabilizes the host cell membrane, exposing the intracellular tachyzoite to the higher extracellular calcium concentration. It does not affect the tachyzoite membrane, as the parasite is still moving and calcium oscillations were clearly seen under similar conditions (PMID: 26374900). The resulting calcium increase in the tachyzoite cytosol is what stimulates parasite motility and egress. Since SERCA activity is reduced in the mutant, cytosolic calcium accumulates more rapidly, reaching the threshold for egress sooner and thereby accelerating parasite exit. The result does not support a contribution of the other stores, because the ionomycin response shows that egress is diminished in the mutant, likely because the calcium stores are depleted. We added an explanation in the Results, lines 262-269, and the Discussion, lines 532-539.

      (10) Figure S2: the HA and SERCA signals do not match perfectly? Could this reflect issues with HA tagging, potentially off-target effects? Was this tested?

      These are not off-target effects, as we did not observe them in the control cells lacking HA tagging. The HA signal also disappeared after treatment with ATc, further confirming that the IFA signal is specific. We agree with the reviewer that the signals do not align perfectly. This discrepancy could be due to differences in antibody accessibility or the fact that the two antibodies recognize different regions of the protein. We added a sentence about this in the Results, lines 240-243.

      Reviewer #2 (Recommendations for the authors):

      The description of the data of Figures 1B and S1A starting on line 108 would be easier to follow if Figure S1A was actually incorporated into Figure 1. It is not clear why these two complementary experiments were separated since they are both equally important in understanding and interpreting the data.

      We re-arranged figure 1 and incorporated S1A now as Fig 1C.

      As noted in the public comments, loading of fura2/AM can result in compartmentalized fura2, which can contaminate the cytosolic calcium measurements and might modify free calcium levels and calcium storage capacity in intracellular organelles. This can be assessed using the digitonin permeabilization method used in the MagFluo4 measurements, but in this case, detecting the fura2 signal remaining after cell permeabilization.

      As suggested by the reviewer, we measured Fura-2 compartmentalization by permeabilizing cells with digitonin, as we do for Mag-Fluo-4. The fluorescence was reduced almost completely and was unresponsive to any additions (see Author response image 1).

      Author response image 1.

      T. gondii tachyzoites in suspension exposed to thapsigargin, calcium and GPN. The dashed line shows an experiment using the same conditions, but with parasites permeabilized with digitonin to release the cytosolic Fura-2. Part B shows a similar experiment with parasites exposed to MgATP.

      Following the public comment regarding the residual calcium mobilization response to Zaprinast observed after 24 h ATc knockdown of SERCA (Figures 4E, 4F, as explained in the legend to Figure 4), was there still a response to Zaprinast after 48 h knockdown, where the thapsigargin response was apparently fully ablated?

      Unfortunately, we were unable to perform this experiment as it is not possible to obtain sufficient cells at 48 h with ATc. Due to the essential role of TgSERCA, parasites are unable to replicate after 24 h.

      As noted in the public comments, the data in Figure 4A vs 4G and Figure 4B vs 4H appear to show that the calcium responses to GPN are similar to that with thapsigargin, which seems unexpected if the acidic compartment is loaded from the ER. The results with GPN addition after thapsigargin (Figure 1H) argue against this, but the authors should still cite the work of Atakpa et al.

      We think that the reviewer is concerned that GPN may also be acting on the ER. This is a possibility that we considered, and we have now included the suggested citation (line 457). However, we believe that it is difficult to directly compare the responses, as the kinetics of calcium release from the ER may differ from those of release from the PLVAC. This could be due to differences in the calcium buffering capacity between the two compartments. Additionally, it is possible that calcium leaked from the ER is more efficiently sequestered by other stores or extruded through the plasma membrane than calcium released from the PLVAC. In addition, GPN is known to have a more disruptive effect on membranes than TG, which may also influence the responses. As noted by the reviewer, Figure 1H also supports the idea that the acidic compartment is loaded from the ER.

      The abbreviation for the plant-like vacuolar compartment (PLVAC) only appears in a figure legend but should be defined in the main text on first use.

      Corrected, lines 140-143.

      The authors should cite the previous study of Borges-Pereira et al., 2020 (PMID: 32848018) that also demonstrates the incomplete overlap of the calcium pools mobilized by thapsigargin and CPA in P. falciparum. The ability to measure calcium in intracellular stores using MagFluo4 opens the possibility to further investigate this discrepancy between CPA and thapsigargin, but CPA does not appear to have been used in the permeabilized cell experiments with MagFluo4. I would suggest that this could be added to Figure 2 and/or Figure 4, or at least as a supplementary figure.

      In response to this reviewer’s critique we performed additional experiments with Mag-Fluo-4 loaded parasites. These are presented in the new Figure S3. We added CPA and TG, alone and in combination, to inhibit SERCA and to allow calcium leak from the loaded organelle. Under these conditions, we observed a very similar leak rate after the addition of the inhibitors, as measured by the slope of Ca<sup>2+</sup> leak. We believe that the leak rate is most likely determined by the intrinsic ER mechanism. See the discussion of this result in lines 436-442 and the previous response to the same reviewer comment.

      Reviewer #3 (Recommendations for the authors):

      Suggestions for improved or additional experiments, data, or analyses

      (1) Figure 1A is not mentioned in the main text even though it is discussed.

      Corrected

      (2) Figure 1G: Values do not match, how can GPN be so high?

      These figures were replaced by new traces and individual quantification analyses for each experiment.

      (3) Figure 1H and I: Is this type of data/results also available for the mitochondrion?

      Unfortunately, we were not able to include this experiment because we were unable to accurately quantify the mitochondrial calcium release. Instead, we used mitochondrial GECIs and the results are shown in Figure 5 to study mitochondrial calcium uptake.

      (4) Figure 1H: where does the calcium go after GPN addition? Taken up by another calcium store?

      Most likely calcium is extruded through the plasma membrane by the activity of the Calcium ATPase TgA1.

      However, the reviewer’s suggestion is also possible, and calcium could be taken up by another store such as the mitochondrion. In this regard, we did observe a large mitochondrial calcium increase (parasites expressing SOD2-GCaMP6) after adding GPN (Fig 5I), suggesting that the mitochondrion may take up calcium from the organelle targeted by GPN. However, the calcium affinity of the mitochondrion is very low, so the concentration of calcium needs to be very high to activate uptake, and these concentrations are most likely achieved at the microdomains formed between the mitochondrion and other organelles.

      (5) Figure 2B-C: Further explanation of why these particular values were chosen for the follow-up experiments would be helpful for the reader.

      We tested a wide range of MgATP and free calcium concentrations to measure ER Ca<sup>2+</sup> uptake catalyzed by TgSERCA. The concentrations shown fall within the linear range.

      We followed the free calcium concentrations used in studies of mammalian SERCA (https://doi.org/10.1016/j.ceca.2020.102188). That protocol used 220 nM free calcium, which is close to cytosolic Ca<sup>2+</sup> levels. TgSERCA can take up calcium efficiently at this concentration, as shown in Fig 2. We used less MgATP than the mammalian cell protocols, since we did not observe a significant increase in SERCA activity beyond 0.5 mM MgATP. We added one more sentence explaining this in the Results, lines 204-207.

      (6) Figure 3E: Revise the error bar? (and note that colours do not match the graph legend).

      The colors do match; the difficulty in visualizing them arises because vacuoles containing a single parasite are virtually absent in the control group without ATc treatment.

      (7) Figure 3H: 'Interestingly, when testing egress after the addition of saponin in the presence of extracellular Ca2+, we observed that the tachyzoites egressed sooner (Figure 3H, saponin egress).' This is the only graph showing egress timing, and thus it is not clear what the comparison is. Egress here is sooner compared to what condition? Egress in the absence of Ca2+? This requires clarification and might require the control data to be added.

      In the saponin experiment we compare time to egress of the mutant grown with or without ATc. The measurement is for time to egress after adding saponin. This experiment is in the presence of extracellular calcium. The protocol was previously used to measure time to egress: PMID: 40043955, PMID: 38382669, PMID: 26374900. See also response to question 9 of reviewer 1.

      (8) Figure 4C: There is a small peak appearing right after TG addition this should be discussed and explained.

      This trace was generated in a different fluorometer, the F-4000. The peak was an artifact caused by a jump in the signal when adding TG. Multiple repeats of the same experiment in the newer F-7000 did not show the peak. We now mention in the Materials and Methods that the F-4000 fluorometer was used for some experiments. We apologize for the omission. Lines 609-610.

      (9) Figure 5A: An important control that is missing is co-localisation with a mitochondrial marker.

      The expression of the SOD2-GCaMP6 has been characterized: PMID: 31758454

      (10) Figure 5H: This line was made for this study however the line genetic verification is missing.

      In response to this concern we now include a new Figure S5 showing the fluorescence of GCaMP6 in the mitochondrion of the iDTgSERCA mutant (Fig. S5A). We include several parasites. In addition, we show fluorescence measurements after addition of calcium, in which the cells are unresponsive, indicating that the indicator is not in the cytosol. Lines 650-651 and 344-348.

      (11) Figure 6D: since the membranes are hard to see, it is not clear whether the arrows show structures that are in line with the definition of membrane contact sites. The authors should provide an in-depth analysis of the length of the interaction between the membranes where the distance is less than 30 nm, and discuss how many structures corresponding to the definition were analysed.

      All the requested details are now included in the legend to Figure S3.

      Minor corrections to the text and figures

      (1) Unify statistical labelling throughout the paper replacing *** with p values.

      Corrected. We replaced the *** with the actual p values in some figures. For Figure 2 and Fig S1, we still use the *** due to space limitations.

      (2) Unify ATC vs ATc throughout the paper.

      Corrected

      (3) Unify capitalization of line name (iΔTgserca/i ΔTgSERCA) throughout the paper.

      Corrected

      (4) Unify capitalization of p value (p/P) throughout the paper.

      Corrected in figures

      (5) Unify Fig X vs Fig. X throughout the text.

      Corrected

      (6) Add values of scale bars to legends (eg Figure S2).

      Corrected

      (7) What is the time point for the data in Figures 4E-H, 5H, and S3? 24hrs? include in the legend.

      Added 24 h to the legends. Fig S3 is now S4.

      (8) Figure 3F: The second graph is NS thus perhaps no need for the p-value?

      Corrected

      (9) Figure 3G: Worth considering swapping the two around: first attachment and then invasion?

      Corrected. Invasion and attachment bars were swapped.

      (10) Figure 4A/B: Wrong colour match for Figure 4B.

      Corrected

      (11) Figure 4F: In the main text, the authors reference to Figure 1F, correct to 4F.

      Corrected

      (12) Figure 4H: In the main text, authors reference to Figure 1H, correct to 4H.

      Corrected

    1. Author response:

      The following is the authors’ response to the original reviews.

      Public Reviews:

      Reviewer #1 (Public Review):

      Summary: The authors of this study sought to define a role for IgM in responses to house dust mites in the lung.

      Strengths:

      Unexpected observation about IgM biology

      Combination of experiments to elucidate function

      Weaknesses:

      Would love more connection to human disease

      We thank the reviewer for these comments. At the time of this publication, we have not made a concrete link with human disease. There is, however, some anecdotal evidence of diseases such as autoimmune glomerulonephritis, Hashimoto’s thyroiditis, bronchial polyp, SLE, celiac disease and other diseases in people with low IgM. Allergic disorders are also common in people with IgM deficiency; some studies have reported rates as high as 33-47%. The mechanisms for the high incidence of allergic diseases are unclear as, generally, these patients have normal IgG and IgE levels. IgM deficiency may represent a heterogeneous spectrum of genetic defects, which might explain the heterogeneous nature of disease presentations.

      Reviewer #2 (Public Review):

      Summary:

      The manuscript by Hadebe and colleagues describes a striking reduction in airway hyperresponsiveness in IgM-deficient mice in response to HDM, OVA and papain across the B6 and BALB/c backgrounds. The authors suggest that the deficit is not due to improper type 2 immune responses, nor an aberrant B cell response, despite a lack of class switching in these mice. Through RNA-Seq approaches, the authors identify few differences between the lungs of WT and IgM-deficient mice, but see that two genes involved in actin regulation are greatly reduced in IgM-deficient mice. The authors target these genes by CRISPR-Cas9 in in vitro assays of smooth muscle cells to show that these may regulate cell contraction. While the study is conceptually interesting, there are a number of limitations, which stop us from drawing meaningful conclusions.

      Strengths:

      Fig. 1. The authors clearly show that IgMKO mice have strikingly reduced AHR in the HDM model, despite the presence of a good cellular B cell response.

      Weaknesses:

      Fig. 2. The authors characterize the CD4 T cell response to HDM in IgMKO mice. They have restimulated medLN cells with anti-CD3 for 5 days to look for IL-4 and IL-13, and find no discernible difference between WT and KO mice. The absence of PBS-treated WT and KO mice in this analysis means it is unclear if HDM-challenged mice are showing IL-4 or IL-13 levels above that seen at baseline in this assay.

      We thank the Reviewer for this comment. We would like to mention that only a very minimal level of IL-4 and IL-13 was detected in PBS mice. We have added a dotted line to the figure to indicate cytokine levels in the unstimulated or naïve condition. Please see Author response image 1 below, from anti-CD3 stimulated cytokine ELISA data. The levels of these cytokines are very low and are not changed between WT and IgM<sup>-/-</sup> mice; this is also true for PMA/ionomycin-stimulated cells.

      Author response image 1.

      The choice of 5 days is strange, given that the response the authors want to see is in already primed cells. A 1-2 day assay would have been better.

      We agree with the reviewer that a shorter stimulation period would work. Over the years we have settled on a 5-day re-stimulation for both anti-CD3 and HDM. We have tried other time points, but we consistently get better secretion of cytokines after 5 days.

      It is concerning that the authors state that HDM restimulation did not induce cytokine production from medLN cells, since countless studies have shown that restimulation of medLN would induce IL-13, IL-5 and IL-10 production from medLN. This indicates that the sensitization and challenge model used by the authors is not working as it should.

      We thank the reviewer for this observation. In our recent paper showing how antigen load affects B cell function, we used very low levels of HDM to sensitise and challenge mice (1 µg and 3 µg, respectively). See the article below, Hadebe et al., 2021 JACI. This is because labs that have used these low HDM levels also suggested that antigen load impacts B cell function, especially their role in germinal centres. We believe the reason we see low or undetectable levels of cytokines is this low antigen load for sensitisation and challenge. In other manuscripts we have published or are about to publish, we have shown that a normal HDM sensitisation load (1 µg or 100 µg) and challenge (10 µg) do induce cytokine release upon restimulation with HDM. See the article below by Khumalo et al., 2020 JCI Insight (Figure 4A).

      Sabelo Hadebe, Jermaine Khumalo, Sandisiwe Mangali, Nontobeko Mthembu, Hlumani Ndlovu, Amkele Ngomti, Martyna Scibiorek, Frank Kirstein, Frank Brombacher. Deletion of IL-4Ra signalling on B cells limits hyperresponsiveness depending on antigen load. doi.org/10.1016/j.jaci.2020.12.635

      Jermaine Khumalo, Frank Kirstein, Sabelo Hadebe, Frank Brombacher. IL-4Rα signalling in regulatory T cells is required for dampening allergic airway inflammation through inhibition of IL-33 by type 2 innate lymphoid cells. JCI Insight. 2020 Oct 15;5(20):e136206. doi: 10.1172/jci.insight.136206

      The IL-13 staining shown in panel c is also not definitive. One should be able to optimize their assays to achieve a better level of staining, to my mind.

      We agree with the reviewer that a much higher proportion of IL-13-producing CD4 T cells should be observed. We don’t think this is a technical glitch or a non-optimal set-up, as we see much higher levels of IL-13-producing CD4 T cells when using higher doses of HDM to sensitise and challenge, between 7-20% in WT mice (see Author response image 2, lung stimulated with PMA/ionomycin+Monensin; please note this is for illustration purposes only and is not linked to the current manuscript, it is merely to demonstrate a point from other experiments we have conducted in the lab).

      Author response image 2.

      In d-f, the authors perform a serum transfer, but they only do this once. The half life of IgM is quite short. The authors should perform multiple naïve serum transfers to see if this is enough to induce FULL AHR.

      We thank the reviewer for this comment. We apologise if this was not clear enough in the figure legend and methods. We did transfer serum 3x: a day before sensitisation, on the day of sensitisation and a day before the challenge, to circumvent the short half-life of IgM. In our subsequent experiments, we have now used busulfan to deplete all bone marrow in IgM-deficient mice and replace it with WT bone marrow, and this method restores AHR (Figure 3).

      This now appears in lines 165 to 169 and reads:

      “Adoptive transfer of naïve serum

      Naïve wild-type mice were euthanised and blood was collected via cardiac puncture before being spun down (5500 rpm, 10 min, RT) to collect serum. Serum (200 µL) was injected intraperitoneally into IgM-deficient mice. Serum was injected intraperitoneally at day -1, 0, and a day before the challenge with HDM (day 10).”

      The presence of negative values of total IgE in panel F would indicate some errors in calculation of serum IgE concentrations.

      We thank the reviewer for this observation. For better clarity, we have now indicated these values as undetected in Figure , as they were below our detection limit.

      Overall, it is hard to be convinced that IgM-deficiency does not lead to a reduction in Th2 inflammation, since the assays appear suboptimal.

      We disagree with the reviewer in this instance, because we have shown in 3 different models, in 2 different strains and at 2 doses of HDM (high and low) that Th2 remains intact in every case. Our reason for choosing low dose HDM was based on our previous work and that of others, which showed that, depending on antigen load, B cells can either be redundant or have functional roles. Since our interest was to tease out the role of B cells, and specifically IgM, it was important that we look at a scenario where B cells are known to have a function (low antigen load). We obtained similar findings at the high dose of HDM: the effects on AHR were not as strong, but Th2 was not changed, and in some instances Th2 was in fact higher in IgM-deficient mice.

      Fig. 3. Gene expression differences between WT and KO mice in PBS and HDM challenged settings are shown. PCA analysis does not show clear differences between all four groups, but genes are certainly up and downregulated, in particular when comparing PBS to HDM challenged mice. In both PBS and HDM challenged settings, three genes stand out as being upregulated in WT v KO mice. These are Baiap2l1, Erdr1 and Chil1.

      Noted

      Fig. 4. The authors attempt to quantify BAIAP2L1 in mouse lungs. It is difficult to know if the antibody used really detects the correct protein. A BAIAP2L1-KO is not used as a control for staining, and I am not sure if competitive assays for BAIAP2L1 can be set up. The flow data is not convincing. The immunohistochemistry shows BAIAP2L1 (in red) in many, many cells, essentially throughout the section. There is also no discernible difference between WT and KO mice, which one might have expected based on the RNA-Seq data. So, from my perspective, it is hard to say if/where this protein is located, and whether there truly exists a difference in expression between wt and ko mice.

      We thank the reviewer for this comment. We are certain that the antibody does detect BAIAP2L1. We have used it in 3 assays, which we admit may show varying specificities since it is a polyclonal antibody. However, in our western blot, the antibody detects 1 band at 56.7 kDa and no other bands, apart from what we think are isoforms. We agree that BAIAP2L1 is expressed by many cell types, including CD45+ cells and alpha smooth muscle negative cells, and we show this in our Supplementary Figure 9. Where we think there is a difference in expression between WT and IgM-deficient mice is in alpha-smooth muscle-positive cells. We have tested antibodies from different companies, and we obtain similar results. We do not have access to BAIAP2L1 KO mice. To test specificity, we have also used single stain controls with or without secondary antibody and an isotype control, which show no binding in western blot and immunofluorescence assays, and fluorescence minus one controls in flow cytometry, so we are convinced that the signal we are seeing is specific to BAIAP2L1.

      Fig. 5 and 6. The authors use a single cell contractility assay to measure whether BAIAP2L1 and ERDR1 impact on bronchial smooth muscle cell contractility. I am not familiar with the assay, but it looks like an interesting way of analysing contractility at the single cell level.

      The authors state that targeting these two genes with Cas9gRNA reduces smooth muscle cell contractility, and the data presented for contractility supports this observation. However, the efficiency of Cas9-mediated deletion is very unclear. The authors present a PCR in supp fig 9c as evidence of gene deletion, but it is entirely unclear with what efficiency the gene has been deleted. One should use sequencing to confirm deletion. Moreover, if the antibody was truly working, one should be able to use the antibody used in Fig 4 to detect BAIAP2L1 levels in these cells. The authors do not appear to have tried this.

      We thank the reviewer for these observations. We are in the process of optimising this using new polyclonal BAIAP2L1 antibodies from other companies, since the one we have tried doesn’t seem to work well on human cells via western blot. We hope that in our new version we will be able to demonstrate this by immunofluorescence or western blot.

      Other impressions:

      The paper is lacking a link between the deficiency of IgM and the effects on smooth muscle cell contraction.

      The levels of IL-13 and TNF in lavage of WT and IGMKO mice could be analysed.

      We have measured the Th2 cytokine IL-13 in BAL fluid and found no differences between IgM-deficient mice and WT mice challenged with HDM (Author response image 3). We could not detect TNF-alpha in the BAL fluid; it was below the detection limit.

      Author response image 3.

      IL-13 levels are not changed in IgM-deficient mice in the lung. Bronchoalveolar lavage fluid in WT or IgM-deficient mice sensitised and challenged with HDM. TNF-alpha levels were below the detection limit.

      Moreover, what is the impact of IgM itself on smooth muscle cells? In the Fig. 7 schematic, are the authors proposing a direct role for IgM on smooth muscle cells? Does IgM in cell culture media induce contraction of SMC? This could be tested and would be interesting, to my mind.

      We thank the Reviewer for these comments. We are still trying to test this. Unfortunately, we have experienced delays in getting reagents such as human IgM to South Africa. We hope that we will be able to add this in subsequent versions of the article. We agree it is an interesting experiment to do, if not for this manuscript then for our general understanding of this interaction, at least in an in vitro system.

      Reviewer #3 (Public Review):

      Summary:

      This paper by Hadebe et al. describes a new pathway by which lack of IgM in the mouse lowers bronchial hyperresponsiveness (BHR) in response to methacholine in several mouse models of allergic airway inflammation in Balb/c mice and C57/Bl6 mice. Strikingly, loss of IgM does not lead to less eosinophilic airway inflammation, Th2 cytokine production or mucus metaplasia, but to a selective loss of BHR. This occurs irrespective of the dose of allergen used. This was important to address since several prior models of HDM allergy have shown that the contribution of B cells to airway inflammation and BHR is dose dependent.

      After a description of the phenotype, the authors try to elucidate the mechanisms. There is no loss of B cells in these mice. However, there is a lack of class switching to IgE and IgG1, with a concomitant increase in IgD. Restoring immunoglobulins with transfer of naïve serum in IgM-deficient mice leads to restoration of allergen-specific IgE and IgG1 responses, although the paper does not really explain how this might work. There is also no restoration of IgM responses and, concomitantly, the phenotype of reduced BHR still holds when serum is given, leading the authors to conclude that the mechanism is IgE and IgG1 independent. Wild type B cell transfer also does not restore IgM responses, due to lack of engraftment of the B cells. Next, the authors perform whole lung RNA sequencing and pinpoint reduced BAIAP2L1 mRNA as the culprit of the phenotype of IgM<sup>-/-</sup> mice. However, this cannot be validated fully on protein levels and immunohistology, since differences between WT and IgM KO are not statistically significant, and B cell and IgM restoration are impossible. The histology and flow cytometry seem to suggest that expression is mainly found in alpha smooth muscle positive cells, which could still be smooth muscle cells or myofibroblasts. The authors therefore move to CRISPR knock down of BAIAP2L1 in a human smooth muscle cell line, and show that loss leads to less contraction of these cells in vitro in a microscopic FLECS assay, in which smooth muscle cells bind to elastomeric contractible surfaces.

      Strengths:

      (1) There is a strong reduction in BHR in IgM-deficient mice, without alterations in B cell number, disconnected from effects on eosinophilia or Th2 cytokine production

      (2) BAIAP2L1 has never been linked to asthma in mice or humans

      Weaknesses:

      (1) While the observations of reduced BHR in IgM deficient mice are strong, there is insufficient mechanistic underpinning on how loss of IgM could lead to reduced expression of BAIAP2L1. Since it is impossible to restore IgM levels by either serum or B cell transfer and since protein levels of BAIAP2L1 are not significantly reduced, there is a lack of a causal relationship that this is the explanation for the lack of BHR in IgM-deficient mice. The reader is unclear if there is a fundamental (maybe developmental) difference in non-hematopoietic cells in these IgM-deficient mice (which might have accumulated another genetic mutation over the years). In this regard, it would be important to know if littermates were newly generated, or historically bred along with the KO line.

      We thank the reviewer for this question, which prompted us to think about the problem in a different way and to try a different method of restoring IgM function. Since our animal facility no longer allows irradiation, we opted for busulfan. We present these results as new data in Figure 3. We had to go back and breed this strain and then generate bone marrow chimeras. With these chimeras, we now show that we can deplete bone marrow from IgM-deficient mice and replace it with congenic WT bone marrow when we allow the mice to rest for 2 months before challenge with HDM (new Supplementary Figure 6 a-c). We also show that AHR (resistance and elastance) is partially restored in this way (Figure 3 a and b): mice that receive congenic WT bone marrow after chemical irradiation can mount AHR, whereas those that receive IgM-deficient bone marrow cannot mount AHR upon challenge with HDM. If the mice had accumulated an unknown genetic mutation in non-hematopoietic cells, the transfer of WT bone marrow would not make a difference. We therefore do not believe that the colony has gained a mutation that we are unaware of. We have also shipped these mice to other groups, and in their hands this strain still behaves only as an IgM knockout. See their publication below.

      Mark Noviski, James L Mueller, Anne Satterthwaite, Lee Ann Garrett-Sinha, Frank Brombacher, Julie Zikherman. 2018. IgM and IgD B cell receptors differentially respond to endogenous antigens and control B cell fate. eLife 7:e35074. DOI: https://doi.org/10.7554/eLife.35074

      We have also added methods for bone marrow chimaeras, together with results sections and new figures related to these methods.

      Methods (line 171-182).

      “Busulfan Bone marrow chimeras

      WT (CD45.2) and IgM<sup>-/-</sup> (CD45.2) congenic mice were treated with 25 mg/kg busulfan (Sigma-Aldrich, Aston Manor, South Africa) per day for 3 consecutive days (75 mg/kg in total), dissolved in 10% DMSO and phosphate-buffered saline (0.2 mL, intraperitoneally), to ablate bone marrow cells. Twenty-four hours after the last administration of busulfan, mice were injected intravenously with fresh bone marrow (10x10<sup>6</sup> cells, 100mL) isolated from hind leg femurs of either WT (CD45.1) or IgM<sup>-/-</sup> mice(33). Animals were then allowed to complement their haematopoietic cells for 8 weeks. In some experiments, the level of bone marrow ablation was assessed 4 days post-busulfan treatment in mice that did not receive donor cells. At the end of the experiment, the level of complemented cells was also assessed in WT and IgM<sup>-/-</sup> mice that received WT (CD45.1) bone marrow.”
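
      The dosing arithmetic in the quoted Methods can be checked with a short sketch. The 25 mg/kg daily dose, the 3 treatment days and the 0.2 mL intraperitoneal volume come from the text; the 20 g body weight and the reading that 0.2 mL is the volume of each daily injection are assumptions made only for illustration.

```python
# Values taken from the quoted Methods.
DAILY_DOSE_MG_PER_KG = 25.0
TREATMENT_DAYS = 3
# Assumption: 0.2 mL is the volume of each daily intraperitoneal injection.
INJECTION_VOLUME_ML = 0.2


def cumulative_dose_mg_per_kg() -> float:
    """Total busulfan dose per kg of body weight over the whole course."""
    return DAILY_DOSE_MG_PER_KG * TREATMENT_DAYS


def daily_dose_mg(body_weight_g: float) -> float:
    """Absolute daily dose in mg for one mouse of the given weight in grams."""
    return DAILY_DOSE_MG_PER_KG * body_weight_g / 1000.0


def injection_concentration_mg_per_ml(body_weight_g: float) -> float:
    """Concentration needed to deliver the daily dose in one injection."""
    return daily_dose_mg(body_weight_g) / INJECTION_VOLUME_ML


if __name__ == "__main__":
    weight_g = 20.0  # hypothetical mouse weight, not stated in the source
    print(cumulative_dose_mg_per_kg())
    print(daily_dose_mg(weight_g))
    print(injection_concentration_mg_per_ml(weight_g))
```

      Under these assumptions the cumulative dose matches the 75 mg/kg stated in the Methods.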

      Results (line 491-521)

      “Reconstitution of IgM-deficient mice with functional hematopoietic cells in busulfan chimeric mice restores airway hyperresponsiveness.

      We then generated bone marrow chimeras by chemical irradiation using busulfan(33). We treated mice with busulfan once daily for 3 consecutive days and, 24 hrs after the last dose, transferred naïve bone marrow from congenic CD45.1 WT mice or CD45.2 IgM<sup>-/-</sup> mice (Fig. 3a and Supplementary Fig. 5a). We showed that recipient mice that did not receive donor bone marrow had, at 4 days post-treatment, significantly reduced lineage markers (CD45+Sca-1+) or lineage negative (Lin-) cells in the bone marrow when compared to untreated or vehicle (10% DMSO) treated mice (Supplementary Figure 5b-c). We allowed mice to reconstitute bone marrow for 8 weeks before sensitisation and challenge with low dose HDM (Figure 3a). We showed that WT (CD45.2) recipient mice that received WT (CD45.1) donor bone marrow had higher airway resistance and elastance, and this was comparable to IgM<sup>-/-</sup> (CD45.2) recipient mice that received donor WT (CD45.1) bone marrow (Figure 3b). As expected, IgM<sup>-/-</sup> (CD45.2) recipient mice that received donor IgM<sup>-/-</sup> (CD45.2) bone marrow had significantly lower AHR compared to WT (CD45.2) or IgM<sup>-/-</sup> (CD45.2) recipient mice that received WT (CD45.1) bone marrow (Figure 3b). We confirmed that the differences observed were not due to differences in bone marrow reconstitution, as we saw similar frequencies of CD45.1 cells within the lymphocyte populations in the lungs and other tissues (Supplementary Fig. 5d). We observed no significant changes in the lung neutrophils, eosinophils, inflammatory macrophages, CD4 T cells or B cells in WT or IgM<sup>-/-</sup> (CD45.2) recipient mice that received donor WT (CD45.1/CD45.2) or IgM<sup>-/-</sup> (CD45.2) bone marrow when sensitised and challenged with low dose HDM (Fig. 3c).

      Restoring IgM function through adoptive reconstitution with congenic CD45.1 bone marrow in non-chemically irradiated recipient mice or sorted B cells into IgM<sup>-/-</sup> mice (Supplementary Fig.  6a) did not replenish IgM B cells to levels observed in WT mice and as a result did not restore AHR, total IgE and IgM in these mice (Supplementary Fig.  6b-c).”

      The two new figures are Figure 3 and Supplementary Figure 5, each of which moves the subsequent figures (main and supplementary, respectively) down.

      Discussion appears in line 757-766 of the untracked version of the article.

      To address other endogenous factors that could have potentially influenced reduced AHR in IgM-deficient mice, we resorted to busulfan chemical irradiation to deplete bone marrow cells in IgM-deficient mice and replace them with WT bone marrow. While it is well accepted that busulfan chemical irradiation partially depletes bone marrow cells, in our case it was not possible to pursue other irradiation methods due to changes in ethical regulations and the fact that mice are slow to recover after gamma-ray irradiation. Busulfan chemical irradiation allowed us to show that we could mostly restore AHR in IgM-deficient recipient mice that received donor WT bone marrow when challenged with low dose HDM.

      (2) There is no mention of the potential role of complement in activation of AHR, which might be altered in IgM-deficient mice 

      We thank the reviewer for this comment. We have not directly examined complement in this setting; however, in our previous work, C3-/- mice showed AHR comparable to that of WT mice under HDM challenge.

      (3) What is the contribution of elevated IgD in the phenotype of the IgM-deficient mice. It has been described by this group that IgD levels are clearly elevated

      We thank the reviewer for this question. We believe that IgD is essentially what drives partial class switching to IgG. We have shown, in the case of VSV and of Trypanosoma congolense and Trypanosoma brucei brucei, that elevated IgD drives delayed but effective IgG in the absence of IgM (Lutz et al, 2001, Nature). This is also supported by the Noviski studies, which show that IgM and IgD share some endogenous antigens, so it is likely that external antigens can activate IgD in a similar manner to prompt class switching.

      (4) How can transfer of naïve serum in class switching deficient IgM KO mice lead to restoration of allergen specific IgE and IgG1?

      We thank the reviewer for this comment. We believe that naïve sera transferred to IgM-deficient mice can bind to the surface of B cells via IgM receptors (FcμR / Fcα/μR), which are still present on B cells, and that this is sufficient to facilitate class switching. Our IgM<sup>-/-</sup> mouse lacks both membrane-bound and secreted IgM, and transferred serum contains at least secreted IgM, which can bind to surfaces via its Fc portion. We measured HDM-specific IgE and found very low levels, which did not differ between WT mice and IgM<sup>-/-</sup> mice adoptively transferred with WT serum. We also detected HDM-specific IgG1 in IgM<sup>-/-</sup> mice transferred with WT sera at the same level as in WT mice, consistent with possible class switching. Of course, we cannot rule out that the transferred sera also contain some IgG1, or that elevated IgD levels are partially responsible for class-switched IgG1, as discussed above.

      In the Discussion (lines 804-812), we also added the following:

      “We speculate that IgM can directly activate smooth muscle cells by binding a number of its surface receptors including FcμR, Fcα/μR and pIgR(52-54). IgM binds to FcμR strictly, but shares Fcα/μR and pIgR with IgA(5,52,54). Both Fcα/μR and pIgR can be expressed by non-structural cells at mucosal sites(54,55). We would not rule out that the mechanisms of muscle contraction might be through one of these IgM receptors, especially the ones expressed on smooth muscle cells(54,55). Certainly, our future studies will be directed towards characterizing the mechanism by which IgM potentially activates the smooth muscle.”

      We have discussed this point in the Discussion section (lines 731-757). In addition, since we have now performed bone marrow chimaeras, we have added further text to the Discussion (lines 757-766), which is quoted in full in our response to weakness (1) above.

      After performing the bone marrow chimaera experiments, we removed the following lines, since the new data changed some aspects:

      Our efforts to adoptively transfer wild-type bone marrow or sorted B cells into IgM-deficient mice were also largely unsuccessful partly due to poor engraftment of wild-type B cells into secondary lymphoid tissues. Natural secreted IgM is mainly produced by B1 cells in the peritoneal cavity, and it is likely that any transfer of B cells via bone marrow transfer would not be sufficient to restore soluble levels of IgM(3,10).

      (5) Alpha smooth muscle antigen is also expressed by myofibroblasts. This is insufficiently worked out. The histology mentions "expression in cells in close contact with smooth muscle". This needs more detail since it is a very vague term. Is it in smooth muscle or in myofibroblasts.

      Response: We appreciate that alpha-smooth muscle actin-positive cells are a small fraction of cells in the lung, even among CD45-negative cells, but their contribution to airway hyperresponsiveness is major. We also concede that, by immunofluorescence, BAIAP2L1 seems to be expressed by cells adjacent to alpha-smooth muscle actin-positive cells (Fig. 5b); however, structures close to smooth muscle (such as extracellular matrix and myofibroblasts) are known to contribute to its hypertrophy in allergic asthma.

      James AL, Elliot JG, Jones RL, Carroll ML, Mauad T, Bai TR, et al. Airway Smooth Muscle Hypertrophy and Hyperplasia in Asthma. Am J Respir Crit Care Med [Internet]. 2012;185:1058–64. Available from: https://doi.org/10.1164/rccm.201110-1849OC

      (6) Have polymorphisms in BAIAP2L1 ever been linked to human asthma?

      No. We have looked at asthma GWAS studies, at least at the summary statistics, and we have not seen any SNPs that could be associated with human asthma.

      (7) IgM deficient patients are at increased risk for asthma. This paper suggests the opposite. So the translational potential is unclear

      We thank the reviewer for these comments. At the time of this publication, we have not made a concrete link with human disease. There is some anecdotal evidence of diseases such as autoimmune glomerulonephritis, Hashimoto’s thyroiditis, bronchial polyp, SLE, celiac disease and other diseases in people with low IgM. Allergic disorders are also common in people with IgM deficiency, as the reviewer correctly points out; other studies have reported rates as high as 33-47%. The mechanisms for the high incidence of allergic diseases are unclear, as these patients generally have normal or higher IgG and IgE levels. IgM deficiency may represent a heterogeneous spectrum of genetic defects, which might explain the heterogeneous nature of disease presentations.

    1. Author response:

      The following is the authors’ response to the original reviews.

      We are thankful to the reviewers and the editor for their detailed feedback, insightful suggestions, and thoughtful assessment of our work. The revised manuscript has taken into account all the comments of the three reviewers. We have also undertaken additional analyses and added materials in response to reviewer suggestions. In brief:

      (1) We have conducted a more in-depth analysis of frequency domain HRV metrics to better depict the change of autonomic tone.

      (2) We have revised the manuscript to provide justifications for the chosen taVNS protocol and to clearly articulate the objectives of the current study.

      (3) In response to comments from reviewer #2, we have included two new tables that present the absolute changes in cardiovascular metrics, clinical characteristics for the two trial arms, and effects of taVNS adjusted for age.

      Other significant amendments include:

      (1) An expanded discussion linking our findings to the existing literature on the effects of taVNS on cardiovascular function, biomarkers for taVNS response, the safety of taVNS, and the dose-response relationship of taVNS.

      (2) Revision to the Method section to provide details of QT interval calculation.

      Reviewer #1 (Public Review):

      The authors report the results of a randomized clinical trial of taVNS as a neuromodulation technique in SAH patients. They found that taVNS appears to be safe without inducing bradycardia or QT prolongation. taVNS also increased parasympathetic activity, as assessed by heart rate variability measures. Acute elevation in heart rate might be a biomarker to identify SAH patients who are likely to respond favorably to taVNS treatment. The latter is very important in light of the need for acute biomarkers of response to neuromodulation treatments.

      Comments:

      (1) Frequency domain heart rate variability measures should be analyzed and reported. Given the short duration of the ECG recording, the frequency domain may more accurately reflect autonomic tone.

      We sincerely appreciate this encouraging summary of our paper. We have analyzed and reported frequency-domain heart rate variability measures, including the relative power of the high-frequency band (0.15–0.4 Hz) and the relative power of the low-frequency band (0.04–0.15 Hz). We show the distribution of the two frequency-domain HRV measures in Supplementary Figure 2C-D. For the 24-hour ECG recording, we found that the change in relative high-frequency power from Day 1 was not significantly different between the treatment groups. As both high-frequency and low-frequency band power are expressed relative to the total power, a comparison of relative low-frequency power between groups would show the opposite pattern to that of relative high-frequency power. As both time-domain and frequency-domain HRV measures can reflect autonomic tone, we performed factor analysis to identify the parasympathetic activity component (Figure 2D). Comparing the change in the parasympathetic activity component with the change in relative high-frequency power, we observed both similarities and discrepancies. Specifically, both the change in the parasympathetic activity component and the change in relative high-frequency power were higher in the taVNS group in the early phase (Day 2 - 4).
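
      The relationship between the two relative band powers can be illustrated with a minimal sketch. It assumes that the power spectral density has already been estimated on a frequency grid, and that "total power" means the sum of the low- and high-frequency bands; that normalization is an assumption for illustration, not necessarily the one used in the manuscript. The function names are hypothetical.

```python
from typing import Sequence, Tuple

LF_BAND = (0.04, 0.15)  # low-frequency band in Hz, as given in the response
HF_BAND = (0.15, 0.40)  # high-frequency band in Hz, as given in the response


def band_power(freqs: Sequence[float], psd: Sequence[float],
               band: Tuple[float, float]) -> float:
    """Integrate the PSD over a band with the trapezoid rule."""
    lo, hi = band
    total = 0.0
    for i in range(len(freqs) - 1):
        f0, f1 = freqs[i], freqs[i + 1]
        # Keep only grid segments that lie fully inside the band.
        if f0 >= lo and f1 <= hi:
            total += 0.5 * (psd[i] + psd[i + 1]) * (f1 - f0)
    return total


def relative_powers(freqs: Sequence[float],
                    psd: Sequence[float]) -> Tuple[float, float]:
    """Return (relative LF power, relative HF power)."""
    lf = band_power(freqs, psd, LF_BAND)
    hf = band_power(freqs, psd, HF_BAND)
    total = lf + hf  # assumption: total power is LF + HF only
    if total == 0.0:
        raise ValueError("no spectral power in the LF or HF band")
    return lf / total, hf / total


if __name__ == "__main__":
    # Synthetic flat spectrum on a 0.01 Hz grid, for illustration only.
    freqs = [i / 100 for i in range(4, 41)]
    psd = [1.0] * len(freqs)
    print(relative_powers(freqs, psd))
```

      With this normalization the two relative powers sum to one, so any between-group difference in one band is matched by an equal and opposite difference in the other.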

      We also observed higher high-frequency power in the Sham group in the later phase. If the factor analysis successfully isolates parasympathetic activity, factors other than parasympathetic activity must be affecting the relative power of the high-frequency band. One such factor is the respiration rate. The high-frequency range is 0.15 to 0.4 Hz, corresponding to a respiration rate of approximately 9 to 24 breaths per minute. If the respiration rate exceeds 24 breaths per minute, respiratory-driven HRV may occur at a frequency above the typical high-frequency band. Given that the respiration rate was higher in the taVNS treatment group, a compensatory mechanism to ensure oxygen delivery (Figure 4E), we hypothesized that the lower high-frequency power observed in the taVNS treatment group compared to sham at later phases results from the increased respiration rate in the taVNS treatment group. Indeed, we found that normalized high-frequency power is higher when the respiration rate (RR) is below 25 breaths per minute than when it is above 25 (Cohen’s d = 0.85, Supplementary Figure 3A). Moreover, an increase in RR in the taVNS treatment group is associated with a decrease in high-frequency power (Supplementary Figure 3B). These control analyses underscore the necessity of performing factor analysis to robustly measure parasympathetic activity and confirm that taVNS treatment mitigated sympathetic overactivation during the early phase.
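
      The correspondence between the high-frequency band and the respiration rate is a unit conversion, sketched below. The band limits come from the text; the function names are hypothetical and the example respiration rates are chosen only for illustration.

```python
from typing import Tuple

HF_BAND_HZ = (0.15, 0.40)  # high-frequency HRV band, as given in the response


def hz_to_breaths_per_min(freq_hz: float) -> float:
    """Convert a frequency in Hz (cycles per second) to breaths per minute."""
    return freq_hz * 60.0


def breaths_per_min_to_hz(rate_per_min: float) -> float:
    """Convert a respiration rate in breaths per minute to Hz."""
    return rate_per_min / 60.0


def respiration_inside_hf_band(rate_per_min: float,
                               band: Tuple[float, float] = HF_BAND_HZ) -> bool:
    """True if respiratory-driven HRV at this rate falls within the band."""
    lo, hi = band
    return lo <= breaths_per_min_to_hz(rate_per_min) <= hi


if __name__ == "__main__":
    print([hz_to_breaths_per_min(f) for f in HF_BAND_HZ])
    for rate in (12.0, 30.0):  # illustrative rates, not study data
        print(rate, respiration_inside_hf_band(rate))
```

      A rate of 30 breaths per minute corresponds to 0.5 Hz, which lies above the band, so respiratory-driven variability at that rate would not contribute to high-frequency power as conventionally defined.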

      We have now discussed the results of frequency-domain HRV measures in the Discussion section: taVNS and autonomic system (p23): “A key metric that reflects this restored sympathovagal balance is the increase in heart rate variability (Figure 3F). Specifically, the factor analysis showed that the parasympathetic activity was significantly higher in the taVNS treatment group. This difference was most pronounced during the early phase, particularly between Days 2 and 4 following SAH. In addition to analyzing the correlation between the parasympathetic activity factor and established HRV measures that reflect parasympathetic activity such as RMSSD and pNNI_50 (Figure 3C), we also examined changes in a frequency-domain HRV measure, the relative power of the high-frequency band (0.15–0.4 Hz), to validate the accuracy of the factor analysis. The relative power of the high-frequency band is widely used to indicate respiratory sinus arrhythmia, a process primarily driven by the parasympathetic nervous system (Supplementary Figure 2). We found that both the change in parasympathetic activity factor and relative high-frequency power were higher in the taVNS group at the early phase (Day 2 - 4). Conversely, we observed higher high-frequency power in the Sham group during the later phase. If the factor analysis successfully isolates the parasympathetic activity, there should be other factors than the parasympathetic activity affecting the relative power of the high-frequency band. One such factor is the respiration rate. The high-frequency range is between 0.15 to 0.4 Hz, corresponding to respiration's frequency range of approximately 9 to 24 breaths per minute. If the respiration rate increases and exceeds 24 breaths per minute, the respiratory-driven HRV might occur at a frequency higher than the typical high-frequency band. Given that the respiration rate was higher in the taVNS treatment group, a compensatory mechanism to ensure oxygen delivery (Figure 4E), we hypothesized that observed lower high-frequency power in the taVNS treatment group compared to sham at later phases is a result of increased respiration rate in the taVNS treatment group. Indeed, we found the normalized high-frequency power is higher when RR is less than 25 bpm compared to when RR > 25 bpm (Cohen’s d = 0.85, Supplementary Figure 3A). Moreover, an increase in RR in the taVNS treatment group is associated with a decrease in high-frequency power (Supplementary Figure 3B). These control analyses underscored the necessity of performing factor analysis to robustly measure parasympathetic activities and confirm that taVNS treatment mitigated the sympathetic overactivation during the early phase.”

      We have also reported the changes in the relative power of the high-frequency band between the two treatment groups in Supplementary Figure 6. We did not find a significant change in relative high-frequency band power between the treatment groups (Treatment – pre-treatment difference: p = 0.74, Cohen’s d = -0.08, N(Sham) = 199, N(taVNS) = 188, Mann-Whitney U test). We reported these results in the Results section: Acute effects of taVNS on cardiovascular function (p18): “There were no significant differences in changes in corrected QT interval or heart rate variability, as measured by RMSSD, SDNN, and relative power of high-frequency band between treatment groups (Figure 5D and E and Supplementary Figure 6).”
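
      For readers unfamiliar with the two statistics reported here, the sketch below computes Cohen's d with a pooled standard deviation and the Mann-Whitney U statistic by direct pair counting. The pooled-variance form of d and the sign convention (first sample minus second) are assumptions, since the response does not state which were used, and the toy values are not study data. A p-value would normally come from a statistics library such as SciPy's `scipy.stats.mannwhitneyu`, which this sketch does not call.

```python
from math import sqrt
from statistics import mean, variance
from typing import Sequence


def cohens_d(a: Sequence[float], b: Sequence[float]) -> float:
    """Cohen's d = (mean(a) - mean(b)) / pooled sample standard deviation."""
    na, nb = len(a), len(b)
    pooled_var = ((na - 1) * variance(a) + (nb - 1) * variance(b)) / (na + nb - 2)
    return (mean(a) - mean(b)) / sqrt(pooled_var)


def mann_whitney_u(a: Sequence[float], b: Sequence[float]) -> float:
    """U statistic for sample a: each pair with a > b counts 1, each tie 0.5."""
    u = 0.0
    for x in a:
        for y in b:
            if x > y:
                u += 1.0
            elif x == y:
                u += 0.5
    return u


if __name__ == "__main__":
    # Toy values for illustration only.
    group_a = [2.0, 4.0, 6.0]
    group_b = [1.0, 3.0, 5.0]
    print(cohens_d(group_a, group_b))
    print(mann_whitney_u(group_a, group_b), mann_whitney_u(group_b, group_a))
```

      The two U values always sum to the product of the sample sizes, which is a quick sanity check on any implementation.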

      How was the "dose" chosen (20 minutes twice daily)?

      The choice of a 20-minute taVNS session twice daily was informed by findings from Addorisio et al. (2019), who administered 5-minute taVNS twice daily to patients with rheumatoid arthritis for two days. They found that circulating C-reactive protein (CRP) levels were significantly reduced after two days of treatment but returned to baseline at the second clinical assessment by day 7. Given the high inflammatory state associated with subarachnoid hemorrhage (SAH) and our intention to maintain a steady reduction in inflammation, we extended the duration of taVNS to 20 minutes per session. We have clarified the rationale for this stimulation schedule in the Results section (p5-6): “This treatment schedule was informed by findings from Addorisio et al., where a 5-minute taVNS protocol was administered twice daily to patients with rheumatoid arthritis for two days.29 Their study found that circulating C-reactive protein (CRP) levels were significantly reduced after 2 days of treatment but returned to baseline at the second clinical assessment by day 7. Given the high inflammatory state associated with SAH and our intention to maintain a steady reduction in inflammation, we decided to extend the treatment duration to 20 minutes per session.”

      Addorisio, Meghan E., et al. "Investigational treatment of rheumatoid arthritis with a vibrotactile device applied to the external ear." Bioelectronic Medicine 5 (2019): 1-11.

      The use of an acute biomarker of response is very important. A bimodal response to taVNS has been previously shown in patients with atrial fibrillation (Kulkarni et al. JAHA 2021).

      Thank you for this valuable insight and for bringing the study by Kulkarni et al. to our attention. Their study showed that the response to Low-Level Tragus Stimulation (LLTS) varied among patients with atrial fibrillation, which can be predicted by acute P-wave alternans (PWA) to some degree. We have discussed the implication of the bimodal response to taVNS in the Discussion section (p26-27): “Kulkarni et al. showed that the response to low-level tragus stimulation (LLTS) varied among patients with atrial fibrillation.49 Similarly, in our study, not all patients in the taVNS treatment group showed a reduction in mRS scores (improved degree of disability or dependence). This differential response may be inherent to taVNS and potentially influenced by factors such as anatomical variations in the distribution of the vagus nerve at the outer ear. These findings underscore the importance of using acute biomarkers to guide patient selection and optimize stimulation parameters. Furthermore, we found that increased heart rate was a potential acute biomarker for identifying SAH patients who are most likely to respond favorably to taVNS treatment. Translating this finding into clinical practice will require further research to elucidate the mechanisms by which an acute increase in heart rate may predict the outcomes of patients receiving taVNS, including its relationship with neurological evaluations, vasospasm, echocardiography, and inflammatory markers.”

      Reviewer #2 (Public Review):

      Summary:

      This study investigated the effects of transcutaneous auricular vagus nerve stimulation (taVNS) on cardiovascular dynamics in subarachnoid hemorrhage (SAH) patients. The researchers conducted a randomized clinical trial with 24 SAH patients, comparing taVNS treatment to a Sham treatment group (20 minutes per day twice a day during the ICU stay). They monitored electrocardiogram (ECG) readings and vital signs to assess acute as well as middle-term changes in heart rate, heart rate variability, QT interval, and blood pressure between the two groups. The results showed that repetitive taVNS did not significantly alter heart rate, corrected QT interval, blood pressure, or intracranial pressure. However, it increased overall heart rate variability and parasympathetic activity after 5-10 days of treatment compared to the sham treatment. Acute taVNS led to an increase in heart rate, blood pressure, and peripheral perfusion index without affecting corrected QT interval, intracranial pressure, or heart rate variability. The acute post-treatment elevation in heart rate was more pronounced in patients who showed clinical improvement. In conclusion, the study found that taVNS treatment did not cause adverse cardiovascular effects, suggesting it is a safe immunomodulatory treatment for SAH patients. The mild acute increase in heart rate post-treatment could potentially serve as a biomarker for identifying SAH patients who may benefit more from taVNS therapy.

      Strengths:

      The paper is overall well written, and the topic is of great interest. The methods are solid and the presented data are convincing.

      Weaknesses:

      (1) It should be clearly pointed out that the current paper is part of the NAVSaH trial (NCT04557618) and presents one of the secondary outcomes of that study while the declared first outcomes (change in the inflammatory cytokine TNF-α in plasma and cerebrospinal fluid between day 1 and day 13, rate of radiographic vasospasm, and rate of requirement for long-term CSF diversion via a ventricular shunt) are available as a pre-print and currently under review (doi: 10.1101/2024.04.29.24306598.). The authors should better stress this point as well as the potential association of the primary with the secondary outcomes.

      Thank you for this valuable suggestion. The current study indeed focuses on the trial’s secondary outcomes. The main objective is to evaluate the cardiovascular safety of the taVNS protocol and to provide insights that will inform the application of taVNS in SAH patients. Following your comments, we have clarified this in the Introduction section (p6): “The current study is part of the NAVSaH trial (NCT04557618) and focuses on the trial’s secondary outcomes, including heart rate, QT interval, HRV, and blood pressure.32 This interim analysis aims to evaluate the cardiovascular safety of the taVNS protocol and to provide insights that will inform the application of taVNS in SAH patients. The primary outcomes of this trial, including change in the inflammatory cytokine TNF-α and rate of radiographic vasospasm, are available as a pre-print and currently under review.26”

      The negative association between HRV and inflammatory cytokines has been reported in numerous studies such as (Williams et al., Brain, Behavior, and Immunity, 2019; Haensel et al., Psychoneuroendocrinology. 2008). There are some studies suggesting that increased sympathetic tone following SAH is associated with vasospasm (Bjerkne Wenneberg, S. et al., Acta Anaesthesiologica Scandinavica. 2020; Megjhani et al., Neurocrit Care. 2020). Based on the literature, we compared the effects of taVNS on primary and secondary outcomes. The findings from the two parallel analyses are consistent: taVNS treatment reduced pro-inflammatory cytokines and increased HRV. Furthermore, the analyses of the primary outcomes revealed a reduction in the presence of any radiographic vasospasm in the taVNS treatment group compared to the sham. We have now integrated these findings and discussed them in the Discussion section (p25-26): “Given the negative association between pro-inflammatory markers and HRV, our finding that HRV was higher in the taVNS treatment group aligns with the findings of primary outcomes of this clinical trial, which showed that taVNS treatment reduced pro-inflammatory cytokines, including tumor necrosis factor-alpha (TNF-α) and interleukin-6.26,52 The consistency between these findings strengthens the evidence supporting the anti-inflammatory effects of taVNS. In addition, the sympathetic predominance following SAH is implicated in an increased risk of delayed cerebral vasospasm, which is most commonly detected 5-7 days after SAH.12 Given that taVNS treatment mitigated the sympathetic overactivation before the typical onset of cerebral vasospasm, it could potentially reduce the severity of this complication.”

      (2) The references should be expanded, particularly with other relevant papers (including reviews and meta-analyses) on taVNS safety from a cardiovascular standpoint, such as doi: 10.1038/s41598-022-25864-1 and doi: 10.3389/fnins.2023.1227858.

      Thank you for providing the relevant papers. We have added these references to the Introduction section to provide a more comprehensive background for our study (p6): “While some animal studies have reported a potential risk of bradycardia and decreased blood pressure associated with vagus nerve stimulation, two reviews of human studies have considered the cardiovascular effects of taVNS generally safe, with adverse effects reported only in patients with pre-existing heart diseases.21,22,23”

      (3) The dose-response issue that affects both VNS and taVNS applications in different settings should be mentioned (doi: 10.1093/eurheartjsupp/suac036.) as well as the need for more dose-finding preclinical as well as clinical studies in different settings (the best stimulation protocol is likely to be disease-specific).

      Overall, the present work has the important potential to further promote the usage of taVNS even on critically ill patients and might set the basis for future randomized studies in this setting

      Thank you for this valuable insight. Understanding the dose-response relationship and determining optimal parameters tailored to specific disease contexts has been recognized as an important part of taVNS research and, more generally, of the electrical neuromodulation field. Studies in this direction are often complex and time-intensive due to the multitude of possible parameter combinations. As such, most taVNS studies have opted to use parameters established in previous studies. For example, 20 Hz taVNS is extensively used as a therapeutic intervention in stroke (Matyas Jelinek, 2024, https://www.sciencedirect.com/science/article/pii/S0014488623003138). As we pioneer the application of taVNS as an immunomodulation technique in SAH patients, we also adopt parameters reported in similar studies, aiming to provide a basis for future preclinical and clinical studies of taVNS in this patient population. As you noted, the effects of taVNS are dose-dependent, necessitating systematic exploration of the parameter space, including frequency, intensity, and duration. Our identification of an acute biomarker (heart rate) holds promise for closed-loop taVNS. We have now emphasized the importance of investigating how parameters/dose affect taVNS’s effects on immune function and cardiovascular function in SAH patients (p28): “As we pioneer the application of taVNS as an immunomodulation technique in SAH patients, we adopt parameters (20 Hz, 0.4 mA) reported in similar studies.55 The current study provides a basis for future preclinical and clinical studies of taVNS in this patient population. To build on our findings, a systematic evaluation of the relationship between parameters such as frequency, intensity, and duration and taVNS’s effects on the immune system and cardiovascular function is necessary to establish taVNS as an effective therapeutic option for SAH patients.56”

      Reviewer #2 (Recommendations For The Authors):

      The paper is overall well written, and the topic is of great interest. The reviewer has some major comments:

      (1) It should be clearly pointed out that the current paper is part of the NAVSaH trial and presents one of the secondary outcomes of that study while the declared first outcomes (change in the inflammatory cytokine TNF-α in plasma and cerebrospinal fluid between day 1 and day 13, rate of radiographic vasospasm, and rate of the requirement for long-term CSF diversion via a ventricular shunt) are available as a pre-print and currently under review (doi: 10.1101/2024.04.29.24306598.).

      We have revised the manuscript following your comment. Please see comment Reviewer 2 Public Review and our response.

      The authors should assess the relationship between the impact of taVNS on inflammatory markers in plasma and in cerebrospinal fluid and the autonomic responses. The association between inflammatory markers and noninvasive autonomic markers as well as sympathovagal balance should also be assessed. Specifically, the authors should try to assess whether the acute post-treatment elevation in heart rate was more pronounced in patients who experienced a more pronounced reduction in inflammatory biomarkers. Indeed, since all patients in the current study received the same dose of taVNS (20 Hz frequency, 250 μs pulse width, and 0.4 mA intensity), while in several cardiovascular studies (doi: 10.1016/j.jacep.2019.11.008, doi: 10.1007/s10286-023-00997-z) the intensity (amplitude) of taVNS was differentially set based on the subjective pain/sensory threshold, that might be a marker of acute afferent neuronal engagement.

      We agree that analyzing the change in cardiovascular metrics and changes in inflammatory markers is an important next step. In particular, testing whether the acute elevation in heart rate correlates with changes in inflammatory markers could further establish heart rate as a biomarker to guide patient selection and optimize stimulation parameters. (Please refer to comment 1.3 and our responses). However, in this paper, the primary objective is the cardiovascular safety of the current taVNS protocol in SAH patients. This association between inflammatory markers and autonomic responses extends beyond the scope of the current manuscript and would be more appropriately addressed in a separate publication.

      Previous literature has shown a negative association between HRV and inflammatory markers in SAH patients (for example, Adam, J., 2023). It is reasonable to postulate that taVNS modulates the immune system and the autonomic system synergistically. We found that parasympathetic tone was higher in the taVNS treatment group, with the most notable differences observed between Days 2 and 4 following SAH (Figure 3F). In a separate study of the primary outcomes of this trial (Huguenard et al., 2024), serum levels of IL-6 (a pro-inflammatory cytokine) were also significantly lower in the taVNS treatment group on Day 4 (Figure 3A, in our preprint, https://doi.org/10.1101/2024.04.29.24306598).

      We appreciate your input regarding the potential mechanism behind acute heart rate changes. In this trial, all patients who were able to engage in verbal communication were asked if they felt any prickling or pain during all sessions. We confirmed that the current stimulation setting was sub-perception in all trialed patients, making it unlikely that the observed heart rate increase was due to pain or sensory perception. Our current hypothesis is that successful activation of the afferent vagal pathway by taVNS increased arousal, resulting in increased heart rate. We have revised the Discussion section based on your insight (p29): “All patients who were capable of verbal communication were asked if they felt any prickling or pain during all sessions. We confirmed that the current taVNS protocol is below the perception threshold for all trialed patients. Altogether, successful activation of the afferent vagal pathway by taVNS increased arousal, resulting in increased heart rate.50,51”

      Huguenard, A. L. et al. Auricular Vagus Nerve Stimulation Mitigates Inflammation and Vasospasm in Subarachnoid Hemorrhage: A Randomized Trial. (2024) doi:10.1101/2024.04.29.24306598.

      Adam, J., Rupprecht, S., Künstler, E. C. S. & Hoyer, D. Heart rate variability as a marker and predictor of inflammation, nosocomial infection, and sepsis – A systematic review. Autonomic Neuroscience vol. 249 103116 (2023).

      A new table should be provided with the mean (or median) values of the two arms of the population (taVNS and sham) including baseline clinical characteristics, comorbidities (mean age, % of female, % with known hypertension, diabetes, etc), ongoing medications (% on beta-blockers, etc), and pre, during and post-treatment absolute values (expressed as mean or median depending on the distribution) of the studied parameters (QT and QTc absolute values, heart rate, SDNN, etc) in order for the reader to have a better understanding of how SAH affects these parameters. Absolute changes in the abovementioned parameters should also be presented in the table. For instance, the reported absolute increase in heart rate, based on Figure 5, panel C, seems very modest, below 2 bpm. This is very important to underline for several reasons, including the fact that the evaluation of the impact of treatment on heart rate variability as assessed in the time domain might be influenced by concomitant changes in heart rate due to the nonlinearity of neural modulation of sinus node cycle length. Indeed, time-domain indexes of HRV intrinsically increase when heart rate decreases in a nonlinear way, while frequency domain indexes (e.g. the low frequency/high frequency (LF/HF) ratio), appear to be devoid of intrinsic rate-dependency (doi: 10.1016/s0008-6363(01)00240-1).

      Thank you for your suggestion. We have added the new table to the manuscript. In this table, we include clinical characteristics, the median of absolute values of cardiovascular metrics from 24-hour ECG recording, and the median absolute changes in these metrics for both arms. We believe that absolute values of cardiovascular metrics from 24-hour ECG recording are more informative about how SAH affects these parameters than metrics for the pre-, during-, and post-treatment periods.

      In Result (p7), we have added: “Supplementary Table 3 shows the clinical characteristics of the two treatment groups.” In Result, Acute effect of taVNS on cardiovascular function (p20), we have added: “Supplementary Table 3 summarizes the absolute changes in cardiovascular metrics for the treatment groups.”

      Thank you for raising the concern about HRV and providing the reference. We have now reported a frequency-domain index in our results: the relative high-frequency power, which is negatively correlated with the LF/HF ratio. The high-frequency power is used to capture sinus arrhythmia, reflecting the parasympathetic modulation of the heart. Although the frequency domain metrics might be less susceptible to the rate-dependency (doi: 10.1016/s0008-6363(01)00240-1), there are circumstances when the frequency domain metrics might not accurately reflect the autonomic tone (Please see Reviewer 1 Public Review and our responses).

      An attempt to correct the effect of taVNS on the evaluated autonomic parameters according to age should be provided, considering that there were no age limits and parasympathetic indexes, particularly at the sinus node level, are known to decrease with age, particularly for those older than 65 years.

      Thank you for the suggestion. We were aware of the influence of age on heart rate and heart rate variability. In our initial analysis, we compared the change in autonomic parameters from day 1 within each subject across the two treatment groups. This approach controls for individual differences, including those due to age. In addition, age is a risk factor for subarachnoid hemorrhage, and older individuals often face an increased risk of poor outcomes. To further verify if age influences autonomic changes following SAH, we performed ANCOVA on autonomic function parameters with age included as a covariate. This analysis showed that age was negatively correlated with changes in heart rate, SDNN, and RMSSD from Day 1, but not with changes in QT intervals. After adjusting for age, we found that RMSSD changes and SDNN changes were significantly higher in the taVNS treatment group, while QTc changes were significantly lower in this group. These results align with the main findings (Figures 2 and 3). In addition, autonomic changes following SAH may be influenced by age. Specifically, lower RMSSD and SDNN in older individuals suggest a greater shift toward sympathetic predominance following SAH. We have now reported these results in Supplementary Table 4 and discussed their implication in the Discussion section (p28): “To control for individual differences, including those due to age, our study compared the change in cardiovascular parameters from Day 1 within each subject across treatment groups. To further verify if age influences autonomic changes following SAH, we performed ANCOVA on autonomic function parameters with age included as a covariate. This analysis showed that age was negatively correlated with changes in heart rate, SDNN, and RMSSD from Day 1 but not with changes in QT intervals. After adjusting for age, we found that RMSSD changes and SDNN changes were significantly higher, while QTc changes were significantly lower in the taVNS treatment group (Supplementary Table 4). These results align with the conclusion that repetitive taVNS treatment increased HRV and was unlikely to cause bradycardia or QT prolongation. In addition, autonomic changes following SAH may be influenced by age. Specifically, lower RMSSD and SDNN in older individuals suggest a greater shift toward sympathetic predominance following SAH (Supplementary Table 4).”
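      To make the age adjustment concrete, an ANCOVA with one binary factor and one covariate can be written as an ordinary least-squares fit of the outcome on treatment group and age. The following is a minimal sketch under stated assumptions, not the analysis code used in the study: the function name `ancova_fit`, the 0/1 coding of the groups, the centering of age, and the toy numbers in the usage note are all hypothetical, and the sketch returns point estimates only (no F-tests or p-values, for which a statistics package would be needed).

```python
import numpy as np


def ancova_fit(outcome, group, age):
    """Least-squares fit of: outcome = b0 + b1 * group + b2 * (age - mean age).

    outcome : change in a metric from Day 1 (e.g. RMSSD), one value per record
    group   : 0 for sham, 1 for taVNS (hypothetical coding)
    age     : age in years for the same records
    Returns the age-adjusted group effect (b1) and the age slope (b2).
    """
    outcome = np.asarray(outcome, dtype=float)
    group = np.asarray(group, dtype=float)
    age = np.asarray(age, dtype=float)

    # Design matrix: intercept, treatment indicator, centered covariate.
    design = np.column_stack([np.ones_like(outcome), group, age - age.mean()])
    coef, *_ = np.linalg.lstsq(design, outcome, rcond=None)
    return {
        "intercept": float(coef[0]),
        "group_effect": float(coef[1]),
        "age_slope": float(coef[2]),
    }
```

      For example, with toy data generated as `outcome = 5 + 3 * group - 0.1 * (age - mean age)`, the fit recovers a group effect of 3 and an age slope of -0.1. A negative age slope corresponds to the pattern described above, in which older patients show smaller changes in the metric.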

      The results of the current study should be discussed considering what was previously demonstrated concerning the cardiovascular effects of taVNS (doi: 10.3389/fnins.2023.1227858).

      We appreciate the suggestion to consider previous findings on the cardiovascular effects of taVNS. However, it is important to note that most studies investigating the cardiovascular effects of taVNS involve healthy individuals, whereas our study focuses on SAH patients who are critically ill. Given the influence of SAH on cardiovascular parameters, we should be cautious when generalizing our findings to the broader population. Previous studies involving stroke populations have reported cardiovascular parameters descriptively as part of their safety assessments (doi: 10.1155/2020/8841752). Our study is currently the only one systematically investigating the cardiovascular safety of taVNS in SAH patients. Furthermore, the review paper (doi: 10.3389/fnins.2023.1227858) includes a highly heterogeneous mix of studies, such as auricular acupressure, auricular acupuncture, and electrical stimulation applied to different parts of the ear. For the subset of studies involving electrical stimulation, there is considerable variation in the parameters used, with frequencies ranging from 0.5 Hz to 100 Hz, currents from 0.1 mA to 45 mA, and durations spanning from 20 minutes to 168 days. These variations make direct comparisons with our findings challenging.

      It looks like QT measurements were performed automatically. It should be specified which method was used for the measurements (threshold, tangent, or superimposed method?).

      In our study, QT intervals were measured based on thresholding after wavelet transforming the ECG signals (Martínez, J. P., IEEE Transactions on Biomedical Engineering, 2004, doi: 10.1109/TBME.2003.821031). The local maxima of the wavelet transform correspond to significant changes in the ECG signal, such as the rapid upward or downward deflections associated with the QRS complex. The algorithm searches for modulus maxima, that is, peaks of wavelet transform coefficients that exceed specific thresholds, to identify the QRS complex. R peaks are found as the zero crossings between the positive-negative modulus maxima pair. After localizing the R peak, the Q onset is detected as the beginning of the first modulus maximum before the modulus maximum pair created by the R wave. To identify the T wave, the algorithm searches for local maxima in the absolute wavelet transform in a search window defined relative to the QRS complex. Thresholding is used to identify the offset of the T wave. Please refer to comments 3.4 and 3.5 and our responses for details. We have clarified the method for measuring QT in the Method section (p35): “This algorithm identifies the QRS complex by searching for modulus maxima, which are peaks in the wavelet transform coefficients that exceed specific thresholds. The onset of the QRS complex is determined as the beginning of the first modulus maximum before the modulus maximum pair created by the R wave. To identify the T wave, the algorithm searches for local maxima in the absolute wavelet transform in a search window defined relative to the QRS complex. Thresholding is used to identify the offset of the T wave.”

      QTc dispersion was not evaluated, and this should be listed as a limitation of the current study.

      We have added this limitation in the Discussion section: Limitations and outlook (p31): “The current study did not explore the effects of taVNS on less commonly used cardiovascular metrics, such as QTc dispersion.”

      It has been recently suggested (doi: 10.1016/j.brs.2018.12.510) that QTc, as a potential indirect marker of HRV, might be used as a biomarker for VNS response in the treatment of resistant depression. The author should try to assess whether in the current study baseline QTc before taVNS is associated with outcome and with taVNS response.

      Thank you for the suggestion. The conference abstract in the provided doi stated that QTc as an indirect marker of HRV before implantation was correlated with changes in the depression rating scale. The mechanism seems to be that QTc has information about the pathophysiology of the depression (10.1097/YCT.0000000000000684). The current study focused on the comparison between taVNS treatment and sham treatment. Our future study will further test if SAH patients’ response to taVNS can be predicted by baseline QTc.

      The dose-response issue that affects both VNS and taVNS in different settings should be mentioned (doi: 10.1093/eurheartjsupp/suac036.) as well as the need for more dose-finding preclinical as well as clinical studies in different settings (the best stimulation protocol is likely to be disease-specific).

      Please refer to our responses to comment 3.

      Minor Comments

      Some typos or commas instead of affirmative points and vice versa.

      Thank you for pointing this out. We have carefully proofread the manuscript and made the necessary corrections to ensure proper punctuation and grammar throughout.

      Table 1: why is age expressed as a range for each person?

      MedRxiv asks authors to remove all identifying information. Precise ages are direct identifiers, as opposed to age ranges. We have now revised the age column to ‘decade of life’ in the updated table. We believe this modification reduces confusion while adhering to MedRxiv’s guidelines.

      Although already reported in the study protocol (doi: 10.1101/2024.03.18.24304239), the heart rate limits for inclusion should be reported (sustained bradycardia on arrival with a heart rate < 50 beats per minute for > 5 minutes, implanted pacemaker or another electrical device).

      We have now added the specific inclusion and exclusion criteria in the Method details section (p33): “Inclusion criteria were: (1) Patients with SAH confirmed by CT scan; (2) Age > 18; (3) Patients or their legally authorized representative are able to give consent. Exclusion criteria were: (1) Age < 18; (2) Use of immunosuppressive medications; (3) Receiving ongoing cancer therapy; (4) Implanted electrical device; (5) Sustained bradycardia on admission with a heart rate < 50 beats per minute for > 5 minutes; (6) Considered moribund/at risk of imminent death.”

      Why did the authors choose a taVNS schedule of two times per day of 30 minutes each as compared for instance to one hour per day? Please comment on that also referring to other taVNS studies in the acute setting such as the one by Dasari T et al (doi: 10.1007/s10286-023-00997-z.) where taVNS was applied for 4 hours twice daily. For instance, Yum Kim et al (doi: 10.1038/s41598-022-25864-1) recently reported, in a systematic review and meta-analysis of taVNS safety, that repeated sessions and sessions lasting 60 min or more were more likely to lead to adverse events.

      The International Consensus-Based Review and Recommendations for Minimum Reporting Standards in Research on Transcutaneous Vagus Nerve Stimulation should be referred to and contextualized (doi: 10.3389/fnhum.2020.568051).

      Thank you for raising this question and providing relevant references. We have reviewed the proposed checklist for minimum reporting items in taVNS research (10.3389/fnhum.2020.568051) and have ensured that our manuscript complies with the recommended reporting items.

      The current taVNS schedule was based on findings from Addorisio et al. (2019). We have revised the manuscript to clarify the rationale behind the current taVNS protocol. Please refer to our response to comment 1.2. The two studies mentioned in the comments were published after our trial was designed and initiated (https://clinicaltrials.gov/study/NCT04557618). Based on the meta-analysis by Yum Kim et al., the short duration of treatment sessions might explain the cardiovascular safety of the current taVNS protocol. We are also currently assessing the effects of our taVNS protocol on inflammatory markers.

      Reviewer #3 (Public Review):

      Summary:

      The authors aimed to characterize the cardiovascular effects of acute and repetitive taVNS as an index of safety. The authors concluded that taVNS treatment did not induce adverse cardiovascular effects, such as bradycardia or QT prolongation.

      Strengths:

      This study has the potential to contribute important information about the clinical utility of taVNS as a safe immunomodulatory treatment approach for SAH patients.

      Weaknesses:

      A number of limitations were identified:

      (1) A primary hypothesis should be clearly stated. Even though the authors state the design is a randomized clinical trial, several aspects of the study appear to be exploratory. The method of randomization was not stated. I am assuming it is a forced randomization given the small sample size and approximately equal numbers in each arm.

      Thank you for the suggestion. The current study is part of the NAVSaH trial (NCT04557618), aiming to define the effects of taVNS on inflammatory markers, vasospasm, hydrocephalus, and continuous physiology data. This study focuses on the effects of repetitive and acute taVNS on continuous physiology data to evaluate the cardiovascular safety of the current taVNS protocol. The primary hypothesis tested in our study is that repetitive taVNS increased HRV but did not cause bradycardia and QT prolongation. Following your comments, we have clarified this in the Introduction section (p6): “This interim analysis aims to evaluate the cardiovascular safety of the taVNS protocol and to provide insights that will inform the application of taVNS in SAH patients. The primary outcomes of this trial, including change in the inflammatory cytokine TNF-α and rate of radiographic vasospasm, are available as a pre-print and currently under review.26 Based on a meta-analysis, repeated sessions lasting 60 min or more are likely to lead to aversive effects; therefore, we hypothesized that repetitive taVNS increased HRV but did not cause bradycardia and QT prolongation.23”

      (2) The authors "first investigated whether taVNS treatment induced bradycardia or QT prolongation, both potential adverse effects of vagus nerve stimulation. This analysis showed no significant differences in heart rate calculated from 24-hour ECG recording between groups." A justification should be provided for why a difference is expected from 20 minutes of taVNS over a period of 24 hours. Acute ECG changes are a concern for increasing arrhythmic risk, for example, due to cardiac electrical restitution properties.

      A human study (Clancy, L. A. et al., Brain Stimulation, 2017, https://doi.org/10.1016/j.brs.2014.07.031) has found that 15-min taVNS led to reduced sympathetic activity measured by low-frequency/high-frequency (LF/HF) ratio. The sympathetic activity remained lower than baseline levels during the recovery period, suggesting potential long-term effects of taVNS on cardiovascular function. In addition, the repetitive taVNS treatment in this clinical trial was intended to maintain a steady low-inflammatory state. Given the potential life-threatening implications of bradycardia and QT prolongation in these critically ill patients, we deemed it crucial to evaluate heart rate and QT interval both acutely and from 24-hour ECG monitoring. We have now provided the justification in the Result section (p11): “A study has shown that 15 minutes of taVNS reduced sympathetic activity in healthy individuals, with effects that persist during the recovery period.33 This finding suggests that taVNS may exert long-term effects on cardiovascular function. Therefore, we investigated whether repetitive taVNS treatment affects heart rate and QT interval, key indicators of bradycardia or QT prolongation, using 24-hour ECG recording.”

      An additional value of analyzing 24-hour ECG recording is that we can detect bradycardia or QT prolongation that happens outside the period of the stimulation, which could be caused by repetitive taVNS. To this end, we reanalyzed the data and calculated the percentage of prolonged QT intervals using a 500 ms criterion (Giudicessi, J. R., Noseworthy, P. A. & Ackerman, M. J. The QT Interval. Circulation, 2019). When comparing the percentage of prolonged QT intervals between the treatment groups, we found that changes in the percentage of prolonged QT intervals from Day 1 were higher in the Sham group (Figure 3F, Mann–Whitney U test, N(taVNS) = 94, N(Sham)=95, p-value < 0.001, Cohen’s d = -0.72). We have now reported the results in the Result section (p11): “To ensure that repetitive taVNS did not lead to QT prolongation happening outside the period of stimulation, we calculated the percentage of prolonged QT intervals. Prolonged QT intervals were defined as corrected QT interval >= 500 ms. We found that changes in prolonged QT intervals percentage from Day 1 were higher in the Sham group (Figure 3F, Mann–Whitney U test, N(taVNS) = 94, N(Sham)=95, p-value < 0.001, Cohen’s d = -0.72).”
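      The two quantities reported above, the percentage of prolonged QT intervals and the standardized group difference, can be illustrated with a short sketch. It is not the authors' code: the function names are hypothetical, Cohen's d is computed here with the pooled sample standard deviation (the response does not state which variant was used), and the convention that the first argument is the taVNS group is an assumption chosen so that a negative d means lower values in the taVNS group than in the Sham group.

```python
import math
from statistics import mean, variance


def percent_prolonged(qtc_ms, threshold_ms=500.0):
    """Percentage of corrected QT intervals at or above the threshold (QTc >= 500 ms)."""
    if not qtc_ms:
        raise ValueError("at least one QTc value is required")
    n_prolonged = sum(1 for qtc in qtc_ms if qtc >= threshold_ms)
    return 100.0 * n_prolonged / len(qtc_ms)


def cohens_d(first, second):
    """Cohen's d using the pooled sample standard deviation.

    Negative when the mean of `first` is below the mean of `second`.
    """
    n1, n2 = len(first), len(second)
    pooled_var = ((n1 - 1) * variance(first) + (n2 - 1) * variance(second)) / (n1 + n2 - 2)
    return (mean(first) - mean(second)) / math.sqrt(pooled_var)
```

      In a design like the one described, `percent_prolonged` would be evaluated per recording day, differenced against Day 1, and the resulting changes compared between groups; the group comparison itself was done with a Mann–Whitney U test, which this sketch does not reimplement.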

      The concern regarding acute ECG changes related to increased arrhythmic risk is valid. We have improved the reasoning behind analyzing acute ECG change, which now reads (p20): “Assessing the acute effect of taVNS on cardiovascular function is crucial for its safe translation into clinical practice. We compared the acute change of heart rate, corrected QT interval, and heart rate variability between treatment groups, as abrupt changes in the pacing cycle may increase the risk of arrhythmias.”

      (3) More rigorous evaluation is necessary to support the conclusion that taVNS did not change heart rate, HRV, QTc, etc. For example, shifts in peak frequencies of the high-frequency vs. low-frequency power may be effective at distinguishing the effects of taVNS. Further, compensatory sympathetic responses due to taVNS should be explored by quantifying the changes in the trajectory of these metrics during and following taVNS.

      We appreciate your concerns regarding the potential effects on the autonomic system associated with taVNS treatment. We would like to clarify that the primary objective of our study was to evaluate the cardiovascular safety of the taVNS protocol in SAH, with a specific focus on detecting any acute changes in heart rate and QT interval. As you highlighted, such acute ECG changes are a concern for increasing arrhythmic risk. By directly studying the trend of heart rate, HRV, and QT over the acute treatment periods, we found no significant change in these metrics between treatment groups. In addition, these metrics remained within 0.5 standard deviations of their daily fluctuations during and following taVNS treatment (Figure 5 and Supplementary Figure 6). These findings support the conclusion that the current protocol is unlikely to cause cardiac complications.

      In response to your suggestion to conduct a more rigorous analysis, particularly concerning peak frequencies within the high-frequency (HF) and low-frequency (LF) bands, we pursued this analysis to explore more nuanced effects of taVNS on the autonomic system. We compared the shifts in peak frequencies within these bands between the treatment groups and found no significant changes that would suggest a sympathetic or parasympathetic shift following acute taVNS.

      In detail, we have made the following revisions following your comments:

      (1) We have clarified the motivation behind studying the acute change of cardiac metrics following taVNS treatment – monitoring the cardiovascular safety of current taVNS protocol in SAH patients (p18): please refer to response to comment 3.2.

      (2) We compared the peak frequencies of the high-frequency and low-frequency bands following taVNS and added the results to the supplementary materials.

      We note that neurophysiology underlying peak frequencies has not been thoroughly studied in the literature compared to the LF-band power or HF-band power. Therefore, we report this result as an exploratory analysis.

      (3) We have added the changes of QTc during and following taVNS in Figure 5 and showed that they were within 0.5 standard deviations of their daily fluctuations during and following taVNS treatment. We have now shown the changes of HRV during and following taVNS in Supplementary Figure 6 A-D. We added the change of high-frequency power following Reviewer #1’s comment 1.1. Overall, our results suggest that repetitive taVNS increased parasympathetic activities, while there is no evidence that acute taVNS significantly affected heart rate or QT.
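      For readers unfamiliar with the peak-frequency analysis mentioned in item (2), one common way to obtain it is to resample the RR-interval series onto an even time grid, estimate its power spectrum, and take the frequency of maximum power within each band. The sketch below illustrates that idea only; the response does not describe the spectral method used, so the 4 Hz resampling rate, the Welch estimator, the segment length, and the function name `band_peaks` are assumptions. The band limits are the conventional LF (0.04–0.15 Hz) and HF (0.15–0.40 Hz) ranges.

```python
import numpy as np
from scipy.signal import welch

# Conventional HRV band limits in Hz.
BANDS = {"LF": (0.04, 0.15), "HF": (0.15, 0.40)}


def band_peaks(rr_ms, fs_resample=4.0, nperseg=256):
    """Return the frequency (Hz) of maximum spectral power in each HRV band."""
    rr_s = np.asarray(rr_ms, dtype=float) / 1000.0
    beat_times = np.cumsum(rr_s)  # time at which each interval ends

    # RR intervals are unevenly spaced in time; interpolate onto an even grid.
    grid = np.arange(beat_times[0], beat_times[-1], 1.0 / fs_resample)
    tachogram = np.interp(grid, beat_times, rr_s)

    freqs, psd = welch(
        tachogram - tachogram.mean(),
        fs=fs_resample,
        nperseg=min(nperseg, len(grid)),
    )

    peaks = {}
    for name, (low, high) in BANDS.items():
        in_band = (freqs >= low) & (freqs < high)
        peaks[name] = float(freqs[in_band][np.argmax(psd[in_band])])
    return peaks
```

      With 256-sample segments at 4 Hz the frequency resolution is about 0.016 Hz, so small shifts in peak frequency between conditions would need longer segments or a longer recording window to resolve.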

      (4) The authors do not state how the QT was corrected and at what range of heart rates. Because all forms of corrections are approximations, the actual QT data should be reported along with the corrected QT.

      The corrected QT interval (QTc) estimates the QT interval at a standard heart rate of 60 bpm. In practice, we removed RR intervals outside of the 300 – 2000 ms range. Further, we removed ectopic beats, defined as RR intervals differing by more than 20% from the preceding one. We used the Bazett formula to correct the QT intervals: $\mathrm{QTc} = \mathrm{QT}/\sqrt{\mathrm{RR}}$, with RR in seconds. We have now clarified how QT was corrected in the Method section – Data processing (p35-36): “R-peaks were detected as local maxima in the QRS complexes. P-waves, T-waves, and QRS waves were delineated based on the wavelet transform (Figure 2A-C).34 RR intervals were preprocessed to exclude outliers, defined as RR intervals greater than 2 s or less than 300 ms. RR intervals with > 20% relative difference to the previous interval were considered ectopic beats and excluded from analyses. After preprocessing, RR intervals were used to calculate heart rate, heart rate variability, and corrected QT (QTc) based on Bazett's formula: $\mathrm{QTc} = \mathrm{QT}/\sqrt{\mathrm{RR}}$.44 The corrected QT interval (QTc) estimates the QT interval at a standard heart rate of 60 bpm.”
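      The preprocessing and correction steps described above can be sketched in a few lines. This is an illustration under stated assumptions rather than the study's code: the function names are hypothetical, and the ectopic-beat rule is implemented by comparing each interval with the immediately preceding interval that passed the range filter, which is one reading of "the previous interval".

```python
import math


def clean_rr(rr_ms, low_ms=300.0, high_ms=2000.0, max_rel_diff=0.20):
    """Remove out-of-range RR intervals, then intervals that differ by more
    than `max_rel_diff` from the preceding in-range interval (ectopic beats)."""
    in_range = [rr for rr in rr_ms if low_ms <= rr <= high_ms]
    kept = in_range[:1]  # the first interval has no predecessor to compare with
    for previous, current in zip(in_range, in_range[1:]):
        if abs(current - previous) / previous <= max_rel_diff:
            kept.append(current)
    return kept


def heart_rate_bpm(rr_ms):
    """Instantaneous heart rate from one RR interval in milliseconds."""
    return 60000.0 / rr_ms


def bazett_qtc(qt_ms, rr_ms):
    """Bazett correction: QTc = QT / sqrt(RR), with RR expressed in seconds."""
    return qt_ms / math.sqrt(rr_ms / 1000.0)
```

      At an RR interval of 1000 ms (60 bpm) the correction leaves QT unchanged, which is the sense in which QTc "estimates the QT interval at a standard heart rate of 60 bpm"; at an RR of 640 ms, a QT of 360 ms corrects to 450 ms.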

      We have reported the actual QT data in the Result section (p10 and p19): “Moreover, changes in corrected QT interval from Day 1 were significantly higher in the Sham group compared to the taVNS group (Figure 3B, Mann–Whitney U test, N(taVNS) = 94, N(Sham)=95, p-value < 0.001, Cohen’s d = -0.57). Similarly, uncorrected QT intervals from Day 1 were higher in the Sham group (Supplementary Figure 10A, Cohen’s d = -0.42).”

      “Supplementary Figure 10B-C shows the acute changes in uncorrected QT interval.”

      (5) The QT extraction method needs to be more robust. For example, in Figure 2C, the baseline voltage of the ECG is shifting while the threshold appears to be fixed. If indeed the threshold is not dynamic and does not account for baseline fluctuations (e.g., due to impedance changes from respiration), then the measures of the QT intervals were likely inaccurate.

      A robust method to estimate the QT interval is essential in our study. To this end, we used a state-of-the-art method to calculate QT intervals. We first applied a 0.5 Hz fifth-order high-pass Butterworth filter and a 60 Hz powerline filter on the ECG recording. The high-pass filtering is used to correct potential baseline fluctuations. Subsequently, a wavelet-based algorithm was used to delineate the QRS complex and T wave (Martínez, J. P., IEEE Transactions on Biomedical Engineering, 2004). In short, this algorithm identifies the QRS complex based on modulus maxima of the wavelet transform of ECG signals. After localizing the R peak, the Q onset is detected as the beginning of the first modulus maximum before the modulus maximum pair created by the R wave. The detection is performed on the wavelet transform at a small scale rather than on the original signal, minimizing the effect of baseline shift (see III Detection methods, (5), Cuiwei Li et al., IEEE TBME, 1995, Detection of ECG Characteristic Points Using Wavelet Transforms). The T wave is detected similarly, based on the wavelet transform. Please refer to our response to comment 2.9.

      Martínez, J. P., Almeida, R., Olmos, S., Rocha, A. P., & Laguna, P. (2004). A wavelet-based ECG delineator: evaluation on standard databases. IEEE Transactions on Biomedical Engineering, 51(4), 570-581.

      In Figure 2C, the purple and green lines take the value of 1 at the QRS onset or the T wave offset and 0 otherwise, which might appear to be a threshold. We have now used vertical lines to denote the detected QRS onsets and T wave offsets. Please see below for a comparison of the annotation:

      We have clarified the details of extracting QT intervals from ECG recordings in the Method section (p31): “To calculate cardiac metrics, we first applied a 0.5 Hz fifth-order high-pass Butterworth filter and a 60 Hz powerline filter on ECG data to reduce artifacts. 35 We detected QRS complexes based on the steepness of the absolute gradient of the ECG signal using the Neurokit2 software package.35 R-peaks were detected as local maxima in the QRS complexes. P waves, T waves, and QRS complexes were delineated based on the wavelet transform of the ECG signals proposed by Martinez J. P. et al. (Figure 2A-C).36 This algorithm identifies the QRS complex by searching for modulus maxima, which are peaks in the wavelet transform coefficients that exceed specific thresholds. The onset of the QRS complex is determined as the beginning of the first modulus maximum before the modulus maximum pair created by the R wave. To identify the T wave, the algorithm searches for local maxima in the absolute wavelet transform in a search window defined relative to the QRS complex. Thresholding is used to identify the offset of the T wave.”
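      The filtering step quoted above, which is what removes baseline wander before delineation, can be sketched as follows. This is a minimal illustration assuming SciPy, not the pipeline used in the study: the function name `clean_ecg`, the zero-phase (forward-backward) application of the filters, and the use of an IIR notch with quality factor 30 for the 60 Hz component are assumptions, and the Neurokit2 package cited in the manuscript may implement powerline filtering differently.

```python
import numpy as np
from scipy.signal import butter, filtfilt, iirnotch, sosfiltfilt


def clean_ecg(ecg, fs, highpass_hz=0.5, powerline_hz=60.0, notch_q=30.0):
    """Remove baseline wander and powerline interference from a raw ECG trace.

    ecg : 1-D array of samples
    fs  : sampling rate in Hz (must exceed twice the powerline frequency)
    """
    ecg = np.asarray(ecg, dtype=float)

    # Fifth-order Butterworth high-pass at 0.5 Hz; second-order sections are
    # numerically safer than (b, a) coefficients for low cutoff frequencies.
    sos = butter(5, highpass_hz, btype="highpass", fs=fs, output="sos")
    detrended = sosfiltfilt(sos, ecg)

    # Narrow notch centred on the powerline frequency.
    b, a = iirnotch(powerline_hz, notch_q, fs=fs)
    return filtfilt(b, a, detrended)
```

      Because slow drift (for example from respiration) sits well below 0.5 Hz, it is strongly attenuated by the high-pass stage, so wave onsets and offsets detected afterwards are not referenced to a moving baseline.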

      We have modified Figure 2C for better clarity:

      More statistical rigor is needed. For example, in Figure 2D, the change in heart rate for days 5-7, 8-10, and 11-13 is clearly a bimodal distribution and as such, should not be analyzed as a single distribution. Similarly, Figure 2E also shows a bimodal distribution. Without the QT data, it is unclear whether this is due to the application of the heart rate correction method.

      Thank you for raising this concern. Several factors could contribute to the observed distribution of changes in heart rate for days 5-7, 8-10, and 11-13, as shown in Figure 2D. One such factor is the smaller sample size in the later days. The mean duration of hospitalization for the 24 subjects included in this study was 11.29 days, with a standard deviation of 6.43 days. Other factors include variations in medical history, SAH pathology, and clinical outcomes during hospitalization. Further analysis revealed that heart rate was lower in patients with improved mRS scores (Supplementary Figure 4B), suggesting that clinical outcomes might impact changes in heart rate. Understanding the association between cardiovascular metrics and clinical assessments, such as vasospasm and inflammation, could help decide whether future taVNS trials should control for these factors when evaluating the effects of taVNS on cardiovascular function. We are continuing to recruit SAH patients in this clinical trial, and we plan to perform such analyses in future studies.

      In the manuscript, we reported the effect size between the treatment groups for days 5-7, 8-10, and 11-13. This should be interpreted in conjunction with the characteristics of the distribution. To provide a rigorous interpretation of our results, we have now discussed these considerations in the discussion section (p28): “We noticed a high variance of change in heart rate for days 5 – 7, 8 – 10, and 11 – 13 for both treatment groups (Figure 2D). This may be due to the small sample size in the later days, given that the mean duration of hospitalization for the 24 subjects included in this study was 11.3 days with a standard deviation of 6.4. Differences in medical history and clinical outcomes during hospitalization may also explain the variance of change in heart rate for the later days. For example, heart rate was lower in patients with improved mRS scores (Supplementary Figure 4B). Understanding the association between cardiovascular metrics and clinical assessments, such as vasospasm and inflammation, could help decide whether future taVNS trials should control for these factors when evaluating the effects of taVNS on cardiovascular function.”

      To test our hypothesis that repetitive taVNS does not induce significant heart rate change, we performed a two-tailed equivalence test of heart rate change between the two treatment groups, including data from days 2-13 (Figure 2D, left panel). To verify the validity of this approach, we calculated the Bimodality Coefficient (BC) and performed the Dip Test for unimodality on the distribution of heart rate change for the two treatment groups. The BC is a measure that combines skewness and kurtosis to assess whether a distribution is bimodal or unimodal. A BC value greater than 0.555 typically indicates a bimodal distribution, whereas a BC value less than or equal to 0.555 suggests a unimodal distribution. The Dip Test is a statistical test that assesses the unimodality of a distribution. A non-significant p-value (p-value ≥ 0.05) indicates that the distribution is likely unimodal. This analysis suggests that the distributions of heart rate changes in both treatment groups (days 2 - 13) are unimodal (BC = 0.457 and p = 0.374 for the taVNS treatment group; BC = 0.421 and p = 0.656 for the sham treatment group). This finding provides justification for our statistical approaches.
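
      As an illustration of the Bimodality Coefficient described above, the following is a minimal sketch rather than the code used in the study. It assumes the commonly used sample formula, BC = (g1² + 1) / (g2 + 3(n − 1)² / ((n − 2)(n − 3))), where g1 and g2 are the bias-corrected sample skewness and excess kurtosis. The function name and the two toy data sets are hypothetical, and the Dip Test is not reproduced here.

```python
import math


def bimodality_coefficient(values):
    """Sample bimodality coefficient (assumed formula, see lead-in).

    Uses bias-corrected sample skewness and excess kurtosis.
    Values above roughly 0.555 (the value for a uniform distribution)
    are conventionally read as a hint of bimodality.
    """
    n = len(values)
    if n < 4:
        raise ValueError("need at least 4 observations")
    mean = sum(values) / n
    # Central moments (population form, divided by n).
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    m4 = sum((v - mean) ** 4 for v in values) / n
    if m2 == 0:
        raise ValueError("all observations are identical")
    g1 = m3 / m2 ** 1.5          # biased skewness
    g2 = m4 / m2 ** 2 - 3.0      # biased excess kurtosis
    # Small-sample bias corrections.
    skew = math.sqrt(n * (n - 1)) / (n - 2) * g1
    kurt = (n - 1) / ((n - 2) * (n - 3)) * ((n + 1) * g2 + 6.0)
    return (skew ** 2 + 1.0) / (kurt + 3.0 * (n - 1) ** 2 / ((n - 2) * (n - 3)))


# Hypothetical toy data: two well-separated clusters versus a single peak.
two_clusters = [-1.0] * 50 + [1.0] * 50
single_peak = [0] * 1 + [1] * 4 + [2] * 6 + [3] * 4 + [4] * 1

bc_two = bimodality_coefficient(two_clusters)
bc_one = bimodality_coefficient(single_peak)
print(f"two clusters: BC = {bc_two:.3f}")
print(f"single peak:  BC = {bc_one:.3f}")
```

      In this sketch the two-cluster sample falls above the 0.555 cutoff and the single-peak sample falls below it, which is the same decision rule applied to the heart rate change distributions above.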

      Figure 3A shows a number of outliers. A SDNN range of 200 msec should raise concern for a non-sinus rhythm such as arrhythmia or artifact, instead of sinus arrhythmia. Moreover, Figure 3B shows that the Sham RMSSD data distribution is substantially skewed by the presence of at least 3 outliers, resulting in lower RMSSD values compared to taVNS. What types of artifact or arrhythmia discrimination did the authors employ to ensure the reported analysis is on sinus rhythm? The overall results seem to be driven by outliers.

      Mild cardiac abnormalities are common in SAH patients. Therefore, changes in cardiovascular metrics were expected to differ from those in healthy individuals, which makes studying the cardiovascular effect of taVNS extremely important in this context. Following your comment, we investigated whether the large SDNN change was due to arrhythmia or artifacts. Except for a single instance where one subject exhibited an SDNN change of 200 ms on a particular day, all other SDNN changes were less than 150 ms. We identified the subject and day associated with the largest SDNN change, which is Day 7. As shown in Author response image 1A and B, the SDNN of this subject increased on day 7 while the heart rate (HR) of this subject decreased. Changes in HRV were inversely related to HR changes, suggesting shifts in sympathetic and parasympathetic tone. We checked the ECG recording and the extracted NN intervals (processed RR intervals) on that day. The NN intervals were more variable on day 7 compared to day 1 (Author response image 1C and D). To determine whether the large variance observed between 5:01 am and 5:02 am was due to arrhythmia or artifacts, we closely examined the corresponding ECG signals (Author response image 1E and F). Based on our analysis, the elevated SDNN is unlikely to be attributable to artifacts.
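
      For readers less familiar with the two HRV metrics discussed here, the following minimal sketch shows their standard definitions: SDNN as the sample standard deviation of NN intervals, and RMSSD as the root mean square of successive NN-interval differences. It is not the study's processing pipeline; the function names and the four-beat example series are hypothetical, and outlier and ectopic-beat correction are assumed to have been applied beforehand.

```python
import math
import statistics


def sdnn(nn_ms):
    """Standard deviation of NN intervals (ms), using the sample (n - 1) form."""
    return statistics.stdev(nn_ms)


def rmssd(nn_ms):
    """Root mean square of successive differences between NN intervals (ms)."""
    diffs = [later - earlier for earlier, later in zip(nn_ms, nn_ms[1:])]
    if not diffs:
        raise ValueError("need at least 2 NN intervals")
    return math.sqrt(sum(d * d for d in diffs) / len(diffs))


# Hypothetical NN intervals in milliseconds, already cleaned of artifacts.
nn_intervals = [800, 810, 790, 800]
print(f"SDNN  = {sdnn(nn_intervals):.2f} ms")
print(f"RMSSD = {rmssd(nn_intervals):.2f} ms")
```

      Because RMSSD depends only on beat-to-beat differences while SDNN reflects the overall spread, a single misdetected peak can inflate both, which is why the ECG traces around the largest SDNN change were inspected directly.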

      Author response image 1.

      Similarly, we identified the subjects and days corresponding to the most prominent RMSSD decrease in the sham treatment group. We verified the ECG quality for this subject and the accuracy of RR interval identification, and confirmed that there was no significant cardiovascular event during the subject’s stay in the ICU. Based on the inclusion and exclusion criteria defined in our protocol (Huguenard A et al., PLOS ONE, 2024), we did not exclude these data from our analysis.

      Huguenard A, Tan G, Johnson G, Adamek M, Coxon A, et al. (2024) Non-invasive Auricular Vagus nerve stimulation for Subarachnoid Hemorrhage (NAVSaH): Protocol for a prospective, triple-blinded, randomized controlled trial. PLOS ONE 19(8): e0301154. https://doi.org/10.1371/journal.pone.0301154

      To ensure accurate inferences about sympathetic and parasympathetic tone from these cardiovascular metrics, we have rigorously refined our methodologies, including correcting RR-interval outliers, correcting ectopic peaks, using state-of-the-art algorithms to identify the QRS complex, P wave, and T wave (please refer to the response to comment 3.5), and performing factor analysis. In addition, no significant cardiac complications have been reported by the attending physicians for the subjects included in this study. Nonetheless, it is important to note that ECG patterns in patients with SAH differ from those in healthy individuals, potentially impacting the accuracy of R peak identification. For example, one identified R peak (out of 73) was a Q peak (F in the above figure). The pathology associated with SAH complicates the precise calculation of cardiovascular metrics and the interpretation of the results. We are committed to continually improving our methodologies for assessing autonomic function in SAH patients. We have now discussed these limitations in the Discussion section (p31-32): “Mild cardiac abnormalities are common in SAH patients5, complicating the precise calculation of cardiovascular metrics from ECG signals and the interpretation of the results. Systematic verification of methods for calculating cardiovascular metrics to ensure their applicability in SAH patients is crucial.”

      The above concern will also affect the power analysis, which was reported by authors to have been performed based on the t-test assuming the medium effect size, but the details of sample size calculations were not reported, e.g., X% power, t-test assumed Bonferroni correction in the power analysis, etc.

      Thank you for raising this concern. The current study is part of the NAVSaH trial (NCT04557618), focusing on the trial’s secondary outcomes (please refer to comment 2.1 and our responses). The main objective of this interim analysis is to evaluate the cardiovascular safety of the current taVNS protocol. Goal enrollment for the pilot NAVSaH trial is 50 patients, based on power calculations to detect significant differences in inflammatory cytokines, radiographic vasospasm, and chronic hydrocephalus. The detailed power analysis is described in the protocol (Huguenard A et al., PLOS ONE, 2024):

      “Under a 2-by-2 repeated measures design consisting of two groups of patients, each measured at two time points, our goal is to compare the change across time in the taVNS group to the change across time in the Sham group. Based upon previous work from Koopman et al. [67], we assume our study will observe 1.1 standardized inflammatory cytokines mean change difference between the two groups. Using a two-sided, two-sample t-test, assuming both time points have equal variance and there is a weak correlation (i.e., 0.15) between measurement pairs, a sample size of 25 in each group achieves at least 80% power to detect a standardized difference of 1.1 in mean changes, with a significance level (alpha) of 0.05 [68].

      Based on our preliminary data, we assume this study will observe 25% and 55% severe vasospasm in the taVNS and Sham groups, respectively. Under a design with 2 repeated measurements (i.e., 2 raters), assuming a compound symmetry covariance structure with a Rho of 0.2, at a significance level (alpha) of 0.05, a sample size of 25 in each group achieves at least 80% power when the null proportion is 0.55, and the alternative proportion is 0.25 [69–71].

      As previously described, LV et al. [8] studied the relationship between cytokine levels and clinical endpoints in SAH, including hydrocephalus. From their outcomes, we predict a needed enrollment of approximately 50 to detect these endpoints. From our own preliminary data, with an incidence of chronic hydrocephalus 0% in treated patients and 28.6% in control (despite grade of hemorrhage), alpha = 0.05 and power = 0.80, the projected sample size to capture that change is approximately 44 patients.”

      In this study, we used power analysis to report the achieved power of non-significant findings. For example, a Mann-Whitney U test on heart rate change between the treatment groups revealed no significant differences. We then used power analysis to calculate the achieved power. We have added the details of power analysis in the Method section (p34): “We calculated the achieved power of tests on heart rate change between the treatment groups assuming a medium effect size (Cohen’s d of 0.5) and a Type I error probability (α) of 0.05. Given that the Mann-Whitney U test is a non-parametric counterpart to the t-test and that the asymptotic relative efficiency of the U test relative to the t-test is 0.95 with normal distributions, we estimated the achieved power based on the power of a two-sample t-test, which is 0.93.” We have clarified this in the introduction section and in the method section (p6 and p38):

      “The current study is part of the NAVSaH trial (NCT04557618) and focuses on the trial’s secondary outcomes, including heart rate, QT interval, HRV, and blood pressure.30 This interim analysis aims to evaluate the cardiovascular safety of the taVNS protocol and to provide insights that will inform the application of taVNS in SAH patients. The primary outcomes of this trial, including change in the inflammatory cytokine TNF-α and rate of radiographic vasospasm, are available as a pre-print and currently under review.24”

      “In this study, we reported the statistical power achieved for tests that yielded non-significant results. The achieved power is calculated based on a two-sample t-test assuming a medium effect size (Cohen’s d of 0.5) and a Type I error probability (α) of 0.05.”

      If the study was designed to show a cardiovascular effect, I am surprised that N=10 per group was considered to be sufficiently powered given the extensive reports in the literature on how HRV measures (except when pathologically low) vary within individuals. Moreover, HRV measures are especially susceptible to noise, artifacts, and outliers.

      If the study was designed to show a lack of cardiovascular effect (as the conclusions and introduction seem to suggest), then a several-fold larger sample size is warranted.

      The primary goal of this study is to assess the cardiovascular safety of the current taVNS protocol in SAH patients (please refer to comments 2.1 and 3.8 and our responses). More specifically, we want to assess whether the current taVNS protocol is associated with bradycardia or QT prolongation. The data in this study included ECG signals and vital signals from 24 subjects recruited between 2021 and 2024. The total number of days in the ICU is 271 days, which corresponds to 542 taVNS/sham treatment sessions. These data allow us to detect significant cardiovascular effects of acute taVNS with high power. For example, the comparison of heart rate from pre- to post-treatment sessions between treatment groups had power > 99% (N1 = 188, N2 = 199, assuming 0.05 type I error probability, medium effect size two sample t-test).

      To safely conclude that there is no significant cardiovascular effect of repetitive taVNS on any given day following SAH, we would need to perform statistical tests between treatment groups on Day 1, Day 2, and Day N. In this context, 64 subjects per treatment group are required to achieve 80% power assuming medium effect size and 0.05 type I error probability (two-sample t-test). We have acknowledged this limitation in the Discussion section. Thank you for raising this concern!
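
      The sample-size figures quoted above can be checked approximately with a short calculation. The sketch below uses a normal approximation to the power of a two-sided, two-sample t-test; it is not the authors' calculation, and an exact calculation based on the noncentral t distribution gives slightly lower values. The function name is hypothetical.

```python
from statistics import NormalDist


def approx_power_two_sample(effect_size, n1, n2, alpha=0.05):
    """Approximate power of a two-sided two-sample t-test.

    Normal approximation: the test statistic is treated as normal with
    mean effect_size * sqrt(n1 * n2 / (n1 + n2)) and unit variance.
    """
    z = NormalDist()
    z_crit = z.inv_cdf(1 - alpha / 2)
    shift = effect_size * (n1 * n2 / (n1 + n2)) ** 0.5
    # Probability of exceeding the critical value in either tail.
    return z.cdf(shift - z_crit) + z.cdf(-shift - z_crit)


# Medium effect size (Cohen's d = 0.5), alpha = 0.05.
power_per_day = approx_power_two_sample(0.5, 64, 64)
power_pooled = approx_power_two_sample(0.5, 188, 199)
print(f"64 per group:     power ~ {power_per_day:.3f}")
print(f"N1=188, N2=199:   power ~ {power_pooled:.4f}")
```

      Under this approximation, 64 subjects per group gives a power of about 0.8, and the pooled pre- to post-treatment comparison (N1 = 188, N2 = 199) exceeds 0.99, consistent with the values stated in the response.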

      The results reported in this study treat each day as an independent sample for several reasons. First, heart rate and HRV metrics exhibited great daily variations (Figure in comment 3.7, for example). Their value on one day was not predictive of the metrics on another day, which could be due to medications, interventions, or individualized SAH recovery process during the patient’s stay in the ICU. Second, SAH patients in the ICU often experience rapid/daily changes in clinical status, including fluctuations in intracranial pressure, blood pressure, neurological status, and other vital signs. Also, the recovery process from SAH is highly individualized, with different patients exhibiting distinct trajectories of recovery or complications. Day-to-day cardiovascular function changes varied as the patient recovered or encountered setbacks. Moreover, we verified ECG signal quality, corrected outliers and artifacts in ECG processing, and employed a state-of-the-art QRS delineation method (Please refer to comment 3.5). All these ensure the accuracy of our reported results.

      The revised Discussion section now reads (p31): “Our study considers each day as an independent sample for the following considerations: 1. heart rate and HRV metrics exhibited great daily variations. Their value on one day was not predictive of the metrics on another day, which could be due to medications, interventions, or individualized SAH recovery process during the patient’s stay in the ICU. 2. SAH patients in the ICU often experience daily changes in clinical status, including fluctuations in intracranial pressure, blood pressure, neurological status, and other vital signs. 3. Day-to-day cardiovascular function changes varied as the patient recovered or encountered setbacks. To conclusively establish that there is no significant cardiovascular effect of repetitive taVNS on any given day following SAH, we would need to perform statistical tests between treatment groups for each day. In this context, 64 subjects per treatment group are required to achieve 80% power assuming medium effect size and 0.05 type I error probability (two-sample t-test).”

    1. Author response:

      The following is the authors’ response to the original reviews.

      Public Reviews:

      Reviewer #1 (Public Review):

      (1) It is a nice study but lacks some functional data required to determine how useful these alleles will be in practice, especially in comparison with the figure line that stimulated their creation.

      We are grateful for this comment. Regarding the usefulness of these alleles, Figure 3 shows that specific and efficient genetic manipulation of one cell subpopulation can be achieved by crossing the DreER mouse strain with the roxCre mouse strain. In addition, Figure 6 shows that R26-loxCre-tdT can effectively ensure Cre-loxP recombination on some gene alleles for genetic manipulation. The expression of the tdT protein is aligned with the expression of the Cre protein (Alb roxCre-tdT and R26-loxCre-tdT, Figure 2 and Figure 5), which ensures the accuracy of the tracing experiments. We believe more functional data can be shown in future articles that use the mouse lines described in this manuscript.

      (2) The data in Figure 5 show strong activity at the Confetti locus, but the design of the newly reported R26-loxCre line lacks a WPRE sequence that was included in the iSure-Cre line to drive very robust protein expression.

      Thank you for bringing up this point in the manuscript. In the R26-loxCre-tdT mice knock-in strategy, the WPRE sequence is added behind the loxCre-P2A-tdT sequence, as shown in Supplementary Figure 9.

      (3) The most valuable experiment for such a new tool would be a head-to-head comparison with iSure (or the latest iSure version from the Benedito lab) using the same CreER and target floxed allele. At the very least a comparison of Cre protein expression between the two lines using identical CreER activators is needed.

      Thank you for your valuable and insightful comment. The comparison results of R26-loxCre-tdT with iSuRe-Cre using Alb-CreER and targeting R26-Confetti can be found in Supplementary Figure 7 C-E, according to the reviewer’s suggestion.

      (4) Why did the authors not use the same driver to compare mCre 1, 4, 7, and 10? The study in Figure 2 uses Alb-roxCre for 1 and 7 and Cdh5-roxCre for 4 and 10, with clearly different levels of activity driven by the two alleles in vivo. Thus whether mCre1 is really better than mCre4 or 10 is not clear.

      Thank you for raising this concern. After screening out four robust versions of mCre, we generated these four roxCre knock-in mice. We could not predict which mCre version would be the most robust in vivo; one or two mCre versions might work efficiently. For example, if Alb-mCre1 was competitive with Cdh5-mCre10, we could use them for targeting genes in different cell types, broadening the potential utility of these mice.

      (5) Technical details are lacking. The authors provide little specific information regarding the precise way that the new alleles were generated, i.e. exactly what nucleotide sites were used and what the sequence of the introduced transgenes is. Such valuable information must be gleaned from schematic diagrams that are insufficient to fully explain the approach.

      We appreciate your thoughtful suggestions. The schematic figures, along with the nucleotide sequences for the generation of mice, can be found in the revised Supplementary Figure 9.

      Reviewer #2 (Public Review):

      (1) The scenario where the lines would demonstrate their full potential compared to existing models has not been tested.

      Thank you for your thoughtful and constructive comment. The comparative analysis of R26-loxCre-tdT with iSuRe-Cre, employing Alb-CreER to target R26-Confetti, is provided in Supplementary Figure 7 C-E.

      (2) The challenge lies in performing such experiments, as low doses of tamoxifen needed for inducing mosaic gene deletion may not be sufficient to efficiently recombine multiple alleles in individual cells while at the same time accurately reporting gene deletion. Therefore, a demonstration of the efficient deletion of multiple floxed alleles in a mosaic fashion would be a valuable addition.

      Thank you for your constructive comments. Mosaic analysis using sparse labeling and efficient gene deletion would be our future direction using roxCre and loxCre strategies.

      (3) When combined with the confetti line, the reporter cassette will continue flipping, potentially leading to misleading lineage tracing results.

      Thank you for your professional comments. Indeed, the Confetti reporter used in this study can continue flipping, which could lead to misleading lineage tracing results. We used R26-Confetti to demonstrate the robustness of mCre for recombination. Some multicolor mouse lines that do not flip have been published, for example, R26-Confetti2 (10.1038/s41588-019-0346-6) and Rainbow (10.1161/CIRCULATIONAHA.120.045750). These reporters could be used for tracing Cre-expressing cells without concerns about flipping of reporter cassettes.

      (4) Constitutive expression of Cre is also associated with toxicity, as discussed by the authors in the introduction.

      Thank you for your professional comments. The toxicity of constitutive expression of Cre and the toxicity associated with tamoxifen treatment in CreER mouse lines (10.1038/s44161-022-00125-6) are known to the field. This study does not resolve the toxicity of constitutive Cre expression. Many mouse lines with constitutive Cre driven by different promoters are used across various fields and present similar toxicity. To solve this issue, it would be possible to construct a new strategy that enables the removal of Cre after its expression.

      Reviewer #3 (Public Review):

      (1) Although leakiness is rather minor according to the original publication and the senior author of the study wrote in a review a few years ago that there is no leakiness (https://doi.org/10.1016/j.jbc.2021.100509).

      Thank you so much for your careful check. In this review (https://doi.org/10.1016/j.jbc.2021.100509), the comments on iSuRe-Cre were written from a reader's perspective, and all summary statements are based on the original published paper (10.1038/s41467-019-10239-4). We have since tested iSuRe-Cre ourselves. We did detect some leakiness in the heart and muscle, but hardly any in other tissues, as shown in Author response image 1.

      Author response image 1.

      Leakiness in the Alb CreER;iSuRe-Cre mouse line. Pictures are representative results from 5 mice. Scale bars (white), 100 µm.

      (2) I would have preferred to see a study, which uses the wonderful new tools to address a major biological question, rather than a primarily technical report, which describes the ongoing efforts to further improve Cre and Dre recombinase-mediated recombination.

      We gratefully appreciate your valuable comment. The roxCre and loxCre mice mentioned in this study provide more effective methods for inducible genetic manipulation in studying gene function. We hope that the application of our new genetic tools could help address some major biological questions in different biomedical fields in the future.

      (3) Very high levels of Cre expression may cause toxic effects as previously reported for the hearts of Myh6-Cre mice. Thus, it seems sensible to test for unspecific toxic effects, which may be done by bulk RNA-seq analysis, cell viability, and cell proliferation assays. It should also be analyzed whether the combination of R26-roxCre-tdT with the Tnni3-Dre allele causes cardiac dysfunction, although such dysfunctions should be apparent from potential changes in gene expression.

      We are sorry that we mistakenly wrote R26-loxCre-tdT as R26-roxCre-tdT in our manuscript. We have not generated the R26-roxCre-tdT mouse line. We also thank the reviewer for raising concerns about the toxicity of high Cre expression. The toxicity of constitutive expression of Cre and the toxicity of tamoxifen treatment in CreER mouse lines (10.1038/s44161-022-00125-6) are known to the field. This study does not resolve the toxicity of constitutive Cre expression. Many mouse lines with constitutive Cre driven by different promoters are used across various fields and present similar toxicity. To solve this issue, it would be possible to construct a new strategy that enables the removal of Cre after its expression.

      (4) Is there any leakiness when the inducible DreER allele is introduced but no tamoxifen treatment is applied? This should be documented. The same also applies to loxCre mice.

      In this study, we developed new mouse tool lines, including Alb roxCre1-tdT, Cdh5 roxCre4-tdT, Alb roxCre7-GFP, Cdh5 roxCre10-GFP and R26-loxCre-tdT. As shown in Supplementary Figure 1, Supplementary Figure 2, and Figure 4D, Alb roxCre1-tdT, Cdh5 roxCre4-tdT, Alb roxCre7-GFP, Cdh5 roxCre10-GFP and R26-loxCre-tdT are not leaky. Therefore, if any leakiness appears when an inducible DreER or CreER allele is introduced, it is derived from the DreER or CreER allele. Additional pertinent experimental data can be found in Figure S4C, Figure S7A-B, and Figure S8A.

      (5) It would be very helpful to include a dose-response curve for determining the minimum dosage required in Alb-CreER; R26-loxCre-tdT; Ctnnb1flox/flox mice for efficient recombination.

      Thank you for your suggestion. We value your feedback and have incorporated your suggestion to strengthen our study. Relevant experimental data can be referenced in Figure S8E-G.

      (6) In the liver panel of Figure 4F, tdT signals do not seem to colocalize with the VE-cad signals, which is odd. Is there any compelling explanation?

      The staining in Figure 4F in the revision is intended to deliver optimized and high-resolution images.

      (7) The authors claim that "virtually all tdT+ endothelial cells simultaneously expressed YFP/mCFP" (right panel of Figure 5D). Well, it seems that the abundance of tdT is much lower compared to YFP/mCFP. If the recombination of R26-Confetti was mainly triggered by R26-loxCre-tdT, the expression of tdT and YFP/mCFP should be comparable. This should be clarified.

      Thank you so much for your careful check. We checked these signals carefully and did not find a “much lower” tdT signal. Because the file-upload website limits file size, image compression made some signals unclear. We have attached clear high-resolution images here. Author response image 2 shows how we split the tdT signal and compared it with YFP/mCFP.

      Author response image 2.

      (8) In several cases, the authors seem to have mixed up "R26-roxCre-tdT" with "R26-loxCre-tdT". There are errors in #251 and #256, and furthermore in the passage from line #278 to #301. In the lines #297 and #300 it should probably read "Alb-CreER; R26-loxCretdT;Ctnnb1flox/flox" rather than "Alb-CreER;R26-tdT2;Ctnnb1flox/flox".

      We are grateful for these careful observations. We have corrected these typos accordingly.

      Recommendations for the authors:

      Reviewer #1:

      (1) However, for it to be useful to investigators a more direct comparison with the Benedito iSure line (or the latest version) is required as that is the crux of the study.

      Thank you for emphasizing this point, which we have now addressed in the revised manuscript and in Figure S7D-G.

      (2) I would like to know how the authors will make these new lines available to outside investigators.

      Please contact the lead author by email to consult about the availability of new mouse lines developed in this study.

      (3) The discussion is overly long and fails to address potential weaknesses. Much of it reiterates what was already said in the results section.

      We are thankful for your critical evaluation, which has helped us improve our discussion.

      Reviewer #2:

      (1) Assessing the efficiency and accuracy of the lines in mosaic deletions of multiple alleles and reporting them in single cells after low-dose tamoxifen exposure would be highly beneficial to demonstrate the full potential of the models.

      We appreciate your careful consideration of this issue. Our future endeavors will focus on mosaic analysis utilizing sparse labeling and efficient gene deletion, employing both roxCre and loxCre strategies.

      (2) Performing FACS analysis to confirm that all targeted (Cre reporter-positive) cells are also tdT-positive would provide more precise data and avoid vague statements like 'virtually all' or 'almost complete' in the results section:

      Line 166: Although mCre efficiently labeled virtually all targeted cells (Figure S3A-E)…

      Line 293: ... and not a single tdT+ hepatocyte expressed Cyp2e1 (Figure 6D)... However, the authors do not provide any quantification. FACS would be ideal here.

      Line 244: ...expression of beta-catenin and GS almost disappeared in the 4W mutant sample... The resolution in the provided PDF is not adequate for assessment.

      Line 296: ... revealed almost complete deletion of Ctnnb1 in the Alb-CreER;R26-tdT2;Ctnnb1flox/flox mice...

      Thank you for suggesting these improvements, which have strengthened the robustness of our conclusions. In the revised version, we have incorporated FACS results that correspond to related sections. Additionally, a quantification statement has been included in the statistical analysis section. We appreciate your meticulous review and comments, which have significantly improved the clarity of our manuscript.

      (3) In the beginning of the results section, it is not clear which results are from this study and which are known background information (like Figure 1A). For example, it is not clear if Figure 1C presents data from R26-iSuRe-Cre. Please revise the text to more clearly present the experimental details and new findings.

      Thank you for this observation. Figure 1C presents data from this study, and we have revised the related statement for improved clarity.

      (4) Experimental details regarding the genetic constructs and genotyping of the new knock-in lines are missing. Are R26 constructs driven by the endogenous R26 promoter or were additional enhancers used?

      Thank you for emphasizing this point. The schematic figures and nucleotide sequences for the generation of mice can be found in the revised Supplementary Figure 9, which can help to address this issue.

      (5) The method used to quantify mCre activity in terms of reporter+ target cells is not specified. From images or by FACS?

      Additionally, if images were used for quantification, it would be important to provide details on the number of images analyzed, the number of cells counted per image, and how individual cells were identified.

      Thank you for your comment. We have included the quantification statement in the statistical analysis section. Analyzing R26-Confetti+ target cells using FACS is challenging due to the limitations of the sorting instrument. Consequently, we quantified the related data from images. Each dot on the chart represents one sample, and the quantification for each mouse was obtained by averaging the data from five 10x fields taken from different sections.

      (6) Line 160: These data demonstrate that roxCre was functionally efficient yet non-leaky. Functional efficiency in vivo was not shown in the preceding experiments.

      Functional efficiency in vivo is shown in Figures S1-S2 and S4C.

      (7) It would be useful to provide a reference for easy vs low-efficiency recombination of different reporter alleles (lines 56-58).

      We are grateful for this comment, as it has allowed us to improve the clarity of our explanation. Consequently, we have made the necessary modifications.

      (8) Discussion on the potential drawbacks and limitations of the lines would be useful.

      We are thankful for your evaluation, which has significantly helped us improve our discussion.

    1. Author response:

      The following is the authors’ response to the original reviews.

      Public Reviews:

      Reviewer #1 (Public Review):

      Summary:

      TMC7 knockout mice were generated by the authors and the phenotype was analyzed. They found that Tmc7 is localized to Golgi and is needed for acrosome biogenesis.

      Strengths:

      The phenotype of infertility is clear, and the results of TMC7 localization and the failed acrosome formation are highly reliable. In this respect, they made a significant discovery regarding spermatogenesis.

      Weaknesses:

      There are also some concerns, which are mainly related to the molecular function of TMC7 and Figure 5.

      (1) It is understandable that TMC7 exhibits some channel activity in the Golgi and somehow affects luminal pH or Ca2+, leading to the failure of acrosome formation. On the other hand, since they are conducting the pH and calcium imaging from the cytoplasm, I do not think that the effect of TMC7 channel function in Golgi is detectable with their methods.

      We agree with the reviewer that there is no direct evidence showing the effect of TMC7 channel function in the Golgi. We have changed the description in the revised manuscript.

      (2) Rather, it is more likely that they are detecting apoptotic cells that have no longer normal ion homeostasis.

      We thank the reviewer for raising this concern. We apologize for not labeling the postnatal stage in the original Figure 5. We measured intracellular Ca2+, pH and ROS in PD30 testes (revised Fig. S6a-c); no apoptotic cells were observed at this stage (revised Fig. S6e, f). Apoptotic cells were found in the seminiferous tubules and cauda epididymis of 9-week-old Tmc7–/– mice (revised Fig. 5e-f). We have included TUNEL data from the testes of PD21, PD30 and 9-week-old mice (revised Fig. 5e, f and Fig. S6e, f). In accordance with our findings, Tmc1 mutation has also been shown to reduce Ca2+ permeability and thereby trigger hair cell apoptosis (Fettiplace, R, PNAS. 2022) [1].

      (3) Another concern is that n is only 3 for these imaging experiments.

      As suggested by the reviewer, more replicates were included in imaging experiments.

      Reviewer #2 (Public Review):

      Summary:

      This study presents a significant finding that enhances our understanding of spermatogenesis. TMC7 belongs to a family of transmembrane channel-like proteins (TMC1-8), primarily known for their role in the ear. Mutations to TMC1/2 are linked to deafness in humans and mice and were originally characterized as auditory mechanosensitive ion channels. However, the function of the other TMC family members remains poorly characterized. In this study, the authors begin to elucidate the function of TMC7 in acrosome biogenesis during spermatogenesis. Through analysis of transcriptomics datasets, they identify TMC7 as a transmembrane channel-like protein with elevated transcript levels in round spermatids in both mouse and human testis. They then generate Tmc7-/- mice and find that male mice exhibit smaller testes and complete infertility. Examination of different developmental stages reveals spermatogenesis defects, including reduced sperm count, elongated spermatids, and large vacuoles. Additionally, abnormal acrosome morphology is observed beginning at the early-stage Golgi phase, indicating TMC7's involvement in proacrosomal vesicle trafficking and fusion. They observed localization of TMC7 in the cis-Golgi and suggest that its presence is required for maintaining Golgi integrity, with Tmc7-/- leading to reduced intracellular Ca2+, elevated pH, and increased ROS levels, likely resulting in spermatid apoptosis. Overall, the work delineates a new function of TMC7 in spermatogenesis and the authors suggest that its ion channel activity is likely important for Golgi homeostasis. This work is of significant interest to the community and is of high quality.

      Strengths:

      The biggest strength of the paper is the phenotypic characterization of the TMC7-/- mouse model, which has clear acrosome biogenesis/spermatogenesis defects. This is the main claim of the paper and it is supported by the data that are presented.

      Weaknesses:

      The claim is that TMC7 functions as an ion channel. It is reasonable to assume this given what has been previously published on the more well-characterized TMCs (TMC1/2), but the data supporting this is preliminary here, and more needs to be done to solidify this hypothesis. The authors are careful in their interpretation and present this merely as a hypothesis supporting this idea.

      We appreciate the insightful comment. It is indeed a limitation of our study that we lack strong evidence that TMC7 functions as an ion channel. We had planned to perform cellular electrophysiology in GC-1 cells heterologously expressing TMC7. However, TMC7 was trapped in the endoplasmic reticulum, like TMC1 and TMC2 (Yu X, PNAS. 2020)[2], and failed to localize to the Golgi. Following the reviewer’s suggestion, we have interpreted the molecular function of TMC7 more carefully and in more detail in the revised manuscript.

      Reviewer #3 (Public Review):

      Summary:

      In this study, Wang et al. have demonstrated that TMC7, a testis-enriched multipass transmembrane protein, is essential for male reproduction in mice. Tmc7 KO male mice are sterile due to reduced sperm count and abnormal sperm morphology. TMC7 co-localizes with GM130, a cis-Golgi marker, in round spermatids. The absence of TMC7 results in reduced levels of Golgi proteins, elevated abundance of ER stress markers, as well as changes of Ca2+ and pH levels in the KO testis. However, further confirmation is required because the analyses were performed with whole testis samples in spite of the differences in the germ cell composition in WT and KO testis. In addition, the causal relationships between the reported anomalies await thorough interrogation.

      Strengths:

      The microscopic images are of great quality, all figures are properly arranged, and the entire manuscript is very easy to follow.

      Weaknesses:

      (1) Tmc7 KO male mice show multiple anomalies in sperm production and morphogenesis, such as reduced sperm count, abnormal sperm head, and deformed midpiece. Thus, it is confusing that the authors focused solely on impaired acrosome biogenesis.

      We are grateful for your comments and suggestions. We agree and have added these spermiogenesis defects of Tmc7–/– mice to the abstract and discussion sections of the revised manuscript.

      (2) Further investigations are warranted to determine whether the abnormalities reported in this manuscript (e.g., changes in protein, Ca2+, and pH levels) are directly associated with the molecular function of TMC7 or are the byproducts of partially arrested spermiogenesis. Please find additional comments in "Recommendations for the authors".

      Thank you for raising this concern. Per your comments, we have included data of intracellular Ca2+, pH and ROS in PD21 testes. The intracellular homeostasis was impaired as early as PD21, indicating TMC7 depletion impairs cellular homeostasis which in turn results in arrested spermiogenesis.

      Recommendations for the authors:

      Reviewing Editor (Recommendations For The Authors):

      As noted by all three reviewers, current flow cytometry data does not necessarily support the 'ion channel' hypothesis, thus the phenotypic analysis is compelling but the molecular mechanism of how TMC7 facilitates acrosome biogenesis remains incomplete. It is highly recommended for the authors to at least discuss or test alternative hypotheses (as reviewer #2 suggested) such as the possibility of acting as 'lipid scramblase'. Also, the authors need to provide further explanation for other morphological defects if TMC7 is truly a functional ion channel in Golgi (and thus later at acrosome), which is also related to the key question of whether TMC7 is a functional ion channel.

      We thank the reviewing editor for the comments and suggestions. We agree that our study lacks strong evidence that TMC7 functions as an ion channel. We have discussed the possibility of TMC7 acting as a 'lipid scramblase' as suggested. We have also included data on intracellular Ca2+, pH and ROS in PD21 and PD30 testes.

      Indeed, Tmc7–/– mice exhibit other defects, including abnormal head morphology and disorganized mitochondrial sheaths. TMC7 is localized to the cis-Golgi apparatus and is required for maintaining Golgi integrity. Previous studies of Golgi-localized proteins, including GOPC (Yao R, PNAS. 2002)[2], HRB (Kang-Decker N. Science. 2001)[3] and PICK1 (Xiao N, JCI. 2009)[4], reported spermiogenesis defects similar to those of Tmc7–/– mice. It is therefore possible that the morphological defects in Tmc7–/– mice are due to impaired Golgi function.

      Reviewer #1 (Recommendations For The Authors):

      (1) The authors should provide more details about the imaging experiments using FACS. Since they only describe catalog numbers (Beyotime, S1056, S1006, S0033S) for imaging reagents, it is not immediately clear what reagents they actually used. Since they used Fluo3, BCECF, and DCFH, it would be better to mention their names.

      Thanks. We have provided more detailed reagent information as suggested.

      (2) I am also concerned that in the FACS there is no information at all about laser wavelength and filter properties. This is especially important for BCECF because the wavelength spectrum changes with pH. Also, if there are any positive controls for these imaging reagents, such as ionophores, it would be more convincing to include them.

      Thank you for your comment. The excitation wavelength was 488 nm for detecting Ca2+, pH and ROS by FACS. BCECF is a widely used probe for monitoring cellular pH, and the reagent from Beyotime (S1006) has been used in other studies (Chen S, Blood. 2016)[5], (Liu H, Cell Death Dis. 2022)[6]. To make the results more reliable, we have repeated these experiments in PD21 testes (revised Figure 5a-c). No positive controls for these reagents were used in our experiments.

      (3) As noted above, it is better to avoid directly linking the cell's abnormal ion homeostasis to TMC7 ion channel function in the text. The discussion should be changed to emphasize that the TMC7-deficient cells are apoptotic and that these physiological phenomena are occurring as a side effect of this apoptosis.

      Thank you for raising this concern. We agree with the reviewer that there is no direct evidence showing the effect of TMC7 channel function in the Golgi, and we have changed the description in the revised manuscript.

      We performed a new experiment to measure apoptosis and intracellular Ca2+, pH and ROS in PD21 testes. No apoptotic cells were observed at this stage. However, impaired cellular homeostasis was still found in the testes of PD21 Tmc7-/- mice. These data suggest that TMC7 depletion impairs cellular homeostasis and hence induces spermatid apoptosis.

      (4) While I understand that it appears to be difficult to experimentally verify the ion channel function of TMC7, it may be supportive to compare its amino acid sequence and/or 3D predicted structure with that of TMC1/2. Including a supplemental figure for this purpose would emphasize the possibility that TMC7 functions as an ion channel.

      We thank the reviewer for this great suggestion. We compared the amino acid sequences and predicted structures of TMC1 and TMC2 with those of TMC7. TMC1 had 81% sequence similarity with TMC7, with an RMSD (root mean square deviation) of 3.079. TMC2 had 82% sequence similarity with TMC7, with an RMSD of 2.176. These data suggest that TMC7 has an amino acid sequence and predicted structure similar to those of TMC1/2 and might function as an ion channel. We have included the predicted structures in revised Fig. S7.

      Author response image 1.
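      For readers unfamiliar with the RMSD values quoted above, the quantity can be sketched as below. The response does not state which software, atom selection, or superposition method was used, so this sketch assumes two already-superimposed structures with a one-to-one list of matched atoms; the function name and coordinates are hypothetical.

```python
import math


def rmsd(coords_a, coords_b):
    """Root mean square deviation between two matched coordinate lists.

    Both inputs are lists of (x, y, z) tuples in the same order and are
    assumed to be superimposed already; no alignment is performed here.
    """
    if len(coords_a) != len(coords_b) or not coords_a:
        raise ValueError("coordinate lists must be non-empty and equal in length")
    squared = sum(
        (ax - bx) ** 2 + (ay - by) ** 2 + (az - bz) ** 2
        for (ax, ay, az), (bx, by, bz) in zip(coords_a, coords_b)
    )
    return math.sqrt(squared / len(coords_a))


# Hypothetical two-atom example: every atom is displaced by one unit along z.
structure_a = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
structure_b = [(0.0, 0.0, 1.0), (1.0, 0.0, 1.0)]
deviation = rmsd(structure_a, structure_b)
```

      A lower value means the matched atoms sit closer together after superposition, which is why the two RMSD figures are read as a measure of structural similarity.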

      Reviewer #2 (Recommendations For The Authors):

      I do not have any experimental comments or concerns to address, but I do ask that the authors consider an alternative hypothesis. Based on prior data demonstrating that TMC1 is a mechanosensitive ion channel, the authors reasonably assume that TMC7 may also function as an ion channel. Although the authors observe alterations in cytosolic Ca2+ and pH upon loss of TMC7 by flow cytometry, which begins to support this hypothesis, these data do not directly demonstrate ion channel activity.

      I was wondering if the authors had considered whether TMC7 could also function as a lipid scramblase. TMC1 has also been proposed to function as a Ca2+-inhibited scramblase, where knockout of TMC1 leads to a loss of phosphatidylserine (PS) exposure and membrane blebbing at the apical region of hair cells (Ballesteros, A. and Swartz, K., Science Advances, 2022). Furthermore, TMC proteins are structurally related to the Anoctamin/TMEM16 family of chloride channels and lipid scramblases, where TMEM16A-B are bona fide Ca2+-activated chloride channels, and TMEM16C-H are characterized as Ca2+-dependent scramblases. Based on their structural similarity and the observation that TMC1 may also exhibit lipid scrambling properties based on the PS exposure, I wonder if the authors may have data that support a TMC7 scramblase hypothesis. I was intrigued by this idea, especially given the authors' observations of large vacuoles in the seminiferous tubules and cauda epididymis and the vesicle accumulation phenotype in their TEM data. Incorporating this hypothesis into the discussion section, at minimum, could provide a valuable perspective, and this line of thought may lead to interesting data interpretation throughout the paper.

      We thank the reviewer for the valuable suggestion. We have discussed the possibility of TMC7 acting as 'lipid scramblase' as suggested.

      Reviewer #3 (Recommendations For The Authors):

      (1) Gene symbols should be italicized, and protein symbols should be capitalized.

      Thanks. We have made changes to the manuscript as recommended.

      (2) Tmc7 KO males show reduced sperm count, which alters the germ cell composition in the testis (Figure 2g). Thus, it is inappropriate to compare protein levels using whole testis lysates (Figure 3e, 4h, 5d, 5f). Instead, the same immunoblotting analyses could be done with purified round spermatids or 3-wk-old testis. Likewise, the significance of the intracellular Ca2+ and pH measurements is potentially diminished by the differences in the germ cell composition in WT and KO mice.

      We appreciate this constructive suggestion. We agree with the reviewer that whole testis lysates diminish the differences between WT and Tmc7-/- mice. However, we are unable to purify round spermatids due to the lack of specific markers.

      (3) Figures 2i, 2j: How sperm motility was measured should be specified in the Methods.

      Thank you for the reminder. We have added the sperm motility assessment to the Methods section.

      (4) Figure 4g: It does not make sense to compare the fluorescence intensity of these proteins without making sure that the seminiferous tubules are in the same stage. As shown in Figures S5a and S5b, TMC7 exhibits varied abundance in spermatids at different steps.

      We thank the reviewer for the insightful comment. We have replaced the images with ones from seminiferous tubules at the same stage and compared the fluorescence intensity in the new images as suggested.

      (5) Figure 4h: How were the band intensities measured? The third band from the left is visually stronger than the first one, but it does not seem to be so according to the column graph. The reviewer measured the intensity of GRASP65 bands relative to alpha-tubulin by ImageJ and obtained relative intensities of 0.35, 0.87, 0.6, and 0.08 for the bands from left to right. Additional replicates of the western blots should be included in the supplementary figures.

      Thank you for this insightful comment. The density and size of the blots were quantified with ImageJ. We checked the first GRASP65 band from the left, and it seems that the protein was not fully transferred onto the PVDF membrane. We have performed new experiments and replaced the original bands (Revised Fig. 4h). Additional replicates of the western blots have been included in revised Fig. S8.
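      The kind of loading-control normalization discussed in this exchange can be sketched as follows. This is only an illustration under assumptions: the raw densities are invented, the function name is hypothetical, and the exact ImageJ measurements used by the authors or the reviewer are not given in the text.

```python
def relative_intensity(target_densities, loading_densities):
    """Divide each target band density by its lane's loading-control density."""
    if len(target_densities) != len(loading_densities):
        raise ValueError("each lane needs one target and one loading value")
    if any(value <= 0 for value in loading_densities):
        raise ValueError("loading-control densities must be positive")
    return [t / l for t, l in zip(target_densities, loading_densities)]


# Hypothetical raw densities for four lanes (target band, then alpha-tubulin).
target = [350.0, 870.0, 600.0, 80.0]
tubulin = [1000.0, 1000.0, 1000.0, 1000.0]
ratios = relative_intensity(target, tubulin)
```

      Reporting the per-lane ratio alongside the raw blot makes it easier to check that a bar chart agrees with what is visible on the membrane.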

      (6) Figures 5a, 5b: Based on the observation of abnormal intracellular Ca2+ and pH levels in the KO germ cells, the authors concluded that TMC7 maintains the homeostasis of Golgi pH and ion (Lines 223-224, 263-264). However, intracellular Ca2+ and pH levels do not directly reflect those in the Golgi apparatus.

      We thank the reviewer for this important comment. We agree and have changed “Golgi” to “intracellular” as suggested.

      (7) Figure 5c: ROS is produced during apoptosis. Thus, it is not appropriate to conclude that the increased ROS levels in Tmc7 KO germ cells lead to apoptosis.

      Following the reviewer’s comment, we measured ROS and apoptosis in the testes of PD21 and PD30 mice. ROS levels were increased, but no apoptotic cells were observed in the testes of PD21 and PD30 Tmc7–/– mice. Apoptotic cells were observed in the testes of 9-week-old Tmc7–/– mice (Revised Fig. 5e-f). These data suggest that TMC7 depletion results in the accumulation of ROS, which in turn leads to apoptosis.

      (1) Fettiplace, R., D.N. Furness, and M. Beurg, The conductance and organization of the TMC1-containing mechanotransducer channel complex in auditory hair cells. Proc Natl Acad Sci U S A, 2022. 119(41): p. e2210849119.

      (2) Yu, X., et al., Deafness mutation D572N of TMC1 destabilizes TMC1 expression by disrupting LHFPL5 binding. Proc Natl Acad Sci U S A, 2020. 117(47): p. 29894-29903.

      (3) Kang-Decker, N., et al., Lack of acrosome formation in Hrb-deficient mice. Science, 2001. 294(5546): p. 1531-3.

      (4) Xiao, N., et al., PICK1 deficiency causes male infertility in mice by disrupting acrosome formation. J Clin Invest, 2009. 119(4): p. 802-12.

      (5) Chen, S., et al., Sympathetic stimulation facilitates thrombopoiesis by promoting megakaryocyte adhesion, migration, and proplatelet formation. Blood, 2016. 127(8): p. 1024-35.

      (6) Liu, H., et al., PRMT5 critically mediates TMAO-induced inflammatory response in vascular smooth muscle cells. Cell Death Dis, 2022. 13(4): p. 299.

    1. Author response:

      The following is the authors’ response to the original reviews.

      eLife assessment

      This manuscript presents useful findings on several phage from deep sea isolates of Lentisphaerae strains WC36 and zth2 that further our understanding of deep sea microbial life. The manuscript's primary claim is that phage isolates augment polysaccharide use in Pseudomonas bacteria via auxiliary metabolic genes (AMGs). However, the strength of the evidence is incomplete and does not support the primary claims. Namely, there are not data presented to rule out phage contamination in the polysaccharide stock solution, AMGs are potentially misidentified, and there is missing evidence of successful infection.

      We thank the Editor and Reviewers for their positive and constructive comments, which helped us improve the quality of our manuscript entitled “Deep-sea bacteriophages facilitate host utilization of polysaccharides” (paper#eLife-RP-RA-2023-92345). We have studied the comments carefully and revised the manuscript accordingly. We removed some uncertain results and strengthened other parts of the manuscript, which improved the accuracy and impact of the revised version. Revised portions are marked in blue in the modified manuscript. Our detailed responses follow.

      Public Reviews:

      Reviewer #1 (Public Review):

      Summary: This manuscript describes the identification and isolation of several phage from deep sea isolates of Lentisphaerae strains WC36 and zth2. The authors observe induction of several putative chronic phages with the introduction of additional polysaccharides to the media. The authors suggest that two of the recovered phage genomes encode AMGs associated with polysaccharide use. The authors also suggest that adding the purified phage to cultures of Pseudomonas stutzeri 273 increased the growth of this bacterium due to augmented polysaccharide use genes from the phage. While the findings were of interest and relevance to the field, it is my opinion that several of the analysis fall short of supporting the key assertions presented.

      Thanks for your comments. We removed some uncertain results and strengthened other parts of the manuscript, which improved the accuracy and impact of the revised version. Our detailed responses follow.

      Strengths: Interesting isolate of deep sea Lentisphaerae strains which will undoubtedly further our understanding of deep sea microbial life.

      Thanks for your positive comments.  

      Weaknesses:

      (1) Many of the findings are consistent with a phage contamination in the polysaccharide stock solution. 

      Thanks for your comments. We are confident that the phages are derived specifically from the Lentisphaerae strain WC36 and not from the polysaccharide stock solution, for the following reasons: (1) the polysaccharide stock solution was strictly sterilized to remove any phage contamination; (2) we performed multiple TEM checks of the rich medium supplemented with 10 g/L laminarin alone (Supplementary Fig. 1A) or 10 g/L starch alone (Supplementary Fig. 1B) and found no phage-like structures, which confirmed that the polysaccharides (laminarin/starch) we used were not contaminated with phages; in addition, we observed the polysaccharides (laminarin/starch) directly by TEM and did not find any phage-like structures (Supplementary Fig. 2); (3) the polysaccharide (starch) alone could not promote the growth of Pseudomonas stutzeri 273, whereas starch together with the extracted Phages-WC36 effectively facilitated the growth of Pseudomonas stutzeri 273 (Author response image 1). These results indicate that the phages were derived from the Lentisphaerae strain WC36 and not from the polysaccharide stock solution.

      Author response image 1.

      Growth curve and status of Pseudomonas stutzeri 273 cultivated in basal medium, basal medium supplemented with 20 μl/mL Phages-WC36, basal medium supplemented with 5 g/L starch, basal medium supplemented with 5 g/L starch and 20 μl/mL Phages-WC36. 

       

      (2) The genes presented as AMGs are largely well known and studied phage genes which play a role in infection cycles.

      Thanks for your comments. Indeed, these AMGs may be common only in virulent phages and have never been reported in chronic phages. In virulent phages, these genes typically act as lysozymes, facilitating the release of virions from the host cell upon lysis or the injection of viral DNA upon infection. Chronic phages, however, do not lyse the host. The persistence of these genes in chronic phages may therefore reflect their ability to help the host metabolize polysaccharides. Following your suggestion, we have toned down the role of the AMGs and now describe them as “potential” AMGs. The detailed information is shown below.

      (3) The evidence that the isolated phage can infect Pseudomonas stutzeri 273 is lacking, putting into question the dependent results.

      Thanks for your comments. We tested many marine strains (Pseudomonadota, Planctomycetes, Verrucomicrobia, Fusobacteria, and Tenericutes isolates) to investigate whether Phages-WC36 could assist them in degrading and utilizing polysaccharides, and found that Phages-WC36 promoted the growth of strain 273 only. Filamentous phages have been reported to recognize and bind to the host pili, which causes the pili to retract and brings the phages closer to, and possibly through, the outer membrane of the host cell. Other chronic phages may be released without breaking the host because they are enclosed in a lipid membrane and leave the host cell in a nonlytic manner. Thus, these chronic phages may have a wider host range. However, we were unable to further reveal the infection mechanism because we lacked the necessary techniques. Therefore, following your suggestion, we have deleted this section from the revised manuscript.

      Reviewer #1 (Recommendations For The Authors):

      I have previously reviewed this manuscript as a submission to another journal in 2022. My recommendations here mirror those of my prior suggestions, now with further added details.

      Thank you for your great efforts in reviewing our manuscript and for your valuable suggestions on both the previous and the current versions.

      Specific comments:

      Comment 1: Line 32. Rephrase to "polysaccharides cause the induction of multiple temperate phages infecting two strains of Lentisphaerae (WC36 and zth2) from the deep sea."

      Thanks for your positive suggestion. We have modified this description as “Here, we found for the first time that polysaccharides induced the production of multiple temperate phages infecting two deep-sea Lentisphaerae strains (WC36 and zth2).” in the revised manuscript (Lines 31-33). 

      Comment 2: Line 66. "Chronic" infections are not "lysogenic" as described here, suggesting the former is a subcategory of the latter. If you are going to introduce lifecycles you need a brief sentence distinguishing "chronic" from "lysogenic"

      Thanks for your positive suggestion. We added this sentence as “Currently, more and more attention has been paid to chronic life cycles where bacterial growth continues despite phage reproduction (Hoffmann Berling and Maze, 1964), which was different from the lysogenic life cycle that could possibly lyse the host under some specific conditions.” in the revised manuscript (Lines 66-69).

      Comment 3: Line 72. Please avoid generalized statements like "a hand-full" (or "plenty" line 85). Try to be at least somewhat quantitative regarding how many chronic phages are known. This is a fairly common strategy among archaeal viruses. 

      Thanks for your suggestion. Given that some filamentous phages also have a chronic life cycle that is not explicitly reported, we cannot accurately estimate their numbers. According to your suggestions, we have modified these descriptions as “however, to our best knowledge, only few phages have been described for prokaryotes in the pure isolates up to date (Roux et al., 2019; Alarcón-Schumacher et al., 2022; Liu et al., 2022).” in the revised manuscript (Lines 73-75). In addition, the number of chronic phages in the biosphere cannot be accurately estimated, according to the latest report (Chevallereau et al., 2022), which showed that “a large fraction of phages in the biosphere are produced through chronic life cycles”. Therefore, we have modified this description as “Therefore, a large percentage of phages in nature are proposed to replicate through chronic life cycles” in the revised manuscript (Lines 87-88). 

      Comment 4: Line 93. While Breitbart 2012 is a good paper to cite here, there have been several, much more advanced analysis of the oceans virome. https://doi.org/10.1016/j.cell.2019.03.040 is one example, but there are several others. A deeper literature review is required in this section.  

      Thanks for your valuable suggestions. We have added further references and modified this description as “A majority of these viruses are bacteriophages, which exist widely in oceans and affect the life activities of microbes (Breitbart, 2012; Roux et al., 2016; Gregory et al., 2019; Dominguez-Huerta et al., 2022).” in the revised manuscript (Lines 94-97). 

      References related to this response:

      Roux, S., Brum, J.R., Dutilh, B.E., Sunagawa, S., Duhaime, M.B., Loy, A., Poulos, B.T., Solonenko, N., Lara, E., Poulain, J., et al. (2016) Ecogenomics and potential biogeochemical impacts of globally abundant ocean viruses. Nature 537:689-693. 

      Gregory, A.C., Zayed, A.A., Conceição-Neto, N., Temperton, B., Bolduc, B., Alberti, A., Ardyna, M., Arkhipova, K., Carmichael, M., Cruaud, C., et al. (2019) Marine DNA Viral Macro- and Microdiversity from Pole to Pole. Cell 177:1109-1123.e1114. 

      Dominguez-Huerta, G., Zayed, A.A., Wainaina, J.M., Guo, J., Tian, F., Pratama, A.A., Bolduc, B., Mohssen, M., Zablocki, O., Pelletier, E., et al. (2022) Diversity and ecological footprint of Global Ocean RNA viruses. Science 376:1202-1208.

      Comment 5: Line 137. I see the phage upregulation in Figure 1, however in the text and figure it would be good to also elaborate on what the background expression generally looks like. Perhaps a transcriptomic read normalization and recruitment to the genome with a display of the coverage map, highlighting the prophage would be helpful. Are the polysacharides directly influencing phage induction or is there some potential for another cascading effect?  

      Thanks for your comments. We have listed the expression of all phage-associated genes under the different conditions in Supplementary Table 1, which shows that the background expression was very low. The numbers in Fig. 1C are log2-transformed changes in gene expression of strain WC36 cultured in rich medium supplemented with 10 g/L laminarin relative to rich medium alone.

      In addition, our RT-qPCR results (Fig. 1D) also confirmed that these genes encoding phage-associated proteins were significantly upregulated when 10 g/L laminarin was added in the rich medium. According to your suggestions, we have modified this description as “In addition to the up-regulation of genes related to glycan transport and degradation, when 10 g/L laminarin was added in the rich medium, the most upregulated genes were phage-associated (e. g. phage integrase, phage portal protein) (Fig. 1C and Supplementary Table 1), which were expressed at the background level in the rich medium alone.” in the revised manuscript (Lines 136-140). Based on the present results, we speculate that polysaccharides might directly induce phage production, which needs to be verified by a large number of experiments in the future.
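      The log2 comparison mentioned for Fig. 1C can be illustrated with a short sketch. The expression values and the pseudocount are assumptions made for the example; the response does not describe the normalization or software used for the transcriptomic analysis.

```python
import math


def log2_fold_change(treated, control, pseudocount=1.0):
    """log2 ratio of expression in the treated versus the control condition.

    A pseudocount (an assumption of this sketch) keeps the ratio defined
    when background expression is zero or close to zero.
    """
    if treated < 0 or control < 0:
        raise ValueError("expression values must be non-negative")
    return math.log2((treated + pseudocount) / (control + pseudocount))


# Hypothetical normalized expression of one phage-associated gene.
laminarin_medium = 7.0   # rich medium + 10 g/L laminarin
rich_medium_only = 1.0   # background level
lfc = log2_fold_change(laminarin_medium, rich_medium_only)
```

      Because background expression is very low, the choice of pseudocount can noticeably change the reported value, so it is worth stating alongside the fold changes.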

      Comment 6: Line 179. We need some assurance that phage was not introduced by your laminarin or starch supplement. Perhaps a check on the TEM/sequencing check of supplement itself would be helpful? This may be what is meant on Line 188 "without culturing bacterial cells" however this is not clearly worded if that is the case. Additional note, further reading reinforces this as a key concern. Many of the subsequent results are consistent with a contaminated starch stock. 

      Thanks for your comments. We are very sure that the phages are specifically derived from the Lentisphaerae strain WC36 but not the polysaccharide stock solution. The reasons are as following: (1) we have performed multiple TEM checks of the rich medium supplemented with 10 g/L laminarin alone (Supplementary Fig. 1A) or in 10 g/L starch alone (Supplementary Fig. 1B), and there were not any phage-like structures, which confirmed that the polysaccharides (laminarin/starch) we used are not contaminated with any phage-like structures. In addition, we also observed the polysaccharides (laminarin/starch) directly by TEM and did not find any phage-like structures (Supplementary Fig. 2). According to your suggestions, we have modified this description as “We also tested and confirmed that there were not any phage-like structures in rich medium supplemented with 10 g/L laminarin alone (Supplementary Fig. 1A) or in 10 g/L starch alone (Supplementary Fig. 1B), ruling out the possibility of phage contamination from the polysaccharides (laminarin/ starch).” in the revised manuscript (Lines 158-162) and “Meanwhile, we also checked the polysaccharides (laminarin/ starch) in rich medium directly by TEM and did not find any phage-like structures (Supplementary Fig. 2).” in the revised manuscript (Lines 178-180). (2) the polysaccharide stock solution was strictly sterilized to remove any phage contamination. (3) the polysaccharide (starch) alone could not promote the growth of Pseudomonas stutzeri 273, however, the supplement of starch together with the extracted Phages-WC36 could effectively facilitate the growth of Pseudomonas stutzeri 273 (Response Figure 1). The above results clearly indicated the phage was derived from the Lentisphaerae strain WC36 but not the polysaccharide stock solution. 

      In addition, given that polysaccharides are a critical energy source for most microorganisms, we asked whether polysaccharides also induce the production of bacteriophages in other deep-sea bacteria. To this end, we cultured deep-sea representatives from four other phyla (Chloroflexi, Tenericutes, Proteobacteria, and Actinobacteria) in medium supplemented with laminarin/starch, and checked the supernatants of the cell suspensions by TEM as described above. We could not find any phage-like structures in these cell suspensions (Author response image 2), which also confirmed that there was no phage contamination in the polysaccharides.

      Author response image 2.

      Growth curve and status of Pseudomonas stutzeri 273 cultivated in basal medium, basal medium supplemented with 20 μl/mL Phages-WC36, basal medium supplemented with 5 g/L starch, basal medium supplemented with 5 g/L starch and 20 μl/mL Phages-WC36.   

      Author response image 3.

      TEM observation of the supernatants of cell suspensions of a Chloroflexi strain, a Tenericutes strain, a Proteobacteria strain and an Actinobacteria strain cultivated in rich medium supplemented with 10 g/L laminarin and 10 g/L starch. No phage-like particles could be observed.

      Comment 7: Line 223. Correct generalized wording "long time". 

      Thanks for your comments. We have changed “after for a long time” to “after 30 days” in the revised manuscript (Line 197).

      Comment 8: Line 229. Please more explicitly describe what these numbers are (counts of virion like structures - filamentous and hexagonal respectively?), the units (per µL?), and how these were derived. The word "around" should be replaced with mean and standard deviation values for each count from replicates, without which these are not meaningful.

      Thanks for your comments. The average numbers per microliter (µL) of filamentous and hexagonal phages in each condition were respectively calculated by randomly choosing ten TEM images. According to your suggestions, we have modified this description as “Specifically, the average number per microliter of filamentous phages (9.7, 29 or 65.3) extracted from the supernatant of strain WC36 cultured in rich medium supplemented with 10 g/L laminarin for 5, 10 or 30 days was higher than that cultured in rich medium supplemented with 5 g/L laminarin (4.3, 13.7 or 35.3) (Fig. 3B). The average number per microliter of hexagonal phages (9, 30, 46.7) extracted from the supernatant of strain WC36 cultured in rich medium supplemented with 10 g/L laminarin for 5, 10 or 30 days was higher than that cultured in rich medium supplemented with 5 g/L laminarin (4, 11.3 or 17.7) (Fig. 3C).” in the revised manuscript (Lines 203-210).
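
      The reviewer's request for a mean and standard deviation per condition can be illustrated with a minimal sketch. It assumes one phage count per TEM image and ten images per condition; the function name `summarize_counts` and the example counts are hypothetical and are not data from the study.

```python
from statistics import mean, stdev


def summarize_counts(counts_per_image):
    """Return (mean, sample standard deviation) of phage counts across TEM images."""
    if len(counts_per_image) < 2:
        raise ValueError("need at least two images to estimate a standard deviation")
    return mean(counts_per_image), stdev(counts_per_image)


# Hypothetical counts from ten TEM images (illustrative only, not study data).
example_counts = [3, 5, 4, 6, 5, 4, 5, 3, 4, 6]
avg, sd = summarize_counts(example_counts)
print(f"{avg:.1f} +/- {sd:.1f} phages per microliter (n={len(example_counts)})")
```

      Reporting the sample standard deviation (n - 1 in the denominator) alongside the mean is a common choice when the images are treated as a random sample of the preparation.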

      Comment 9: Line 242. This section should be included in the discussion of Figure 2 - around line 194.

      Thanks. According to your suggestion, we have moved this section to the discussion corresponding to Figure 2 (Lines 183-191).

      Comment 10: Figure 3. Stay consistent in the types of figures generated per strain. Figure 3A should be a growth curve.

      Thanks for your comments. Figure 3A is in fact a growth curve; the corresponding description, “(A) Growth curve of strain WC36 cultivated in either rich medium alone or rich medium supplemented with 5 g/L or 10 g/L laminarin for 30 days.”, is given in the legend of Figure 3 in this manuscript.

      Comment 11: Line 312. Move the discussion of AMGs to after the discussion of the phage genome identification.

      Thanks for your valuable comments. According to your suggestions, we have moved the discussion of AMGs to after the discussion of the phage genome identification.

      Comment 12: Line 312. It would be informative to sequence in-bulk each of your treatments as opposed to just sequencing the viral isolates (starch and no host included) to see what viruses can be identified in each. ABySS is also not a common assembler for viral analysis. Is there literature to support it as a sufficient tool in assembling viral genomes? What sequencing depths were obtained in your samples?

      Thanks for your comments. In previous studies, we did sequence the starch or laminarin alone (no host included) and did not detect any phage-related sequences. The ABySS software is described in the following publications (Jackman SD, Vandervalk BP, Mohamadi H, Chu J, Yeo S, Hammond SA, Jahesh G, Khan H, Coombe L, Warren RL, Birol I. ABySS 2.0: resource-efficient assembly of large genomes using a Bloom filter. Genome Res. 2017 May;27(5):768-777; Simpson JT, Wong K, Jackman SD, Schein JE, Jones SJ, Birol I. ABySS: a parallel assembler for short read sequence data. Genome Res. 2009 Jun;19(6):1117-23.), and it has also been used to assemble viral genomes in the following publications (Guo Y, Jiang T. First Report of Sugarcane Mosaic Virus Infecting Goose Grass in Shandong Province, China. Plant Dis. 2024 Mar 21. doi: 10.1094/PDIS-11-23-2514-PDN; Tang M, Chen Z, Grover CE, Wang Y, Li S, Liu G, Ma Z, Wendel JF, Hua J. Rapid evolutionary divergence of Gossypium barbadense and G. hirsutum mitochondrial genomes. BMC Genomics. 2015 Oct 12;16:770.). The sequencing depths of the phages of strains WC36 and zth2 were 350x and 365x, respectively.

      Comment 13: Line 323. Replace "eventually" with more detail about what was done to derive the genomes. Were these the only four sequences identified as viral?

      Thanks for your comments. We used the ABySS software (http://www.bcgsc.ca/platform/bioinfo/software/abyss) to perform genome assembly with multiple-Kmer parameters. VIBRANT v1.2.1 (Kieft et al., 2020), DRAM-v (Shaffer et al., 2020), VirSorter v1.0.5 (with categories 1 (“pretty sure”) and 2 (“quite sure”)) (Roux et al., 2015) and VirFinder v1.1 (with statistically significant viral prediction: score > 0.9 and P-value < 0.05) (Ren et al., 2017) with default parameters were used to identify viral genomes from the assembled sequences by searching against both the cultured and non-cultured viral NCBI-RefSeq database (http://blast.ncbi.nlm.nih.gov/) and the IMG/VR database (Camargo et al., 2023). The GapCloser software (https://sourceforge.net/projects/soapdenovo2/files/GapCloser/) was subsequently applied to fill the remaining local inner gaps and correct single-base polymorphisms in the final assembly results. All the detailed processes are described in the supplementary information. Only these four viral sequences had high scores, but they are not complete genomes. Viral sequences that were shorter and had lower scores were excluded.
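
      The VirFinder cutoff quoted above (score > 0.9 and P-value < 0.05) can be sketched as a simple filter. This is only an illustration of the stated thresholds: the `ContigPrediction` record, the function name, and the example values are hypothetical and do not reproduce VirFinder's own output format or the study's data.

```python
from dataclasses import dataclass


@dataclass
class ContigPrediction:
    """Hypothetical record holding one per-contig viral prediction."""
    contig_id: str
    score: float
    p_value: float


def passes_cutoff(pred, min_score=0.9, max_p=0.05):
    """Keep a contig only if score > min_score and p_value < max_p (both strict)."""
    return pred.score > min_score and pred.p_value < max_p


# Illustrative values only, not results from the study.
predictions = [
    ContigPrediction("contig_1", 0.97, 0.01),
    ContigPrediction("contig_2", 0.95, 0.20),  # fails the P-value cutoff
    ContigPrediction("contig_3", 0.60, 0.01),  # fails the score cutoff
]
kept = [p.contig_id for p in predictions if passes_cutoff(p)]
print(kept)
```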

      Comment 14: Line 328. We need some details about the host genomes here. How were these derived? What is their completeness/contamination? What is their size? If the bins are poor, these would not serve as a reliable comparison to identify integrated phage.

      Thanks for your comments. For genomic sequencing, strains WC36 and zth2 were grown in liquid rich medium supplemented with 5 g/L laminarin and starch and harvested after one week of incubation at 28 °C. Genomic DNA was isolated using the PowerSoil DNA isolation kit (Mo Bio Laboratories Inc., Carlsbad, CA). Thereafter, genome sequencing was carried out with both the Illumina NovaSeq PE150 (San Diego, USA) and Nanopore PromethION (Oxford, UK) platforms at Beijing Novogene Bioinformatics Technology Co., Ltd. Library construction, sequencing, and assembly were performed as previously described (Zheng et al., 2021b). We used seven databases to predict gene functions, including Pfam (Protein Families Database, http://pfam.xfam.org/), GO (Gene Ontology, http://geneontology.org/) (Ashburner et al., 2000), KEGG (Kyoto Encyclopedia of Genes and Genomes, http://www.genome.jp/kegg/) (Kanehisa et al., 2004), COG (Clusters of Orthologous Groups, http://www.ncbi.nlm.nih.gov/COG/) (Galperin et al., 2015), NR (Non-Redundant Protein Database), TCDB (Transporter Classification Database), and Swiss-Prot (http://www.ebi.ac.uk/uniprot/) (Bairoch and Apweiler, 2000). A whole genome Blast search (E-value less than 1e-5, minimal alignment length percentage larger than 40%) was performed against the above seven databases.
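
      The Blast acceptance criteria stated above can be written as a small predicate. The sketch below assumes that "alignment length percentage" means alignment length divided by query length; the response does not specify the denominator, so that interpretation, the function name `keep_hit`, and its parameters are all assumptions made for illustration.

```python
def keep_hit(evalue, align_len, query_len, max_evalue=1e-5, min_align_pct=40.0):
    """Return True if a hit meets both thresholds.

    Assumption: the alignment length percentage is measured against the query length.
    """
    if query_len <= 0:
        raise ValueError("query_len must be positive")
    align_pct = 100.0 * align_len / query_len
    # Both comparisons are strict, matching "less than 1e-5" and "larger than 40%".
    return evalue < max_evalue and align_pct > min_align_pct


# Illustrative calls with made-up values.
print(keep_hit(1e-10, 50, 100))
print(keep_hit(1e-3, 90, 100))
```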

      The completeness of the genomes of strains WC36 and zth2 was 100%, as checked with CheckM v1.2.2. The genome sizes of strains WC36 and zth2 were 3,660,783 bp and 3,198,720 bp, respectively. The complete genome sequences of strains WC36 and zth2 presented in this study have been deposited in the GenBank database with accession numbers CP085689 and CP071032, respectively.

      Moreover, to verify the absence of microbial contamination in the phage sequencing results, we used the BWA-MEM alignment algorithm (version 0.7.15) to map the host WGS reads to these phage sequences. We found that none of the raw reads of the host strains (WC36 and zth2) mapped to these phage sequences (Author response image 3, shown below). In addition, we evaluated the assembly graph underlying the host consensus assemblies. Clean reads were mapped to the complete bacterial genome sequences with Bowtie 2 (version 2.5.0), BWA (version 0.7.8) and SAMTOOLS (version 0.1.18). The results showed that the total mismatch rates of strains WC36 and zth2 were almost 0% and 0.03%, respectively (Author response table 1, shown below). We also collected cells of strains WC36 and zth2 and sent them to another company for whole genome sequencing (named WC36G and ZTH, GenBank accession numbers CP151801 and CP119760, respectively). The completeness of the genomes of strains WC36G and ZTH was also 100%, and their genome sizes were 3,660,783 bp and 3,198,714 bp, respectively. The raw reads of strains WC36G and zth2 also did not map to the phage sequences. Therefore, we can confirm that these bacteriophage genomes were completely outside of the host chromosomes.

      Author response image 4.

      The read mapping from WGS to phage sequences.

      Author response table 1.

      Sequencing depth and coverage statistics.

      References related to this response:

      Zheng, R., Liu, R., Shan, Y., Cai, R., Liu, G., and Sun, C. (2021b) Characterization of the first cultured free-living representative of Candidatus Izemoplasma uncovers its unique biology ISME J 15:2676-2691. 

      Ashburner, M., Ball, C.A., Blake, J.A., Botstein, D., Butler, H., Cherry, J.M., Davis, A.P., Dolinski, K., Dwight, S.S., Eppig, J.T., et al. (2000) Gene ontology: tool for the unification of biology. The Gene Ontology Consortium Nat Genet 25:25-29. 

      Kanehisa, M., Goto, S., Kawashima, S., Okuno, Y., and Hattori, M. (2004) The KEGG resource for deciphering the genome Nucleic Acids Res 32:D277-280. 

      Galperin, M.Y., Makarova, K.S., Wolf, Y.I., and Koonin, E.V. (2015) Expanded microbial genome coverage and improved protein family annotation in the COG database Nucleic Acids Res 43:D261-269. 

      Bairoch, A., and Apweiler, R. (2000) The SWISS-PROT protein sequence database and its supplement TrEMBL in 2000 Nucleic Acids Res 28:45-48.

      Comment 15: Line 333. This also needs some details. What evidence do you have that these are not chromosomal? If not chromosomal where can they be found? Sequencing efforts should also be able to yield extrachromosomal elements such as plasmids etc... If you were to sequence your purified isolate cultures from the rich media alone and include all assemblies (not just those binned for example) as a reference, would you be able to recruit viral reads? The way this reads suggests that Chevallereau et al., worked specifically with these phage, which is not the case - please rephrase.

      Thanks for your comments. We carefully compared the bacteriophage genomes with those of the corresponding hosts (strains WC36 and zth2) using Galaxy Version 2.6.0 (https://galaxy.pasteur.fr/) (Afgan et al., 2018) with the NCBI BLASTN method, and used the BWA-MEM software to map reads from host whole genome sequencing (WGS) to these bacteriophages. Both analyses showed that the bacteriophage genomes are completely outside of the host chromosomes. Therefore, we hypothesized that the phage genomes might exist in the host in a plasmid-like form.

      Comment 16: Line 335. More to the point here that we need confirmation that these phages were not introduced in the polysaccharide treatment

      Thanks for your comments. Please find our answers for this concern in the responses for comment 1 of “weakness” part and comment 6 of “Recommendations For The Authors” part.

      Comment 17: Line 342. Lacking significant detail here. Phylogeny based on what gene(s), how were the alignments computed/refined, what model used etc..?

      Thanks for your comments. According to your suggestions, all the related information is now given in the “Materials and methods” section of this manuscript. The maximum likelihood phylogenetic tree of Phage-WC36-2 and Phage-zth2-2 was constructed based on the terminase large subunit protein (terL). The proteins used to construct the phylogenetic trees were all obtained from the NCBI databases. All the sequences were aligned with MAFFT version 7 (Katoh et al., 2019) and manually corrected. The phylogenetic trees were constructed using the W-IQ-TREE web server (http://iqtree.cibiv.univie.ac.at) with the “GTR+F+I+G4” model (Trifinopoulos et al., 2016). Finally, we used the online tool Interactive Tree of Life (iTOL v5) (Letunic and Bork, 2021) to edit the tree.

      Comment 18: Line 346. How are you specifically defining AMGs in this study? Most of these are well-known and studied phage genes with specific life cycle functions and could not be considered as polysaccharide processing AMGs even though in host cells many do play a role in polysaccharide processing systems. A substantially deeper literature review is needed in this section, which would ultimately eliminate most of these from the potential AMG pools. Further, the simple HMM/BLASTp evalues are not sufficient to support the functional annotation of these genes. At a minimum, catalytic/conserved regions should be identified, secondary structures compared, and phylogenetic analysis (where possible) developed etc... My recommendation is to eliminate this section entirely from the manuscript. 

      Categorically:

      - Glycoside hydrolase (various families), glucosaminidases, and transglycosylase are all very common to phage and operate generally as a lysins, facilitating the release of virions from the host cell upon lysis, or injection of viral DNA upon infection https://doi.org/10.3389/fmicb.2016.00745 (and citations therein) https://doi.org/10.1016/j.cmi.2023.10.018 etc... In order to confirm these as distinct AMGs we would need a very detailed analysis indicating that these are not phage infection cycle/host recognition related, however I strongly suspect that under such interrogation, these would prove to be as such.

      -TonB related systems including ExbB are well studied among phages as part of the trans-location step in infection. These could not be considered as AMGs. https://doi.org/10.1128/JB.00428-19. Other TonB dependent receptors play a role in host recognition.

      -Several phage acetyltransferases play a role in suppressing host RNA polymerase in order to reserve host cell resources for virion production, including polysaccharide production. https://doi.org/10.3390/v12090976. Further it has been shown that the E. coli gene neuO (O-acetyltransferase) is a homologue of lambdoid phage tail fiber genes https://doi.org/10.1073/pnas.0407428102. I suspect the latter is also the case here and this is a tail fiber gene.

      Thanks for your valuable comments. According to your suggestions, we have reanalyzed these AMGs and made some modifications (the new version of Fig. 5A, shown below). Genes encoding proteins associated with polysaccharide transport and degradation may be common only in virulent phages, and have never been reported in chronic phages. In virulent phages, these genes typically act as lysozymes, facilitating the release of virions from the host cell upon lysis or the injection of viral DNA upon infection; chronic phages, by contrast, do not lyse the host. It has been reported that filamentous phages can recognize and bind to the host pili, which causes the pili to shrink and brings the filamentous phages closer to, and possibly through, the outer membrane of host cells (Riechmann et al., 1997; Sun et al., 1987). Other chronic phages might be released without breaking the host by being enclosed in a lipid membrane and released from the host cells in a nonlytic manner. It has recently been reported that tailless Caudoviricetes phage particles are enclosed in a lipid membrane and released from the host cells in a nonlytic manner (Liu et al., 2022), and that prophage induction contributes to the production of membrane vesicles by Lacticaseibacillus casei BL23 during cell growth (da Silva Barreira et al., 2022). Therefore, the persistence of these genes in chronic phages may be due to their ability to assist the host in metabolizing polysaccharides.

      Finally, according to your suggestions, we have weakened the role of AMGs and added “potential” in front of it.

      References related to this response:

      Riechmann L, Holliger P. (1997) The C-terminal domain of TolA is the coreceptor for filamentous phage infection of E. coli Cell 90:351-60.

      Sun TP, Webster RE. (1987) Nucleotide sequence of a gene cluster involved in entry of E colicins and single-stranded DNA of infecting filamentous bacteriophages into Escherichia coli J Bacteriol 169:2667-74. 

      Liu Y, Alexeeva S, Bachmann H, Guerra Martínez J.A, Yeremenko N, Abee T et al. (2022) Chronic release of tailless phage particles from Lactococcus lactis Appl Environ Microbiol 88:e0148321.

      da Silva Barreira, D., Lapaquette, P., Novion Ducassou, J., Couté, Y., Guzzo, J., and Rieu, A. (2022) Spontaneous prophage induction contributes to the production of membrane vesicles by the gram-positive bacterium Lacticaseibacillus casei BL23 mBio 13:e0237522.

      Comment 19: Line 354. To make this statement that these genes are missing from the host, we would need to know that these genomes are complete.

      Thanks for your comments. The completeness of the genomes of strains WC36 and zth2 was 100%, as checked with CheckM v1.2.2. The genome sizes of strains WC36 and zth2 were 3,660,783 bp and 3,198,720 bp, respectively. The complete genome sequences of strains WC36 and zth2 presented in this study have been deposited in the GenBank database with accession numbers CP085689 and CP071032, respectively. In addition, we also collected cells of strains WC36 and zth2 and sent them to another company for whole genome sequencing (named WC36G and ZTH, GenBank accession numbers CP151801 and CP119760, respectively). The completeness of the genomes of strains WC36G and ZTH was also 100%, and their genome sizes were 3,660,783 bp and 3,198,714 bp, respectively. Therefore, the genomes of strains WC36 and zth2 are complete and circular.

      Comment 20: Figure 5. Please see https://peerj.com/articles/11447/ and https://doi.org/10.1093/nar/gkaa621 for a detailed discussion on vetting AMGs. Several of these should be eliminated according to the standards set in the field. More specifically, and by anecdotal comparison with other inoviridae genomes, for Phage-WC36-1 and Phage-zth2-1, I am not convinced that the transactional regulator and glycoside hydrolase are a part of the phage genome. The phage genome probably ends at the strand switch.

      Thanks for your comments. According to your suggestions, we have analyzed these two articles carefully and modified the genome of Phage-WC36-1 and Phage-zth2-1 by anecdotal comparison with other inoviridae genomes. As you said, the transactional regulator and glycoside hydrolase are not a part of the phage genome.

      The new version Fig. 5A was shown.

      References related to this response:

      Shaffer, M., Borton, M.A., McGivern, B.B., Zayed, A.A., La Rosa, S.L., Solden, L.M., Liu, P., Narrowe, A.B., Rodríguez-Ramos, J., Bolduc, B., et al. (2020) DRAM for distilling microbial metabolism to automate the curation of microbiome function Nucleic Acids Res 48:8883-8900

      Pratama, A.A., Bolduc, B., Zayed, A.A., Zhong, Z.P., Guo, J., Vik, D.R., Gazitúa, M.C., Wainaina, J.M., Roux, S., and Sullivan, M.B. (2021) Expanding standards in viromics: in silico evaluation of dsDNA viral genome identification, classification, and auxiliary metabolic gene curation PeerJ 9:e11447

      Comment 21: Line 380. This section needs to start with detailed evidence that this phage can even infect this particular strain. Added note, upon further reading the serial dilution cultures are not sufficient to prove these phage infect this Pseudomonas. We need at a minimum a one-step growth curve and wet mount microscopy. It is much more likely that some carry over contaminant is invading the culture and influencing OD600. With the given evidence, I am not at all convinced that these phages have anything to do with Pseudomonas polysaccharide use and I recommend either drastically revising this section or eliminating it entirely.

      Line 386-389. Could this be because you are observing your added phage in the starch enriched media while no phage were introduced with the "other types of media" so none would be observed? This could have nothing to do with infection dynamics. Further, this would also be consistent with your starch solution being contaminated by phage.

      Line 399. Again consistent with the starch media being contaminated.

      Line 401-408. This is more likely to do with the augmentation of the media with an additional carbon source and not involving the phage. 

      Line 410. I am not convinced that these viruses infect the Pseudomonas strain. Extensive further evidence of infection is needed to make these assertions.  Figure 6A. We need confirmation that the isolate culture remains pure and there are no other contaminants introduced with the phage.

      Thanks for your comments. We have shown above that the polysaccharides (laminarin/starch) were not contaminated with any phages. We selected many marine strains (Pseudomonadota, Planctomycetes, Verrucomicrobia, Fusobacteria, and Tenericutes isolates) to investigate whether Phages-WC36 could assist them in the degradation and utilization of polysaccharides, and found that Phages-WC36 could only promote the growth of strain 273. Filamentous phages and hexagonal phages were detected in the supernatant of strain 273 cultured in basal medium supplemented with 5 g/L starch and 20 μl/mL Phages-WC36. After 3 passages of serial cultivation, filamentous phages and hexagonal phages were still present in basal medium supplemented with 5 g/L starch, but not in basal medium alone, which may mean that Phages-WC36 could infect strain 273 and that starch is an important inducer. In addition, the Phages-WC36 used in the growth assay of strain 273 were purified multiple times and finally suspended in SM buffer (0.01% gelatin, 50 mM Tris-HCl, 100 mM NaCl and 10 mM MgSO4). Thus, the phage preparation should not contain extracellular enzymes and/or nutrients. We also set up three control groups in the growth assay of strain 273: basal medium, basal medium supplemented with Phages-WC36, and basal medium supplemented with starch. If the Phages-WC36 preparation contained extracellular enzymes and/or nutrients, strain 273 would also grow well in basal medium supplemented only with Phages-WC36. However, the poor growth of strain 273 cultivated in basal medium supplemented with Phages-WC36 further confirmed that the phage preparation did not contain extracellular enzymes and/or nutrients.

      Finally, chronic phages might be released without breaking the host by being enclosed in a lipid membrane and released from the host cells in a nonlytic manner. Thus, these chronic phages may have a wider host range. However, we were unable to further disclose the infection mechanism in this paper. Therefore, according to your suggestions, we have deleted this section entirely.

      Comment 27: Line 460. Details about how these genomes were reconstructed is needed here.  

      Thanks for your comments. According to your suggestions, we have added the detailed information about the genome sequencing, annotation, and analysis as “Genome sequencing, annotation, and analysis of strains WC36 and zth2 For genomic sequencing, strains WC36 and zth2 were grown in the liquid rich medium supplemented with 5 g/L laminarin and starch and harvested after one week of incubation at 28 °C. Genomic DNA was isolated by using the PowerSoil DNA isolation kit (Mo Bio Laboratories Inc., Carlsbad, CA). Thereafter, the genome sequencing was carried out with both the Illumina NovaSeq PE150 (San Diego, USA) and Nanopore PromethION platform (Oxford, UK) at the Beijing Novogene Bioinformatics Technology Co., Ltd. A complete description of the library construction, sequencing, and assembly was performed as previously described (Zheng et al., 2021b). We used seven databases to predict gene functions, including Pfam (Protein Families Database, http://pfam.xfam.org/), GO (Gene Ontology, http://geneontology.org/) (Ashburner et al., 2000), KEGG (Kyoto Encyclopedia of Genes and Genomes, http://www.genome.jp/kegg/) (Kanehisa et al., 2004), COG (Clusters of Orthologous Groups, http://www.ncbi.nlm.nih.gov/COG/) (Galperin et al., 2015), NR (Non-Redundant Protein Database databases), TCDB (Transporter Classification Database), and Swiss-Prot (http://www.ebi.ac.uk/uniprot/) (Bairoch and Apweiler, 2000). A whole genome Blast search (E-value less than 1e-5, minimal alignment length percentage larger than 40%) was performed against above seven databases.” in the revised manuscript (Lines 333-351).

      Comment 28: Line 462. Accession list of other taxa in the supplement would help here.  

      Thanks for your comments. The accession numbers of these strains were displayed behind these strains in Figure 1A. According to your suggestions, we have added an accession list of these taxa (Supplementary Table 6) in the revised manuscript.

      Comment 29: Line 463. Is there any literature to support that these are phylogenetically informative genes for Inoviridae?  

      Thanks for your comments. Previous studies (Zeng et al., 2021; Evseev et al., 2023) support that these are phylogenetically informative genes for Inoviridae. We have added these references to the revised manuscript.

      References related to this response:

      Zeng, J., Wang, Y., Zhang, J., Yang, S., and Zhang, W. (2021) Multiple novel filamentous phages detected in the cloacal swab samples of birds using viral metagenomics approach Virol J 18:240

      Evseev, P., Bocharova, J., Shagin, D., and Chebotar, I. (2023) Analysis of Pseudomonas aeruginosa isolates from patients with cystic fibrosis revealed novel groups of filamentous bacteriophages. Viruses 15: 2215

      Reviewer #2 (Public Review):

      Summary: This paper investigates virus-host interactions in deep-sea bacteriophage systems which employ a seemingly mutualistic approach to viral replication in which the virus aids host cell polysaccharide import and utilization via metabolic reprogramming. The hypothesis being tested is supported with solid and convincing evidence and the findings are potentially generalizable with implications for our understanding of polysaccharide-mediated virus-host interactions and carbon cycles in marine ecosystems more broadly.

      Thanks for your positive comments.

      Strengths: This paper synthesizes sequencing and phylogenic analyses of two Lentisphaerae bacteria and three phage genomes; electron microscopy imaging of bacterial/phage particles; differential gene expression analyses; differential growth curve analyses, and differential phage proliferation assays to extract insights into whether laminarin and starch can induce both host growth and phage proliferation. The data presented convincingly demonstrate that both host culture density and phage proliferation increase as a result having host, phage, and polysaccharide carbon source together in culture.

      Thanks for your positive comments.  

      Weaknesses (suggestions for improvement): 

      (1) The article would be strengthened by the following additional experiment: providing the phage proteins hypothesized to be aiding host cell growth (red genes from Figure 5...TonB system energizer ExbB, glycosidases, etc) individually or in combination on plasmids rather than within the context of the actual phage itself to see if such additional genes are necessary and sufficient to realize the boosts in host cell growth/saturation levels observed in the presence of the phages tested.

      Thanks for your valuable comments. It is a really good idea to express these polysaccharide-degradation proteins individually or in combination on plasmids to see their effects in the host cell. However, we have so far failed to construct a genetic and expression system for the strictly anaerobic strain WC36, which hinders further detailed investigation of the functions of these polysaccharide-degradation proteins. We are trying our best to build a genetic and expression system for strain WC36 in our lab, and we will definitely test your idea in the future.

      (2) The paper would also benefit from additional experiments focused on determining how the polysaccharide processing, transport, and metabolism genes are being used by the phages to either directly increase viral infection/replication or else to indirectly do so by supporting the growth of the host in a more mutualistic manner (i.e. by improving their ability to import, degrade, and metabolize polysaccharides).  

      Thanks for your valuable comments. Indeed, because the chronic phage genome is not within the chromosome of the host, it is very hard to disclose the exact auxiliary process and mechanism of chronic phages. At present, we are trying to construct a genetic manipulation system for the strictly anaerobic host WC36, and we will gradually reveal this auxiliary mechanism in the future. In addition, in line with Reviewer 1’s suggestions, the revised manuscript now emphasizes that polysaccharides induce deep-sea bacteria to release chronic phages, and most of the content on phages assisting host metabolism of polysaccharides has been deleted.

      (3) The introduction would benefit from a discussion of what is known regarding phage and/or viral entry pathways that utilize carbohydrate anchors during host entry. The discussion could also be improved by linking the work presented to the concept of "selfishness" in bacterial systems (see for instance Giljan, G., Brown, S., Lloyd, C.C. et al. Selfish bacteria are active throughout the water column of the ocean. ISME COMMUN. 3, 11 (2023) https://doi.org/10.1038/s43705-023-00219-7). The bacteria under study are gram negative and it was recently demonstrated (https://www.nature.com/articles/ismej201726) that "selfish" bacteria sequester metabolizable polysaccharides in their periplasm to advantage. It is plausible that the phages may be hijacking this "selfishness" mechanism to improve infectivity and ENTRY rather than helping their hosts to grow and proliferate so they can reap the benefits of simply having more hosts to infect. The current work does not clearly distinguish between these two distinct mechanistic possibilities. The paper would be strengthened by at least a more detailed discussion of this possibility as well as the authors' rationale for interpreting their data as they do to favor the "mutualistic" interpretation. In the same light, the paper would benefit from a more careful choice of words which can also help to make such a distinction more clear/evident/intentional. As currently written the authors seem to be actively avoiding giving insights with respect to this question.

      Thanks for your valuable comments. According to your suggestions, we have added the related discussion as “Moreover, it was recently demonstrated that selfish bacteria, which were common throughout the water column of the ocean, could bind, partially hydrolyze, and transport polysaccharides into the periplasmic space without loss of hydrolysis products (Reintjes et al., 2017; Giljan et al., 2023). Based on our results, we hypothesized that these chronic phages might also enter the host through this “selfishness” mechanism while assisting the host in metabolizing polysaccharides, thus not lysing the host. On the other hand, these chronic phages might hijack this “selfishness” mechanism to improve their infectivity and entry, rather than helping their hosts to grow and proliferate, so they could reap the benefits of simply having more hosts to infect. In the future, we need to construct a genetic operating system of the strictly anaerobic host strain WC36 to detailedly reveal the relationship between chronic phage and host.” in the revised manuscript (Lines 305-316). 

      References related to this response:

      Reintjes, G., Arnosti, C., Fuchs, B.M., and Amann, R. (2017) An alternative polysaccharide uptake mechanism of marine bacteria ISME J 11:1640-1650

      Giljan, G., Brown, S., Lloyd, C.C., Ghobrial, S., Amann, R., and Arnosti, C. (2023) Selfish bacteria are active throughout the water column of the ocean ISME Commun 3:11

      (4) Finally, I would be interested to know if the author’s sequencing datasets might be used to inform the question raised above by using bacterial immunity systems such as CRISPR/Cas9. For example, if the phage systems studied are truly beneficial/mutualistic for the bacteria then it’s less likely that there would be evidence of targeted immunity against that particular phage that has the beneficial genes that support polysaccharide metabolism.

      Thanks for your comments. According to your suggestions, we have carefully analyzed the genome of strain WC36, and found that there were no CRISPR/Cas9-related genes. Considering our result that the number of chronic phages increased with the prolongation of culture time, we speculated that the host might have no targeted immunity against these chronic phages.

      Reviewer #2 (Recommendations For The Authors):

      There are some minor grammatical errors and unclear statements (lines 99-100, 107-109, 163, 222, 223, 249-250, 254) which should also be fixed before final publication. 

      Thanks for your valuable comments. We have fixed these minor grammatical errors and unclear statements in the revised manuscript.

      Lines 99-100: we have modified this description as “For instance, AMGs of marine bacteriophages have been predicted to be involved in photosynthesis (Mann et al., 2003), nitrogen cycling (Ahlgren et al., 2019; Gazitúa et al., 2021), sulfur cycling (Anantharaman et al., 2014; Roux et al., 2016), phosphorus cycling (Zeng and Chisholm, 2012), nucleotide metabolism (Sullivan et al., 2005; Dwivedi et al., 2013; Enav et al., 2014), and almost all central carbon metabolisms in host cells (Hurwitz et al., 2013).” in the revised manuscript (Lines 100-105).

      Lines 107-109: we have modified this description as “However, due to the vast majority of deep-sea microbes cannot be cultivated in the laboratory, most bacteriophages could not be isolated.” in the revised manuscript (Lines 110-111).

      Line 163: we have modified this description as “Based on the growth curve of strain WC36, we found that the growth rate of strictly anaerobic strain WC36 was relatively slow.” in the revised manuscript (Lines 149-151).

      Lines 222-223: we have modified this description as “Regardless of whether the laminarin was present, the bacterial cells kept their cell shape intact, indicating they were still healthy after 30 days” in the revised manuscript (Lines 195-197).

      Lines 249-250: we have modified this description as “However, the entry and exit of the hexagonal phages into the WC36 cells were not observed.” in the revised manuscript (Lines 190-191).

      Line 254: we have modified this description as “To explore whether the production of bacteriophages induced by polysaccharide is an individual case, we further checked the effect of polysaccharides on another cultured deep-sea Lentisphaerae strain zth2.” in the revised manuscript (Lines 213-215).

    1. Author response:

      The following is the authors’ response to the original reviews.

      Reviewer #1 (Public Review):

      Galanti et al. present an innovative new method to determine the susceptibility of large collections of plant accessions towards infestations by herbivores and pathogens. This work resulted from an unplanned infestation of plants in a greenhouse that was later harvested for sequencing. When these plants were extracted for DNA, associated pest DNA was extracted and sequenced as well. In a standard analysis, all sequencing reads would be mapped to the plant reference genome and unmapped reads, most likely originating from 'exogenous' pest DNA, would be discarded. Here, the authors argue that these unmapped reads contain valuable information and can be used to quantify plant infestation loads.

      For the present manuscript, the authors re-analysed a published dataset of 207 sequenced accessions of Thlaspi arvense. In this data, 0.5% of all reads had been classified as exogenous reads, while 99.5% mapped to the T. arvense reference genome. In a first step, however, the authors repeated read mapping against other reference genomes of potential pest species and found that a substantial fraction of 'ambiguous' reads mapped to at least one such species. Removing these reads improved the results of downstream GWAs, and is in itself an interesting tool that should be adopted more widely.

      The exogenous reads were primarily mapped to the genomes of the aphid Myzus persicae and the powdery mildew Erysiphe cruciferarum, from which the authors concluded that these were the likely pests present in their greenhouse. The authors then used these mapped pest read counts as an approximate measure of infestation load and performed GWA studies to identify plant gene regions across the T. arvense accessions that were associated with higher or lower pest read counts. In principle, this is an exciting approach that extracts useful information from 'junk' reads that are usually discarded. The results seem to support the authors' arguments, with relatively high heritabilities of pest read counts among T. arvense accessions, and GWA peaks close to known defence genes. Nonetheless, I do feel that more validation would be needed to support these conclusions, and given the radical novelty of this approach, additional experiments should be performed.

      A weakness of this study is that no actual aphid or mildew infestations of plants were recorded by the authors. They only mention that they anecdotally observed differences in infestations among accessions. As systematic quantification is no longer possible in retrospect, a smaller experiment could be performed in which a few accessions are infested with different quantities of aphids and/or mildew, followed by sequencing and pest read mapping. Such an approach would have the added benefit of allowing causally linking pest read count and pest load, thereby going beyond correlational associations.

      On a technical note, it seems feasible that mildew-infested leaves would have been selected for extraction, but it is harder to explain how aphid DNA would have been extracted alongside plant DNA. Presumably, all leaves would have been cleaned of live aphids before they were placed in extraction tubes. What then is the origin of aphid DNA in these samples? Are these trace amounts from aphid saliva and faeces/honeydew that were left on the leaves? If this is the case, I would expect there to be substantially more mildew DNA than aphid DNA, yet the absolute read counts for aphids are actually higher. Presumably read counts should only be used as a relative metric within a pest organism, but this unexpected result nonetheless raises questions about what these read counts reflect. Again, having experimental data from different aphid densities would make these results more convincing.

      We agree with the reviewer that additional aphid counts at the time of (or prior to) sequencing would have been ideal, but unfortunately we do not have these data. However, compared to such counts one strength of our sequencing-based approach is that it (presumably) integrates over longer periods than a single observation (e.g. if aphid abundances fluctuated, or winged aphids visited leaves only temporarily), and that it can detect pathogens even when invisible to our eyes, e.g. before a mildew colony becomes visible. Moreover, the key point of our study is that we can detect variation in pest abundance even in the absence of count data, which are really time consuming to collect.

      Conducting a new experiment, with controlled aphid infestations and continuous monitoring of their abundances, to test for correlation between pest abundance and the number of detected reads would require resequencing at least 30-50% of the collection for the results to be reliable. It would be a major experimental study in itself.

      Regarding the origin of aphid reads and the differences in read-counts between e.g. aphids and mildew, we believe this should not be of concern. DNA contamination is very common in all kinds of samples, but these reads are simply discarded in other studies. For example, although we collected and handled samples using gloves, MG-RAST detected human reads (Hominidae, S2 Table), possibly from handling the plants during transplanting or phenotyping 1-2 weeks before sequencing. Therefore, although we did remove aphids from the leaves at collection, aphid saliva or temporary presence on leaves must have been enough to leave detectable DNA traces. Additionally, the fact that the M. persicae load strongly correlates with the Buchnera aphidicola load (R² = 0.86, S6 Table) is reassuring. This obligate aphid symbiont is expected to be found in high amounts when sequencing aphids (see e.g. The International Aphid Genomics Consortium (2010)).

      The higher amount of aphid compared to mildew reads can probably be explained by aphids having expanded more than mildew at the time of plant collection, but most importantly, as already mentioned by the reviewer, the read-counts were meant to compare plant accessions rather than pests to one another. We are interested in relative, not absolute, values. Comparisons between pest species are a challenge because they can be influenced by several factors such as the availability of sequences in the MG-RAST database and the DNA extraction kit used, which is plant-specific and might bias towards certain groups. None of these potential biases are a concern when comparing different plants, as all plants are equally subject to them.

      Reviewer #2 (Public Review):

      Summary:

      Galanti et al investigate genetic variation in plant pest resistance using non-target reads from whole-genome sequencing of 207 field lines spontaneously colonized by aphids and mildew. They calculate significant differences in pest DNA load between populations and lines, with heritability and correlation with climate and glucosinolate content. By genome-wide association analyses they identify known defence genes and novel regions potentially associated with pest load variation. Additionally, they suggest that differential methylation at transposons and some genes are involved in responses to pathogen pressure. The authors present in this study the potential of leveraging non-target sequencing reads to estimate plant biotic interactions, in general for GWAS, and provide insights into the defence mechanisms of Thlaspi arvense.

      Strengths:

      The authors ask an interesting and important question. Overall, I found the manuscript very well-written, with a very concrete and clear question, a well-structured experimental design, and clear differences from previous work. Their important results could potentially have implications and utility for many systems in phenotype-genotype prediction. In particular, I think the use of unmapped reads for GWAS is intriguing.

      Thank you for appreciating the originality and potential of our work.

      Weaknesses:

      I found that several of the conclusions are incomplete or not well supported by the data, and/or some methods/results require additional details before they can be judged. I believe these analyses and/or additional clarifications should be considered.

      Thank you very much for the supportive and constructive comments. They helped us to improve the manuscript.

      Recommendations for the authors:

      Reviewing Editor (Recommendations For The Authors):

      The authors address an interesting and significant question, with a well-written manuscript that outlines a clear experimental design and distinguishes itself from previous work. However, some conclusions seem incomplete, lacking sufficient support from the data, or requiring additional methodological details for proper evaluation. Addressing these limitations through additional analyses or clarifications is recommended.

      Reviewer #2 (Recommendations For The Authors):

      Major comments:

      - So far it is not clear to me how read numbers were normalised and quantified. For instance, Figure 1C only reports raw read numbers. In L149: "Prior to these analyses, to avoid biases caused by different sequencing depths, we corrected the read counts for the total numbers of deduplicated reads in each library and used the residuals as unbiased estimates of aphid, mildew and microbe loads". Was library size considered? Is the load the ratio between exogenous and non-exogenous reads? It is described in L461, but according to this, read counts were normalised and duplicated reads were removed. Also, why were read counts used, as opposed to total coverage or the count of bases per base? I cannot follow how variation in sequencing quality was considered. I can imagine that samples with higher sequencing depth will tend to have more exogenous reads (simply because of higher resolution and power to detect something present in a lower proportion).

      Correcting for sequencing depth/library size is indeed very important. As the reviewer noted, we had explained how we did this in the methods section (L464), and we now also point to it in the results (L151):

      “Finally, we log transformed all read counts to approximate normality, and corrected for the total number of deduplicated reads by extracting residuals from the following linear model, log(read_count + 1) ∼ log(deduplicated_reads), which allowed us to quantify non-Thlaspi loads, correcting for the sequencing depth of each sample.”
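
      The correction quoted above can be illustrated with a minimal sketch. It is not the code used in the study: the function names and the toy read counts are hypothetical, and a plain least-squares fit stands in for whatever linear-model routine was actually used. It fits log(read_count + 1) against log(deduplicated_reads) and returns the residuals as depth-corrected loads.

```python
import math

def ols_residuals(x, y):
    """Residuals of the simple linear regression y ~ intercept + slope * x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]

def depth_corrected_loads(read_counts, deduplicated_reads):
    """Residuals of log(read_count + 1) ~ log(deduplicated_reads)."""
    x = [math.log(d) for d in deduplicated_reads]
    y = [math.log(c + 1) for c in read_counts]
    return ols_residuals(x, y)

# Hypothetical per-sample values, for illustration only.
pest_reads = [120, 15, 900, 40, 310]
dedup_reads = [2.1e7, 1.8e7, 3.0e7, 2.5e7, 2.2e7]
loads = depth_corrected_loads(pest_reads, dedup_reads)
```

      Because the model includes an intercept, the residuals sum to zero, so a positive load means more pest reads than expected for that sample's sequencing depth and a negative load means fewer.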

      We showed the uncorrected read-counts only in Fig 1 to illustrate the orders of magnitude but used the corrected read-counts (also referred to as “loads”) for all subsequent analyses.

      In our view, the theoretically best metric to correct the number of reads of a specific contaminant organism is the total number of DNA fragments captured. Importantly, this is not well reflected by the total number of raw reads because of PCR and optical duplicates occurring during library prep and sequencing. For this reason we estimated the total number of reads captured by multiplying total raw reads (after trimming) by the deduplication rate obtained from FastQC (methods L409-411). This metric reflects the amount of DNA fragments sampled better than the raw reads do. It also better reflects MG-RAST metrics, as this software also deduplicates reads (Author response image 1 below). We also removed duplicates in our strict mappings to the M. persicae and B. aphidicola genomes.

      Coverage is not a good option for correction, because it is defined for a specific reference genome and many of the read-counts output by MG-RAST do not have a corresponding full assembly. Moreover, coverage and base counts are influenced by read size, which depends on library prep and is not included in the read-counts produced by MG-RAST.

      Author response image 1.

      Linear correlations between the number of MG-RAST reads post-QC and either total (left) or deduplicated (right) reads from fastq files of four full samples (not only unmapped reads).

      - The general assumption is that plants with different origins will have genetic variants or epigenetic variations associated with pathogen resistance, which can be tracked in a GWAS. However, plants from different regions will also have all variants associated with their origin (identity by state as presented in the manuscript). In line 169: "Having established that our method most likely captured variation in plant resistance, we were interested in the ecological drivers of this variation". It is not clear to me how variation in plant resistance is differentiated from geographical variation (population structure). In L203: "We corrected for population structure using an IBS matrix and only tested variants with Minor Allele Frequency (MAF) > 0.04 (see Methods).". However, if resistant variants are correlated with population structure as shown in Table 1, how are they differentiated? In my opinion, the analyses are strongly limited by the correlation between phenotype and population structure.

      The association of any given trait with population structure is surely a very important aspect in GWAS studies and when looking at correlations of traits with environmental variables. If a trait is strongly associated with population structure, then disentangling variants associated with population structure vs. the ones associated with the trait can indeed be challenging, a good example being flowering time in A. thaliana (e.g. Brachi et al. 2013).

      In our case, although the pest and microbiome loads are associated with population structure to some extent, this association is not very strong. This can be observed for example in Fig. 1C, where there is no clear separation of samples from different regions. This means that we can correct for population structure (in both GWAS and correlations with climatic variables) without removing the signals of association. It is possible that other associations were missed if specific variants were indeed strongly associated with structure, but these would be unreliable within our dataset, so it is prudent to exclude them.

      - Similarly, in L212: "we still found significant GWA peaks for Erysiphales but not for other types of exogenous reads (excluding isolated, unreliable variants) (Figure 3A and S3 Figure)." In a GWA analysis, multiple variants will constitute an association peak (as shown for instance in main Figure 3A) only when the peak is accentuated by linkage disequilibrium around the region under selection (or around the variant explaining phenotypic variation in this case). However, in this case, I suspect there is a strong component of population structure (which still needs to be corroborated as suggested in the previous comment). But if variants are filtered by population structure, the only variants considered are those polymorphic within populations. In this case, I do not think clear peaks are expected, since most of the signal correlated with population has been removed. Under this scenario, I wonder how informative the analyses are.

      As mentioned above, the traits we analyse (aphid and mildew loads) are only partially associated with population structure. This is evident from Fig. 1C (see answer above) but also from the SNP-based heritability (Table 1, last column) which measures indeed the proportion of variance explained by genetic population structure. Although some variance is explained (i.e. the reviewer is correct that there is some association) there is still plenty of leftover variance to be used for GWAS and correlations with environmental variables. The fact that we still find GWAS peaks confirms this, as otherwise they would be lost by the population structure correction included in our mixed model.

      - How were heritability values calculated? Were related individuals filtered out? I suggest adding more detail in both the inference of heritability and the kinship matrix (IBS matrix). Currently missing in methods (for heritability I only found the mention of an R package in the caption of Table 1).

      We somehow missed this in the methods and thank the reviewer for noticing. We now added this paragraph to the chapter “Exogenous reads heritability and species identification”:

      “To test for variation between populations we used a general linear model with population as a predictor. To measure SNP-based heritability, i.e. the proportion of variance explained by kinship, we used the marker_h2() function from the R package heritability (Kruijer and Kooke 2019), which uses a genetic distance matrix as predictor to compute REML-estimates of the genetic and residual variance. We used the same IBS matrix as for GWAS and for the correlations with climatic variables.”

      We also added the reference to the R package heritability to the Table 1 caption.

      - Figure 2C. In line 188: "Although the baseline levels of benzyl glucosinolates were very low and probably sometimes below the detection level, plant lines where benzyl glucosinolate was detected had significantly lower aphid loads (over 70% less reads) in the glasshouse (Figure 3C)". It is not clear to me how to see these values in Figure 2C. From the boxplot, the difference in aphid loads between detected and not detected benzyl seems significantly lower. From the boxplot distribution it is not clear how this difference is statistically significant. It rather seems like a sampling bias (a lot of non-detected vs low detected values). Is the difference still significant when random subsampling of groups is considered?

      Here the “70% less reads” refers to the uncorrected read-counts directly (difference in means between samples where benzyl-GS were detected vs. not). We agree with the reviewer that this is confusing when read alongside Figure 2C, which depicts the corrected M. persicae load (residuals). We therefore removed that information.

      Regarding the significance of the difference, we re-calculated the p value with the Welch's t-test, which accounts for unequal variances, and with a bootstrap t-test. Both tests still found a significant difference. We now report the p value of the Welch’s t-test.
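
      For readers unfamiliar with the two tests, the sketch below shows one common way to compute them. It is an illustration under stated assumptions, not the analysis code of the study: the load values are invented, the function names are hypothetical, and the bootstrap variant shown (recentring both groups on the pooled mean before resampling) is one standard scheme that may differ from the one actually applied.

```python
import math
import random
import statistics

def welch_t(a, b):
    """Welch's t statistic and Welch-Satterthwaite degrees of freedom."""
    va, vb = statistics.variance(a), statistics.variance(b)
    na, nb = len(a), len(b)
    se2 = va / na + vb / nb
    t = (statistics.mean(a) - statistics.mean(b)) / math.sqrt(se2)
    df = se2 ** 2 / ((va / na) ** 2 / (na - 1) + (vb / nb) ** 2 / (nb - 1))
    return t, df

def bootstrap_welch_p(a, b, n_boot=2000, seed=0):
    """Two-sided bootstrap p-value for the Welch t statistic."""
    rng = random.Random(seed)
    t_obs, _ = welch_t(a, b)
    # Impose the null hypothesis: shift both groups to the pooled mean.
    pooled_mean = statistics.mean(list(a) + list(b))
    a0 = [x - statistics.mean(a) + pooled_mean for x in a]
    b0 = [x - statistics.mean(b) + pooled_mean for x in b]
    extreme = 0
    for _ in range(n_boot):
        ra = rng.choices(a0, k=len(a0))
        rb = rng.choices(b0, k=len(b0))
        try:
            t_star, _ = welch_t(ra, rb)
        except ZeroDivisionError:
            # Degenerate resample with zero variance: count it as extreme,
            # which is the conservative choice.
            extreme += 1
            continue
        if abs(t_star) >= abs(t_obs):
            extreme += 1
    return (extreme + 1) / (n_boot + 1)

# Invented corrected aphid loads (residuals), for illustration only.
detected = [-1.9, -1.2, -1.6, -0.8, -1.4, -1.1]
not_detected = [0.3, -0.4, 0.9, 0.1, 0.6, -0.2, 0.4, 0.8, -0.1, 0.5]
t_stat, dof = welch_t(detected, not_detected)
p_boot = bootstrap_welch_p(detected, not_detected)
```

      Neither test assumes equal variances in the two groups, which matters here because the group sizes are unbalanced.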

      - I think the information regarding the read statistics needs to be improved. At the moment some sections are difficult to follow. I found this information mainly in Supplementary Table 1. I could not follow the difference in the manuscript and supplementary materials between read (read count), fragment, ambiguous fragments, target fragments, etc. I didn't find information regarding mean coverage per sample and relative plant vs parasite coverage. This lack of clarity led me to some confusion. For instance, in L207: "We suspected that this might be because some non-Thlaspi reads were very similar to these highly conserved regions and, by mapping there, generated false variants only in samples containing many non-Thlaspi reads". I find it difficult to follow how non-Thlaspi reads will interfere with genotyping. I think the fact that the large peak is lost after filtering reads is already quite insightful. However, in principle I would expect the relative coverage between non-Thlaspi:Thlaspi reads to be rather low in all cases. I would say below 1%. Thus, genotyping should be relatively accurate for the plant variants for the most part. In particular, considering genotyping was done with GATK, where low-frequency variants (relative coverage) should normally be called reference allele for the most part.

      We agree with the reviewer that some clarification of these points is necessary! We modified Supplementary Table 1 to include coverage information for all samples before and after removal of ambiguous reads and explained thoroughly how each value in the table was obtained. Regarding reads and fragments, we define each fragment as having two reads (R1 and R2). The classification into Target, Ambiguous and Unmapped reads was based on fragments, so we used that term in the table, but referring to reads has the same meaning in this context: for example, an unmapped read is a read whose fragment was classified as unmapped.

      We did not include the pest coverage specifically, because this cannot be calculated for any of the read counts obtained with MG-RAST, as this tool maps to online databases where genome size is not necessarily known. What is more meaningful instead are the read counts, which are in Supplementary tables 2 and 6. Importantly, as mentioned in other answers, if different taxa are differently represented in the databases, this does not affect the comparison of read counts across different samples, but only the comparison of different taxa, which was not used for any further analyses.

      Regarding the ambiguous reads causing unreliable variants, these occur only in very few regions of the Thlaspi genome that are highly conserved in evolution or of very low complexity. In these regions, reads generated from either plant or, for instance, aphid DNA can map, but the ones from aphid might contain variants when mapping to the Thlaspi reference genome (L207 and L300). The reviewer is right that there is only a very small difference in average coverage when removing those ambiguous reads (~1X, S1 Table), but that is not true for those few regions where coverage changes massively when removing ambiguous reads, as shown on the right-side Y axes of S2 Figure. These unreliable variants are therefore not low-frequency and are not removed by GATK.

      - L215. I am not very convinced by the enrichment analyses, justified with a reference (52). For instance, how many of the predicted peaks are not close to resistance genes? How was the randomisation done? At the moment, the manuscript reads rather anecdotally by describing only those peaks that effectively are "close" to resistance genes. For instance, if random windows (let's say 20kb windows) are sampled along the genome, how often are there resistance genes in those random windows, and how does the random sampling compare with the observed peaks (windows)?

      Enrichment is by definition an increase in the proportion of true positives (observed frequency: proportion of significant SNPs located close to a priori candidate genes) compared to the background frequency (proportion of all SNPs located close to a priori candidate genes). So the background likelihood of SNPs falling close to a priori candidate genes (i.e. the occurrence of a priori candidate genes in randomly sampled windows, as suggested by the reviewer) is already taken into account as the background frequency. We have now explained more extensively how enrichment is calculated in the relevant methods section (L545-549), but it is a widely used method, established in a large body of literature (e.g. Atwell et al. 2010, Brachi et al. 2010, Kawakatsu et al. 2016, Kerdaffrec et al. 2017, Sasaki et al. 2015, 2019, 2022, Galanti et al. 2022, Contreras-Garrido et al. 2024).
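
      The definition above amounts to a ratio of two proportions, which the following sketch makes explicit. The function name, the p-value cut-off and the toy data are hypothetical and serve only as an illustration; the per-SNP flag for "close to an a priori candidate gene" is assumed to have been computed beforehand.

```python
def enrichment(p_values, is_candidate, alpha):
    """Ratio of observed to background frequency of candidate-linked SNPs.

    p_values     : GWAS p-value of each SNP
    is_candidate : True if the SNP lies close to an a priori candidate gene
    alpha        : p-value cut-off defining significant SNPs
    """
    significant = [c for p, c in zip(p_values, is_candidate) if p < alpha]
    if not significant:
        raise ValueError("no significant SNPs at this threshold")
    background = sum(is_candidate) / len(is_candidate)
    if background == 0:
        raise ValueError("no SNPs close to a priori candidate genes")
    observed = sum(significant) / len(significant)
    return observed / background

# Toy data: 10 SNPs, 3 of them close to candidate genes.
p_vals = [0.5, 1e-8, 0.2, 3e-9, 0.7, 0.04, 1e-7, 0.9, 0.3, 0.6]
near_candidate = [False, True, False, True, False,
                  False, False, True, False, False]
ratio = enrichment(p_vals, near_candidate, alpha=1e-6)
```

      In this toy case two of the three significant SNPs are close to candidates (observed frequency 2/3) against a background of 3/10, giving an enrichment of about 2.2. A value of 1 would mean that significant SNPs fall near candidate genes no more often than SNPs in general.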

      Although we had already calculated an upper bound for the FDR based on the a priori candidates, as in previous literature, we now further calculated the significance of the enrichment for the Bonferroni-corrected -log(p) threshold for Erysiphales. Calculating significance requires adopting a genome rotation scheme that preserves the LD structure of the data, as described in the previously mentioned literature (e.g. Kawakatsu et al. 2016, Sasaki et al. 2022). Briefly, we calculated a null distribution of enrichments by randomly rotating the p values and a priori candidate status of the genetic variants within each chromosome, for 10 million permutations. We then assessed significance by comparing the observed enrichment to the null distribution. We found that the enrichment at the Bonferroni-corrected -log(p) threshold is indeed significant for Erysiphales (p = 0.016). We added this to the relevant methods section and the code to the GitHub page.
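
      A rotation test of this kind can be sketched as follows. This is a simplified illustration, not the code deposited with the study: all names and the toy data are hypothetical, and it assumes that "rotating" means circularly shifting the candidate labels against the p-values by a random offset within each chromosome, which keeps the order of neighbouring variants (and hence local LD structure) intact.

```python
import random

def enrichment(p_values, is_candidate, alpha):
    """Observed / background frequency of candidate-linked SNPs."""
    significant = [c for p, c in zip(p_values, is_candidate) if p < alpha]
    if not significant:
        raise ValueError("no significant SNPs at this threshold")
    background = sum(is_candidate) / len(is_candidate)
    if background == 0:
        raise ValueError("no SNPs close to a priori candidate genes")
    return (sum(significant) / len(significant)) / background

def rotation_test(chromosomes, alpha, n_perm=1000, seed=0):
    """Permutation p-value for enrichment using within-chromosome rotation.

    chromosomes: list of (p_values, is_candidate) pairs, one per chromosome,
                 with variants ordered by genomic position.
    """
    rng = random.Random(seed)
    p_all = [p for p_vals, _ in chromosomes for p in p_vals]
    cand_all = [c for _, cand in chromosomes for c in cand]
    observed = enrichment(p_all, cand_all, alpha)
    at_least_as_large = 0
    for _ in range(n_perm):
        rotated = []
        for _, cand in chromosomes:
            k = rng.randrange(len(cand))          # random circular offset
            rotated.extend(cand[k:] + cand[:k])   # labels keep their order
        if enrichment(p_all, rotated, alpha) >= observed:
            at_least_as_large += 1
    # Add-one correction so the p-value is never exactly zero.
    return observed, (at_least_as_large + 1) / (n_perm + 1)

# Toy data: two chromosomes with 6 and 4 variants.
toy_chromosomes = [
    ([0.5, 1e-8, 3e-9, 0.2, 0.7, 0.4],
     [False, True, True, False, False, False]),
    ([0.9, 0.3, 1e-7, 0.6],
     [False, False, False, True]),
]
observed, p_value = rotation_test(toy_chromosomes, alpha=1e-6, n_perm=200, seed=1)
```

      Unlike shuffling labels independently, rotation keeps clusters of candidate-linked variants together, so the null distribution reflects the fact that neighbouring SNPs are not independent.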

      In addition, many other genes very close (a few kb at most) to significant SNPs were not annotated with the “defense response” GO term but still had functions relatable to it. Some examples are CAR8, involved in ABA signalling, PBL7 in stomata closure and SRF3 in cell wall building and stress response (Fig 3D). This means that our enrichment is most likely underestimated relative to what a more complete functional annotation would yield.

      - L247. Additional information is needed regarding sampling. It is not clear to me why methylation analyses are restricted to 20 samples, contrary to whole genome analyses.

      The sampling is best described in the original paper (on natural DNA methylation variation; Galanti et al. 2022), although the most important parts are repeated in the first chapter of the methods.

      The methylation analyses are not restricted to 20 samples. Only the DMR calling was restricted to the 20 vs. 20 samples with the most divergent values (of pest loads) to identify regions of variation. This analysis was used to subset the genome to potential regions associated with pest presence rather than to thoroughly test actual methylation variants associated with pest presence. The latter was done in the second step, EWAS, which was based on the whole dataset with the exclusion of samples with a high non-conversion rate. This left 188 samples for EWAS. We added this number in the new manuscript (L251 and L571).
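
      The selection of the two extreme groups for DMR calling can be pictured with a short sketch. The sample IDs, load values and function name are hypothetical; only the idea of taking the n samples with the lowest and the n with the highest loads is taken from the text above.

```python
def extreme_groups(loads, n=20):
    """Return (lowest, highest): the n sample IDs at each end of the ranking.

    loads: dict mapping sample ID to its corrected pest load.
    """
    if 2 * n > len(loads):
        raise ValueError("not enough samples for two non-overlapping groups")
    ranked = sorted(loads, key=loads.get)  # ascending by load
    return ranked[:n], ranked[-n:]

# Toy example with 10 samples and groups of 3.
toy_loads = {f"s{i}": i * 0.5 - 2.0 for i in range(10)}
low_group, high_group = extreme_groups(toy_loads, n=3)
```

      The two groups are then contrasted to find candidate regions, while the association test itself (EWAS) uses all retained samples.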

      To clarify, we made a few additions to the results (L250) and methods (last two subchapters) sections, where we explain the above.

      - No clear association with TEs: in L364: "Erysiphales load was associated with hypomethylated Copia TEs upstream of MAPKKK20, a gene involved in ABA-mediated signaling and stomatal closure. Since stomatal closure is a known defense mechanism to block pathogen access (21), it is tempting to conclude that hypomethylation of the MAPKKK20 promoter might induce its overexpression and consequent stomatal closure, thereby preventing mildew access to the leaf blade. Overall, we found associations between pathogen load and TE methylation that could act both in cis (eg. Copia TE methylation in MAPKKK20 promoter) and in trans, possibly through transposon reactivation (eg. LINE, Helitron, and Ty3/Gypsi TEs isolated from genes)." I find the whole discussion related to transposable elements, first, rather anecdotal, and second, very speculative. To claim: "Overall, we found associations between pathogen load and TE methylation", I believe a more detailed analysis is needed. For instance, how often is there an association? In general, there are some rather anecdotal examples, several of which are presented as associated with pathogen load on the basis of being "in proximity" to a particular region/peak. The same regions contain multiple other genes and annotations, but the authors limit the discussion to the particular gene or TE concordant with the hypothesis. This applies to both the discussion and results sections.

      Here we are referring to associations in a purely statistical sense. The fact that “Overall, we found associations between pathogen load and TE methylation” is simply a conclusion drawn from Fig. 4b, without implying any causality. Some methylation variants are statistically associated with the traits (aphid or mildew loads), and whether they are true positives or causal is of course more difficult to assess.

      Regarding the methylation variants associated with mildew load in proximity of MAPKKK20, those are the only two significant ones, located close to each other and close to many other variants that, although not significant, have low P-values (Author response image 2 below), so it is the most obvious association warranting further exploration. The reviewer is correct that there are other genes flanking the large DMR that covers the TEs (Fig. 4D), but the DMR is downstream of these genes, so less likely to affect their transcription.

      Author response image 2.

      Regarding all other associations found with M. persicae load, we stated that these are not really reliable due to a skewed P-value distribution (L269, S5B Fig), but we think that for future reference it is still worth reporting the nearby genes and TEs.

      We slightly changed the wording of the passage the reviewer cites above to make it clearer that we are only offering potential explanations for the associations we observe with TE methylation, and we by no means claim that TE reactivation is definitely what is happening.

      - One conclusion in the manuscript is that DMRs have been mostly the result of hypomethylation. This is shown for instance in supplementary Figure 4. However, no general statistic is shown of methylation distribution (not only restricted to DMRs). Was the ratio methylation over de-methylation proportional along the genome? Thus the finding in DMRs is out of the genome-wide distribution? Or on the contrary, the DMRs are just a random sampling of the global distribution. The same for different annotated regions. For instance, I would expect that in general coding regions would be less methylated (not restricted to DMRs).

      Complete and exhaustive analyses of the methylomes were already published in the original manuscript (Galanti et al. 2022). However, the variation among these methylomes is complex and influenced by multiple factors, including genetic background and environment of origin, and discussing these factors would have been beyond the scope of our paper. In this paper, we simply took advantage of the existing methylome information to identify the few genomic regions that are consistently differentially methylated between samples with extreme values of pest loads. As for the GWAS, the phenotypes are only partially associated with population structure, so the 20 samples with the lowest and the 20 with the highest pathogen loads are not, e.g., all Swedish vs. all German but a mixture, which allowed us to correct for population structure by running EWAS with a mixed model that includes a genetic distance matrix.

      In this study we called DMRs between two defined groups: samples with the lowest amounts of pathogen DNA (not-infected; the “control” group) vs. samples with the highest amounts of pathogens (infected or the “treatment” group), so we could define a directionality (“hyper” vs. “hypo” methylation). However, this is not the case for population DMRs called between many different combinations of populations. This is why the hyper- and hypomethylated regions found here cannot be compared to the genome-wide averages, which are influenced by factors other than the pathogens. Even with relaxed thresholds, we indeed found very few DMRs associated with pathogen presence here.

      Specifically about coding regions, the reviewer is correct that they are less methylated, especially because T. arvense has largely lost gene body methylation (Nunn et al. 2021, Galanti et al. 2022), but this is unrelated and was discussed in the original publication (Galanti et al. 2022).

      Minor comments:

      - Figure 1B: it would be good to add also percentage values.

      As the figure is already tightly packed, we would rather keep it simple. The chart gives a good impression of the frequencies of the different kingdoms and of several relevant groups. Also, as explained in a previous answer, comparing different taxonomic groups could be imprecise (as opposed to comparing the same group between different samples), so exact percentages seem unnecessary. If needed, the exact percentages can still be calculated from S2 Table.

      - L159: It is not clear to me what "enemy variation" is referring to here.

      We are referring to variation in enemy densities (attack rates) in the field, which could potentially be carried over to the greenhouse and cause the patterns of infection we observed. We changed it to “variation in enemy densities” to make it clearer.

      - L259: "In accordance with previous studies (8,9), most DMRs were hypomethylated in the affected samples, indicating that genes needed for defense might be activated through demethylation". Not clear to me what "affected samples" is referring to. Samples with lower load?

      Affected samples have a higher load of pathogen reads. We changed it to “infested” to make it clearer.

      - L336. Figure should be Fig 3E.

      We fixed it, thanks for noticing.

      ADDITIONAL CHANGES

      We updated reference 43 to point to the published paper rather than the preprint.

      We corrected the phenotype names in S3 Fig to make them consistent with the rest of the manuscript and increased the font size on the axes to improve readability.